AMDA dataset retrieval and data format conversion tool written in Python

AMDA data generation module

This is a Python module for navigating available data files, retrieving them, and converting them to a format ready to be installed in AMDA. AMDA stores datasets in the NetCDF file format, so installing a new dataset often means downloading the available files in their native format and converting them to NetCDF. This module implements a collection of useful components for performing these operations:

  • amda_data_generator.navigate.htmlindex.HTMLIndex: parses an HTML index page and provides an iterator for navigating through it. The user can restrict the results to files with a given extension.
  • amda_data_generator.get.URLGetter: fetches the data at a given URL and returns it as a FileType object that can be passed to a Converter.
  • amda_data_generator.convert.Converter: base class tasked with converting an arbitrary data file to NetCDF. It can be initialized with a mapping describing transformations of the original variables (renaming, converting time to double, etc.). Subclass it to define your own converters.
    • amda_data_generator.convert.cdf.CDF2NetCDF : converts a CDF file to NetCDF

Installing

Create a virtual environment, activate it, and install the requirements and the module:

$ python3 -m venv venv
$ . venv/bin/activate
$ python -m pip install -r requirements.txt
$ python -m pip install .

Make sure to run these commands from the folder that contains the amda_data_generator module.

Usage

HTML index iterator

Iterating over WIND 3DP instrument files for 1994 available at https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/ is done like so:

>>> from amda_data_generator.navigate.htmlindex import HTMLIndex
>>> url = "https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/"
>>> for f in HTMLIndex(url).iter():
...     print(f)
HTMLIndexItem (path:https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/SHA1SUM, last_modified:2013-09-02 11:16:00)
...
HTMLIndexItem (path:https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/wi_elpd_3dp_19941231_v02.cdf, last_modified:2010-03-29 04:39:00)

HTMLIndex.iter yields HTMLIndexItem objects, each holding information parsed from the index page, such as the file path and the date of the latest modification. It supports the following optional arguments:

  • recursive (default: False): recursively traverse the index and return all files stored under the given URL.
  • extension (string, default: None): only return items with the given extension. This simply checks that the filename ends with the provided string. By default, iter returns all files.
  • modified_since (datetime, default: None): only return files that have been modified since the provided date.
  • name_regex (compiled regex pattern, default: None): only return files whose names match the given regex pattern.
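
These filters can be combined. As a rough illustration of how they narrow the result set, the sketch below re-implements the filename and date checks on plain values. It is not the module's code: keep_item is a hypothetical helper, and it assumes that name_regex is applied with search() and that modified_since is inclusive. The file names and dates are the two items shown in the output above.

```python
import re
from datetime import datetime


def keep_item(name, last_modified, extension=None,
              modified_since=None, name_regex=None):
    """Return True if a file passes every filter that was provided."""
    # extension: plain suffix test on the filename.
    if extension is not None and not name.endswith(extension):
        return False
    # modified_since: assumed inclusive (>=).
    if modified_since is not None and last_modified < modified_since:
        return False
    # name_regex: assumed to be applied with search(), not fullmatch().
    if name_regex is not None and not name_regex.search(name):
        return False
    return True


files = [
    ("SHA1SUM", datetime(2013, 9, 2, 11, 16)),
    ("wi_elpd_3dp_19941231_v02.cdf", datetime(2010, 3, 29, 4, 39)),
]
pattern = re.compile(r"_1994\d{4}_")
kept = [name for name, mtime in files
        if keep_item(name, mtime, extension=".cdf", name_regex=pattern)]
print(kept)  # ['wi_elpd_3dp_19941231_v02.cdf']
```

With the module itself, the equivalent selection would presumably be HTMLIndex(url).iter(extension=".cdf", name_regex=pattern).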

Downloading from URL

amda_data_generator.get.URLGetter exposes a get method that accepts an IndexItem object (the base class of HTMLIndexItem) and returns the contents as an amda_data_generator.get.DatafileContainer.

>>> from amda_data_generator.get import URLGetter
>>> downloader = URLGetter()
>>> item = next(HTMLIndex(url).iter())  # first item of the index listed above
>>> f = downloader.get(item)
>>> print(f)
<amda_data_generator.get.DatafileContainer object at 0x7f53176f0070>
>>> # get_data() returns a file-like object.
>>> # Note that this file is NOT a CDF file; it is the SHA1SUM file.
>>> f.get_data().read()
b'# rehashed on 2013-09-02 15:16:10\n
...
wi_elpd_3dp_19941231_v02.cdf\n

Converting to NetCDF

Datafile converters extend the base amda_data_generator.convert.Converter class. It exposes a convert method that accepts a file-like object and an output filename, and saves the converted data to that file. For example, CDF files can be converted to NetCDF with the amda_data_generator.convert.cdf.CDF2NetCDF class.

⚡ CDF_LIB environment variable: CDF files are read with the spacepy module, which requires the path to the CDF C library to be set in the CDF_LIB environment variable.

$ export CDF_LIB=<path_to_cdflib>
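
A forgotten or mistyped CDF_LIB tends to surface only when the first CDF file is opened. The sketch below shows one way to check the variable up front from Python. check_cdf_lib is a hypothetical helper, not part of the module, and it assumes that CDF_LIB names the directory holding the library.

```python
import os


def check_cdf_lib(environ=os.environ):
    """Return the CDF_LIB path, or raise a clear error if it is unusable."""
    path = environ.get("CDF_LIB")
    if not path:
        raise RuntimeError(
            "CDF_LIB is not set; export it before converting CDF files")
    # Assumption: CDF_LIB points to a directory, not to the library file.
    if not os.path.isdir(path):
        raise RuntimeError(f"CDF_LIB is not a directory: {path}")
    return path
```

Calling check_cdf_lib() before creating a converter turns a late import failure into an early, readable error.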

In the following example we want to create a NetCDF file containing time and magnetic field data. The time variable in the CDF file is called Epoch and the magnetic field MAGF.

First define the mapping dictionary. The current implementation requires that the time values be stored as Time (AMDA requires this variable). All variables that are not present in the mapping and that depend on the Time variable are added under their original names.

>>> mapping = {"Time": "Epoch", "Mag": "MAGF"}
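
To make the direction of the mapping explicit (keys are the new names, values are the original CDF names), here is a small sketch of the renaming step on plain dictionaries. apply_mapping is a hypothetical helper and the variable values are made up for illustration. The real converter also checks that unmapped variables depend on Time, which this sketch leaves out.

```python
def apply_mapping(variables, mapping):
    """Rename variables according to a {new_name: original_name} mapping.

    Variables that the mapping does not mention keep their original name.
    """
    if "Time" not in mapping:
        raise ValueError('the mapping must define a "Time" entry')
    # Mapped variables come first, under their new names.
    renamed = {new: variables[old] for new, old in mapping.items()}
    mapped_originals = set(mapping.values())
    # Everything else is copied over unchanged.
    for name, values in variables.items():
        if name not in mapped_originals:
            renamed[name] = values
    return renamed


# Illustrative values only; a real CDF file holds arrays.
cdf_variables = {
    "Epoch": [0.0, 1.0],
    "MAGF": [[1, 2, 3], [4, 5, 6]],
    "EDENS": [7.5, 8.1],
}
mapping = {"Time": "Epoch", "Mag": "MAGF"}
print(list(apply_mapping(cdf_variables, mapping)))  # ['Time', 'Mag', 'EDENS']
```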

Make sure the HTMLIndexItem represents a CDF file.

>>> item = next(HTMLIndex(url).iter(extension=".cdf"))

Now convert the file:

>>> from amda_data_generator.convert.cdf import CDF2NetCDF
>>> converter = CDF2NetCDF(mapping=mapping)
>>> converter.convert(downloader.get(item), output_filename="output.nc")

This creates a new output.nc file, whose contents can be checked with the ncdump utility:

$ ncdump -h output.nc

netcdf output {
dimensions:
        Time = UNLIMITED ; // (45 currently)
        dim8 = 8 ;
        dim15 = 15 ;
        dim5 = 5 ;
        dim32 = 32 ;
        dim3 = 3 ;
        TimeLength = 17 ;
variables:
        double Time(Time) ;
        float Mag(Time, dim3) ;
        double TIME(Time) ;
        float FLUX(Time, dim8, dim15) ;
        float ENERGY(Time, dim15) ;
        float PANGLE(Time, dim8) ;
        float INTEG_T(Time) ;
        float EDENS(Time) ;
        float TEMP(Time, dim5) ;
        float QP(Time) ;
        float QM(Time) ;
        float QT(Time) ;
        float REDF(Time, dim32) ;
        float VSW(Time, dim3) ;
        char StartTime(TimeLength) ;
        char StopTime(TimeLength) ;
}

The Epoch and MAGF variables have been renamed to Time and Mag; all other variables keep their original names.
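
The three steps (iterate, download, convert) chain naturally into a batch conversion. The following is a minimal sketch of such a loop, not something shipped with the module. convert_all and output_name are hypothetical helpers, and the code assumes that each index item exposes its URL as item.path, as the printed HTMLIndexItem suggests.

```python
import posixpath


def output_name(path):
    """Derive a NetCDF filename from a file URL (hypothetical helper)."""
    stem, _ = posixpath.splitext(posixpath.basename(path))
    return stem + ".nc"


def convert_all(items, getter, converter):
    """Download and convert every item, returning the output filenames."""
    outputs = []
    for item in items:
        # Assumption: each item exposes its URL as `item.path`.
        target = output_name(item.path)
        converter.convert(getter.get(item), output_filename=target)
        outputs.append(target)
    return outputs


# With the module, under the assumptions above:
#   convert_all(HTMLIndex(url).iter(extension=".cdf"),
#               URLGetter(), CDF2NetCDF(mapping=mapping))
```

Passing the getter and converter in as arguments keeps the loop independent of the file format, so the same function would work with any other Converter subclass.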