AMDA data generation module
This is a Python module for navigating available data files, retrieving them, and converting them to a format ready to be installed in AMDA. AMDA stores datasets in the NetCDF file format, so installing a new dataset often means downloading the available files in their native format and converting them to NetCDF. This module implements a collection of useful classes for performing these operations:
amda_data_generator.navigate.htmlindex.HTMLIndex
: parses an HTML index page and provides an iterator for navigating through it. Lets the user restrict the results to files with a given extension.
amda_data_generator.get.URLGetter
: fetches the data at a given URL and returns it as a FileType object that can be passed to a Converter.
amda_data_generator.convert.Converter
: base class tasked with converting an arbitrary data file to NetCDF. Can be initialized with a mapping describing transformations of the original variables (renaming, converting time to double, etc.). Subclass this object to define your own converters.
amda_data_generator.convert.cdf.CDF2NetCDF
: converts a CDF file to NetCDF.
Installing
Create a virtual environment, activate it, and install the requirements:
$ python3 -m venv venv
$ . venv/bin/activate
$ python -m pip install -r requirements.txt
$ python -m pip install .
Make sure to run these commands from the folder that contains the amda_data_generator module.
Usage
HTML index iterator
Iterating over WIND 3DP instrument files for 1994 available at https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/ is done like so:
>>> from amda_data_generator.navigate.htmlindex import HTMLIndex
>>> url = "https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/"
>>> for f in HTMLIndex(url).iter():
...     print(f)
HTMLIndexItem (path:https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/SHA1SUM, last_modified:2013-09-02 11:16:00)
...
HTMLIndexItem (path:https://cdaweb.gsfc.nasa.gov/pub/data/wind/3dp/3dp_elpd/1994/wi_elpd_3dp_19941231_v02.cdf, last_modified:2010-03-29 04:39:00)
HTMLIndex.iter yields HTMLIndexItem objects that contain information parsed from the index page, such as the file path and the date of last modification. It supports the following optional arguments:
recursive (default: False): recursively traverse the index and return all files stored under the given URL.
extension (string, default: None): only return items with the given extension. This simply checks that the filename ends with the provided string. By default iter returns all files.
modified_since (datetime, default: None): only return files that have been modified since the provided date.
name_regex (compiled regex pattern, default: None): only return files whose names match the given regex pattern.
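To make the filter semantics concrete, here is a minimal standalone sketch that does not use the module. It is not the module's implementation: keep_item is a hypothetical helper, and combining the filters with a logical AND, treating the date comparison as inclusive, and applying the pattern with search() are all assumptions.

```python
import re
from datetime import datetime


def keep_item(name, last_modified, extension=None, modified_since=None, name_regex=None):
    """Return True if an index entry passes every filter that is set (hypothetical helper)."""
    # extension: plain suffix check on the file name, as described above
    if extension is not None and not name.endswith(extension):
        return False
    # modified_since: assumed inclusive, i.e. modified on or after the given date
    if modified_since is not None and last_modified < modified_since:
        return False
    # name_regex: assumed to be applied with search(); the module may use match()
    if name_regex is not None and name_regex.search(name) is None:
        return False
    return True


# The two entries shown in the index listing above.
entries = [
    ("SHA1SUM", datetime(2013, 9, 2, 11, 16)),
    ("wi_elpd_3dp_19941231_v02.cdf", datetime(2010, 3, 29, 4, 39)),
]

selected = [name for name, mtime in entries if keep_item(name, mtime, extension=".cdf")]
print(selected)
```

With the real module, the same selection would be requested by passing extension=".cdf" to HTMLIndex.iter.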
Downloading from URL
amda_data_generator.get.URLGetter exposes a get method that accepts an IndexItem object (the base class of HTMLIndexItem) and returns the contents as an amda_data_generator.get.DataFileContainer.
>>> from amda_data_generator.get import URLGetter
>>> item = next(HTMLIndex(url).iter())  # first entry of the index above
>>> downloader = URLGetter()
>>> f = downloader.get(item)
>>> print(f)
<amda_data_generator.get.DatafileContainer object at 0x7f53176f0070>
>>> # get_data() returns a file-like object.
>>> # Note that this file is NOT a CDF file; it is the SHA1SUM file.
>>> f.get_data().read()
b'# rehashed on 2013-09-02 15:16:10\n
...
wi_elpd_3dp_19941231_v02.cdf\n
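Since get_data() returns a file-like object, its contents can be streamed to disk instead of being read into memory in one go. The sketch below is one possible way to do this using only the standard library. save_stream is a hypothetical helper, and an io.BytesIO object stands in for the downloaded data so that the example runs without network access.

```python
import io
import shutil
import tempfile
from pathlib import Path


def save_stream(stream, destination):
    """Copy a binary file-like object to `destination` and return the number of bytes written."""
    destination = Path(destination)
    with destination.open("wb") as out:
        # copyfileobj reads and writes in chunks, so large files are never fully in memory
        shutil.copyfileobj(stream, out)
    return destination.stat().st_size


# Stand-in for f.get_data(); with the module you would pass f.get_data() instead.
fake_download = io.BytesIO(b"# rehashed on 2013-09-02 15:16:10\n")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "SHA1SUM"
    size = save_stream(fake_download, target)
    saved = target.read_bytes()

print(size, "bytes written")
```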
Converting to NetCDF
Datafile converters extend the base amda_data_generator.convert.Converter class, which exposes a convert method that accepts a file-like object and an output filename and saves the converted data to that file. For example, CDF files can be converted to NetCDF with the amda_data_generator.convert.cdf.CDF2NetCDF class.
CDF_LIB environment variable: reading CDF files is done with the spacepy module, which requires the path to the CDF C library to be set in the CDF_LIB environment variable.
$ export CDF_LIB=<path_to_cdflib>
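A missing or wrong CDF_LIB value is easier to diagnose when it is checked before any conversion is attempted. The following is a minimal sketch of such a check. check_cdf_lib is a hypothetical helper that is not part of the module, and it assumes that CDF_LIB should point to a directory.

```python
import os
from pathlib import Path


def check_cdf_lib(environ=os.environ):
    """Return the CDF_LIB directory as a Path, or raise a descriptive error (hypothetical helper)."""
    value = environ.get("CDF_LIB")
    if not value:
        raise RuntimeError("CDF_LIB is not set; export it before converting CDF files")
    path = Path(value)
    # Assumption: CDF_LIB names the directory that contains the CDF C library
    if not path.is_dir():
        raise RuntimeError(f"CDF_LIB points to {value!r}, which is not a directory")
    return path


try:
    print("CDF library directory:", check_cdf_lib())
except RuntimeError as err:
    print("Configuration problem:", err)
```

Running the check early reports a configuration problem in plain words rather than as a failure deeper inside the conversion step.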
In the following example we want to create a NetCDF file containing time and magnetic field data. In the CDF file, the time variable is called Epoch and the magnetic field variable is called MAGF.
First define the mapping dictionary. The current implementation requires the time values to be stored as Time (AMDA requires this variable). All variables that are not present in the mapping and that depend on the Time variable are added under their original names.
>>> mapping = {"Time": "Epoch", "Mag": "MAGF"}
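The naming rule described above can be pictured with plain dictionaries. The sketch below is an illustration only, not the converter's code: resolve_names is a hypothetical helper, the dependency tuples are a simplified stand-in for the CDF metadata, and LABEL is a made-up variable that does not depend on time.

```python
def resolve_names(mapping, source_vars):
    """Map output variable names to source variable names (hypothetical helper).

    mapping:     {output_name: source_name}, must contain the "Time" key
    source_vars: {source_name: tuple of variables it depends on}
    """
    if "Time" not in mapping:
        raise ValueError('the mapping must define "Time"')

    resolved = dict(mapping)
    mapped_sources = set(mapping.values())
    time_source = mapping["Time"]

    for name, depends_on in source_vars.items():
        if name in mapped_sources:
            continue  # already renamed through the mapping
        if time_source in depends_on:
            resolved[name] = name  # time-dependent: kept under its original name
    return resolved


mapping = {"Time": "Epoch", "Mag": "MAGF"}
source_vars = {
    "Epoch": (),
    "MAGF": ("Epoch",),
    "FLUX": ("Epoch",),
    "EDENS": ("Epoch",),
    "LABEL": (),  # hypothetical variable with no time dependency, so it is dropped
}

for output_name, source_name in resolve_names(mapping, source_vars).items():
    print(f"{output_name} <- {source_name}")
```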
Make sure the HTMLIndexItem represents a CDF file.
>>> index = HTMLIndex(url)
>>> item = next(index.iter(extension=".cdf"))
Now convert the file :
>>> from amda_data_generator.convert.cdf import CDF2NetCDF
>>> converter = CDF2NetCDF(mapping=mapping)
>>> converter.convert(downloader.get(item), output_filename="output.nc")
A new output.nc file is created, and its content can be checked with the ncdump utility:
$ ncdump -h output.nc
netcdf output {
dimensions:
    Time = UNLIMITED ; // (45 currently)
    dim8 = 8 ;
    dim15 = 15 ;
    dim5 = 5 ;
    dim32 = 32 ;
    dim3 = 3 ;
    TimeLength = 17 ;
variables:
    double Time(Time) ;
    float Mag(Time, dim3) ;
    double TIME(Time) ;
    float FLUX(Time, dim8, dim15) ;
    float ENERGY(Time, dim15) ;
    float PANGLE(Time, dim8) ;
    float INTEG_T(Time) ;
    float EDENS(Time) ;
    float TEMP(Time, dim5) ;
    float QP(Time) ;
    float QM(Time) ;
    float QT(Time) ;
    float REDF(Time, dim32) ;
    float VSW(Time, dim3) ;
    char StartTime(TimeLength) ;
    char StopTime(TimeLength) ;
}
The Epoch and MAGF variables have been renamed to Time and Mag, respectively.