Overview
GGG-RS is an extension to the GGG retrieval that provides updated versions of several of the post-processing programs. The long-term intention is for it also to become a library of functions useful for working with GGG-related files.
Installation
At present, installing GGG-RS requires that you be able to build it from source. This requires, at a minimum, a Rust toolchain. If you wish to compile the programs that work with netCDF files, you will also need either
- one of the `micromamba`, `mamba`, or `conda` package managers, or
- the `cmake` build tool.
These are necessary to install or build the netCDF and HDF5 C libraries. Detailed instructions and installation options are provided in the README.
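If you use one of the conda-family package managers, the setup might look roughly like the following sketch. The environment and package names here are illustrative assumptions; the README is the authoritative reference:

```sh
# Create an environment providing the netCDF and HDF5 C libraries
# (package names are assumptions; check the README for the exact ones).
micromamba create -n gggrs-build -c conda-forge libnetcdf hdf5

# Then, with the environment activated, build from the repository root.
micromamba activate gggrs-build
cargo build --release
```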
Documentation
This book primarily focuses on the command line programs provided by GGG-RS. As the library is made available, the APIs will be documented through docs.rs (for the Rust library) and readthedocs.io (for the Python library).
Each command line program will provide help when given the `-h` or `--help` flags. That help should be your first resource to understand how the programs work, as it will be the most up to date. The chapters in this book go into more detail about advanced usage of the programs. If you find something in this book that is out of date or unclear, please open an issue or pull request with the `documentation` tag.
Contributing
The best documentation is the result of collaboration between users and developers with different perspectives. As developers, it is easy for us to assume users have certain background knowledge that, in reality, we only have because we are so immersed in this codebase. In general, if you see something that is unclear, missing, or out of date in this book, please open an issue or, even better, a pull request on the GGG-RS repo, with the "documentation" tag.
Contribution criteria
Requests to update the documentation will be evaluated on several criteria, primarily:
- accuracy (does the change correctly reflect the behavior of the program?),
- clarity (is the change easy to understand and as concise as possible?), and
- scope (does the change add information that should be included in that location, or is it better suited to another source?)
When opening an issue, providing as much information as possible will help it be resolved quickly, as will responding promptly when asked for more input. Likewise, when opening a pull request, providing a clear description of what was changed and why will help us review it efficiently. If you are providing a pull request, please verify that the edits render correctly by following the instructions in Building the book, below.
We reserve the right to turn down requests for changes if, in our opinion, they make the documentation worse, or if the requestor does not provide sufficient information to make the needed change clear. Well explained and respectful requests will usually be accepted.
Common types of questions
Below are some common questions, along with details on what information to provide when asking for an update, to help us resolve the problem efficiently.
The information I need is not where I expected
When opening an issue, be clear about what information you are trying to find, where you expected to find it in the documentation, and why you were looking there. Understanding how you expect the information to be organized helps us examine if there might be other ways we need to connect different parts of the documentation. Generally, the best fix for this problem is to identify where a link between parts of the documentation will help guide people to the correct page. Other solutions may be appropriate in more complicated cases.
A program is not included in the book
First, please check that it is one of the GGG-RS programs. This means it will have a folder under the `src/bin` subdirectory of the repository. The GGG-RS programs will coexist with regular GGG and EGI-RS programs in `$GGGPATH/bin/`, so just because a program exists there does not mean it will be documented here.
If a program really is missing, then either open an issue or create a pull request that adds it. If creating a pull request, please match the structure of the existing programs' documentation.
I do not understand what the documentation is trying to explain
When you encounter something that is not clear, please first try to figure it out yourself by following any examples and testing things out. If that section of the documentation links to external resources (e.g., the TOML format), please review those resources as well.
If something truly is unclear, then open an issue and do your best to explain what you were trying to accomplish and what you found difficult to understand. Explaining the overall task you were trying to accomplish, as well as the part of the documentation that you found unclear, will help us identify if this is an XY problem, where the reason it was unclear is because there is a better solution to your task than the one you were trying to use.
Building the book
If you want to edit this book, it lives in the `book` subdirectory of the GGG-RS repo. The source is in Markdown, and can be edited with any plain text editor (e.g., Notepad, TextEdit, vim, nano) or code editor (e.g., VSCode). It is built with `mdbook` with the additional `mdbook-admonish` preprocessor.
To check that the book renders correctly when making edits:

- Install both `mdbook` and `mdbook-admonish` following the instructions in their repositories.
- In the `book` subdirectory of the GGG-RS repository (which you will have cloned to your computer), run `mdbook serve`.
- Copy the "localhost" link it prints and paste it into any web browser.

You will see the rendered local copy of the book, and it will automatically update each time you make changes.
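In practice, the steps above might look like this (a sketch; see the `mdbook` and `mdbook-admonish` repositories for the authoritative installation instructions):

```sh
cargo install mdbook mdbook-admonish  # install both tools from crates.io
cd book                               # from the root of your GGG-RS clone
mdbook serve                          # prints a localhost URL to open in a browser
```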
Please do this before submitting any pull requests, as it will slow things down significantly if we have to iterate multiple times with you to ensure the book builds correctly.
Setup helper programs
change_ggg_files
list_spectra
Post processing programs
Post processing programs are used after completing GFIT runs on all windows. These programs perform a combination of collating and averaging data from the different windows and applying the necessary post hoc corrections to produce the best quality data.
We are currently in a transitional period, where the post processing programs are still provided in a mix of languages. Some remain the original Fortran versions from GGG2020, others have been replaced with Rust versions, and the private netCDF writer remains written in Python. The intention is to transition away from the Python netCDF writer in GGG2020.2, but to retain a mix of Fortran and Rust programs at that time. Whether all post processing programs transition to Rust depends on whether there is a need for more flexibility in all programs.
EM27/SUN users
Those who use GGG to process EM27/SUN data must be aware that EGI (the wrapper program to streamline processing of EM27/SUN data with GGG) is also in a transitional phase. The original EGI does not use GGG-RS programs; instead, it patches some of the existing GGG Fortran programs to work with EM27/SUN data and works around some limitations of the Fortran post processing code by swapping out some of the configuration files on disk. This works, but is inconvenient when you need to process both TCCON and EM27/SUN data.
A full rewrite of EGI, EGI-RS, is in progress. This is intended to be easier to maintain and more modular, allowing smaller parts of the GGG workflow to be run independently with EGI-RS automation as needed. EGI-RS does use the programs provided by GGG-RS, and in fact relies on several of them to simplify switching between TCCON and EM27/SUN configurations. Throughout this section, the EM27/SUN standard use sections refer to EGI-RS use. Those still using the original EGI should be aware that while the role of each program in the EM27/SUN post processing is the same as that of its original Fortran predecessor, the specifics of how it is run may differ.
collate_tccon_results
Purpose
`collate_tccon_results` combines output from the various `.col` files in a GGG run directory with ancillary data from the runlog and `.ray` file into a single file. It also computes retrieved quantities from the `.col` files if needed. Which `.col` files are read is determined by a `multiggg.sh` file, which is expected to contain calls to `gfit` (one per line) for each window processed. An example of the first few lines of a `multiggg.sh` file is:
/home/user/ggg/bin/gfit luft_6146.pa_ggg_benchmark.ggg>/dev/null
/home/user/ggg/bin/gfit hf_4038.pa_ggg_benchmark.ggg>/dev/null
/home/user/ggg/bin/gfit h2o_4565.pa_ggg_benchmark.ggg>/dev/null
/home/user/ggg/bin/gfit h2o_4570.pa_ggg_benchmark.ggg>/dev/null
Unlike the standard `collate_results`, `collate_tccon_results` does not rely on the ZPD times of spectra to determine whether successive spectra in the runlog (from different detectors) should have their outputs grouped into a single line in the output file. Instead, it uses the spectrum names.
Examples
The most common way to run this is from inside a GGG run directory (i.e., a directory containing the `multiggg.sh`, `.ggg`, `.mav`, and `.ray` files created by `gsetup`). In that case, you will call it with a single positional argument: `v` to create a file containing vertical column densities, or `t` to create one containing VMR scale factors. The output file will have the same name as the runlog pointed to in the `.ggg` and `.col` files, with the extension `.vsw` or `.tsw`:
$GGGPATH/bin/collate_tccon_results v
If you need to run this program from outside of a GGG run directory, you can use the `--multiggg-file` option to point to the `multiggg.sh` file to read windows from. In this case, the output will be written to the same directory as the `multiggg.sh` file:
$GGGPATH/bin/collate_tccon_results --multiggg-file /data/ggg/xx20250101_20250301/multiggg.sh
This program relies on being able to determine a "primary" detector in order to know which spectra represent a new observation. If you have a nonstandard setup that does not use "a" as the character in the spectrum name to represent the InGaAs detector, you can use the `--primary-detector` option to specify a different character. For example, if it should look for spectra with "g" as the detector indicator:
$GGGPATH/bin/collate_tccon_results --primary-detector g v
Use in TCCON standard processing
Most users will use this as part of running the `post_processing.sh` script to create the initial `.vsw` and `.tsw` files. However, in any case, it will always be the first program run after GFIT, as the other post processing programs need a post-processed file as input (i.e., they do not read from the `.col` files).
Use in EM27/SUN standard processing
`collate_tccon_results` is used identically when processing EM27/SUN data as when processing TCCON data. Unlike the Fortran version of `collate_results`, it does not need to be adapted to account for the shorter time between successive observations.
apply_tccon_airmass_correction
Purpose
`apply_tccon_airmass_correction` does two things:
- Converts gas column densities (in molecules per area) to column averages by dividing by the O2 column and multiplying by the mean O2 atmospheric dry mole fraction.
- Applies a solar zenith angle-dependent correction to those column averages that require it, as defined by a configuration file.
This can operate either on individual windows' column densities or on column densities calculated by averaging together all the windows for a given gas. The former is considered to be more accurate, as it allows for different airmass dependence per window (which depends on the spectroscopy in that window), but the latter is preserved for backwards compatibility.
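In equation form, the first step computes the column average as

$$ X_\mathrm{gas} = f_{\mathrm{O}_2} \cdot \frac{\mathrm{column}_\mathrm{gas}}{\mathrm{column}_{\mathrm{O}_2}} $$

where the columns are vertical column densities and \\( f_{\mathrm{O}_2} \approx 0.2095 \\) is the mean dry-air mole fraction of O2 used in the TCCON literature (quoted here for illustration; the program's internal constant is authoritative).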
Examples
This program requires two arguments: a path to a file defining the airmass corrections and a path to either a `.vsw` file (created by `collate_tccon_results`) or a `.vav` file (created by `average_results`):
$GGGPATH/bin/apply_tccon_airmass_correction CORRECTION_FILE VSW_OR_VAV_FILE
The `CORRECTION_FILE` will usually be one of those supplied with GGG, in the `$GGGPATH/tccon` subdirectory. See the configuration section for the details of this file's format if you need to modify one or create your own.
Use in TCCON standard processing
For TCCON standard processing, the `CORRECTION_FILE` must be `$GGGPATH/tccon/corrections_airmass_preavg.dat`, as these are the airmass correction factors derived for the required TCCON data version. It must be run on the `.vsw` file output by `collate_tccon_results`, as the TCCON standard processing uses per-window airmass corrections. This is automatically configured in the `post_processing.sh` file; therefore, standard users should rely on the `post_processing.sh` script to run the required post processing steps in the correct order.
Use in EM27/SUN standard processing
As of GGG2020, EM27/SUNs still use per-gas airmass corrections, rather than per-window corrections. Therefore, this program must be run on the `.vav` file output by `average_results`, using the EM27/SUN-specific airmass corrections included with EGI-RS. If you are using EGI-RS correctly, it will automatically create a `post_processing.sh` file with the correct post processing order for an EM27/SUN, so normal users should rely on that.
Airmass correction file format
For backwards compatibility, this file is in the typical GGG input file format: a line specifying the "shape" of the file, one or more header rows, then tabular data. These files come in two forms. The first defines airmass corrections for each window, and includes not only the magnitude of the correction but also two additional parameters that adjust the form of the equation used for the correction.
Per-window format
The GGG2020 TCCON standard file in this format is as follows:
15 5
2017-02-16 GCT
2015-08-11 DW
2019-01-14 JLL
2020-12-04 JLL: extrapolated to Xluft = 0.999
2021-02-22 JLL: fit for mid-trop PT = 310 K
2021-07-15 JLL: uncertainties added from 2-sigma std dev of bootstrapped PT = 310 K fits
Contains airmass-dependent correction factors to be applied to the
column-averaged mole fractions. These are determined offline from the
symmetric component of the diurnal variation using derive_airmass_correction.
The ADCF_Err should be the 1-sigma standard deviations which represent day-to-
day variability. This vastly overestimates the uncertainty in the average
value, however the standard error underestimates the uncertainty.
g and p are the zero-SZA and exponent in the ADCF form.
Gas ADCF ADCF_Err g p
"xco2_6220" -0.00903 0.00025 15 4
"xco2_6339" -0.00512 0.00025 45 5
"xlco2_4852" 0.00008 0.00018 -45 1
"xwco2_6073" -0.00235 0.00016 -45 1
"xwco2_6500" -0.00970 0.00026 45 5
"xch4_5938" -0.00971 0.00046 25 4
"xch4_6002" -0.00602 0.00053 -5 2
"xch4_6076" -0.00594 0.00044 15 3
"xn2o_4395" 0.00523 0.00054 -5 2
"xn2o_4430" 0.00426 0.00042 13 3
"xn2o_4719" -0.00267 0.00056 -15 2
"xco_4233" 0.00000 0.00000 13 3
"xco_4290" 0.00000 0.00000 13 3
"xluft_6146" 0.00053 0.00017 -45 1
The components are as follows:
- The first line specifies the number of header lines and the number of data columns. This must be two integers separated by whitespace. The number of header lines includes this line and the column headers.
- The next `nhead-2` lines (line numbers 2 to 14 in this case) are free format; these are skipped by the program. You can see in the example that these are used to record the history of the file and notes about the content of the file.
- The last header line (line number 15 in this case) gives the column names; it must include the five columns shown here.
A common error is to add lines to the header without updating the number of header lines on the first line. If you get an error running `apply_tccon_airmass_correction` after editing the correction file's header, double check that you also updated the number of header lines!
The data are as follows:
- "Gas" is the Xgas window name that the correction defined on this line applies to.
It must be a string that matches a non-error column in the input
.vsw
file with "x" prepended. As this is read in as list directed format data, it is recommended to quote the strings. - "ADCF" is the airmass dependent correction factor, it determines the magnitude of the airmass correction.
- "ADCF_Err" is the uncertainty on the ADCF.
- "g" and "p" are parameters in the airmass correction equation.
Deriving the correction parameters is a complicated process. For details, along with the definition of the airmass correction equation, please see section 8.1 of the GGG2020 paper.
Per-gas format
The second format of the airmass correction file is as follows:
13 3
2017-02-16 GCT
2015-08-11 DW
Contains airmass-dependent and airmass-independent (in situ)
correction factors to be applied to the column-averaged mole fractions.
The former (ADCF) is determined offline from the symmetric component
of the diurnal variation using derive_airmass_correction.
The ADCF_Err are the 1-sigma standard deviations which represent day-to-
day variability. This vastly overestimates the uncertainty in the average
value, however the standard error underestimates the uncertainty.
The latter (AICF) is determined offline by comparisons with in situ profiles.
AICF_Err (uncertainties) are 1-sigma standard deviations from the best fit.
Gas ADCF ADCF_Err
"xco2" -0.0049 0.0009
"xch4" -0.0045 0.0005
"xn2o" 0.0133 0.0001
"xco" 0.0000 0.0001
"xh2o" -0.0000 0.0001
"xluft" 0.0027 0.0005
This is a simplified version of the per-window format above. As above, the first line defines the number of header lines and data columns. This file must have three data columns: "Gas", "ADCF", and "ADCF_Err". These have the same meanings as in the per-window format. The "g" and "p" columns can be omitted, as shown here.
apply_tccon_insitu_correction
Purpose
`apply_tccon_insitu_correction` has a single purpose: to apply a scalar scale factor, as a divisor, to specific column-average quantities. This is typically used to ensure that these quantities are tied to the same metrological scale as comparable in situ data.
Examples
This program requires two arguments: a path to a file defining the scaling corrections and a path to a `.vav.ada` file, created by `apply_tccon_airmass_correction`:
$GGGPATH/bin/apply_tccon_insitu_correction CORRECTION_FILE VAV_ADA_FILE
The `CORRECTION_FILE` will usually be one of those supplied with GGG, in the `$GGGPATH/tccon` subdirectory. See the configuration section for the details of this file's format if you need to modify one or create your own.
Use in TCCON standard processing
For TCCON standard processing, the `CORRECTION_FILE` must be `$GGGPATH/tccon/corrections_insitu_postavg.dat`, as these are the in situ correction factors derived for the required TCCON data version. It must be run on the `.vav.ada` file output by `apply_tccon_airmass_correction`. This is automatically configured in the `post_processing.sh` file; therefore, standard users should rely on the `post_processing.sh` script to run the required post processing steps in the correct order.
Use in EM27/SUN standard processing
EM27/SUNs have their own in situ correction factors. These correction factors are provided with EGI-RS and automatically added to the `$GGGPATH/tccon` directory when running the `em27-init` program included with EGI-RS. If you are using EGI-RS correctly, it will automatically create a `post_processing.sh` file with the correct post processing order for an EM27/SUN, so normal users should rely on that.
In situ correction file format
For backwards compatibility, this file is in the typical GGG input file format: a line specifying the "shape" of the file, one or more header rows, then tabular data. The standard TCCON GGG2020 file is shown here as an example:
19 4
2017-02-16 GCT
2015-08-11 DW
2021-07-26 JLL
This file contains airmass-independent correction factors (AICFs) determined
offline by comparison against in situ data. CO2, wCO2, lCO2, CH4, and H2O use AICFs
determined as the weighted mean ratio of TCCON to in situ Xgas values. N2O
uses the ratio of TCCON to surface-derived in situ XN2O fit to a mid-tropospheric
potential temperature of 310 K, ensuring that it is fixed to the same temperature
as the ADCFs. CO, H2O, and Luft are not corrected.
For CO2, wCO2, lCO2, CH4, H2O, and CO the AICF_Err (uncertainties) are 2-sigma standard
deviations of bootstrapped mean ratios. For N2O, the error equals the fit vs. potential
temperature multiplied by twice the standard deviation of potential temperatures
experienced by the TCCON network.
The WMO_Scale column gives the WMO scale from the in situ data that the scale
factor ties to. There must be something in this column and must be quoted;
use "N/A" for gases with no WMO scaling. NB: although CO has no WMO scale (because it
is not scaled), the uncertainty was determined from the measurements on the WMO X2014A scale.
Gas AICF AICF_Err WMO_Scale
"xco2" 1.0101 0.0005 "WMO CO2 X2007"
"xwco2" 1.0008 0.0005 "WMO CO2 X2007"
"xlco2" 1.0014 0.0007 "WMO CO2 X2007"
"xch4" 1.0031 0.0014 "WMO CH4 X2004"
"xn2o" 0.9821 0.0098 "NOAA 2006A"
"xco" 1.0000 0.0526 "N/A"
"xh2o" 0.9883 0.0157 "ARM Radiosondes (Lamont+Darwin)"
"xluft" 1.0000 0.0000 "N/A"
The components are as follows:
- The first line specifies the number of header lines and the number of data columns. This must be two integers separated by whitespace. The number of header lines includes this line and the column headers.
- The next `nhead-2` lines (line numbers 2 to 18 in this case) are free format; these are skipped by the program. You can see in the example that these are used to record the history of the file and notes about the content of the file.
- The last header line (line number 19 in this case) gives the column names; it must include the four columns shown here.
A common error is to add lines to the header without updating the number of header lines on the first line. If you get an error running `apply_tccon_insitu_correction` after editing the correction file's header, double check that you also updated the number of header lines!
The data are as follows:
- "Gas" refers to the column in the
.vav.ada
file that the correction applies to (along with the associated error, i.e., "xco2" will apply to both the "xco2" and "xco2_error" columns). - "AICF" is the scaling factor; the Xgas and error will be divided by this.
- "AICF_Err" is the uncertainty on the scaling factor.
- "WMO_Scale" is the metrological scale or other reference to which the scale factor ties the These must be quoted strings.
add_nc_flags
Purpose
`add_nc_flags` is a supplemental program that can be used to add extra flags to a private netCDF file. It is intended to complement the default flagging done by `write_netcdf`, which reflects both the permitted value ranges defined in your site's `??_qc.dat` and `??_manual_flagging.dat` files (found under `$GGGPATH/tccon`). In particular, it is useful if you need to include more complex logic, such as only flagging based on a combination of variables.
Examples
Quick flagging
The `quick` subcommand allows you to specify the filter criteria based on a single variable via the command line. This first example will flag any data where the residual in the O2 window is above 0.5, and will modify the existing netCDF file:
$GGGPATH/bin/add_nc_flags quick \
--in-place \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--nc-file PRIVATE_NC_FILE
Note that we have separated the command into multiple lines solely for readability; you can write this as a single line.
If instead you want to create a copy rather than modify the existing netCDF file, use the `--output` flag:
$GGGPATH/bin/add_nc_flags quick \
--output NEW_NC_FILE \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--nc-file PRIVATE_NC_FILE
This will not create `NEW_NC_FILE` if no data needed to be flagged. If you want to enforce that a new file is always created, add the `--always-copy` flag:
$GGGPATH/bin/add_nc_flags quick \
--output NEW_NC_FILE \
--always-copy \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--nc-file PRIVATE_NC_FILE
If you want to limit the flags to a specific time period, use `--time-less-than` and `--time-greater-than`. Note that the values must be given in UTC:
$GGGPATH/bin/add_nc_flags quick \
--output NEW_NC_FILE \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--time-greater-than 2025-04-01T12:00 \
--time-less-than 2025-05-01 \
--nc-file PRIVATE_NC_FILE
This will only apply flags to data with a ZPD time greater than or equal to 12:00Z on 1 Apr 2025 and less than or equal to 00:00Z on 1 May 2025. Note that in the less-than argument we omit the hour and minute.
There are many more options; see the command line help for a full list.
TOML-based flagging
For more complicated flagging, you can define your flagging criteria in a TOML file. You can create an example file with the `toml-template` subcommand:
$GGGPATH/bin/add_nc_flags toml-template example.toml
This will create `example.toml` in the current directory. Once you have defined your filters, you apply this file with the `toml` subcommand. As with the `quick` subcommand, flags can be applied to the existing file:
$GGGPATH/bin/add_nc_flags toml TOML_FILE --in-place --nc-file PRIVATE_NC_FILE
or to a copy of it:
$GGGPATH/bin/add_nc_flags toml TOML_FILE --output NEW_NC_FILE --nc-file PRIVATE_NC_FILE
For details on the TOML file settings, see the following section.
Use in TCCON standard processing
This program is not used by default in TCCON post processing (that is, it will not be included in the `post_processing.sh` script).
Users are welcome to use it separately to flag out data with known problems from the private netCDF files before uploading to Caltech.
Use in EM27/SUN standard processing
EGI-RS will include a line in the `post_processing.sh` script to run this program on the private netCDF file.
The intention is for users to add extra data checks to deal with EM27/SUN-specific issues that may affect the data.
Additionally, EGI-RS may add certain required filters in the future as the use of GGG for EM27/SUN retrievals matures.
`add_nc_flags` TOML filter configuration
Filters and groups
The TOML files used to define flags for `add_nc_flags` use the terms "group" and "filter" as follows:
- A "filter" defines an allowed range for a single variable. It may specify an upper or lower limit, an allowed range, or an excluded range.
- A "group" consists of one or more filters.
An observation in the netCDF file will be flagged if any of the groups defined in the TOML file indicate that it should be. A group will indicate an observation should be flagged if all of the filters in that group indicate it should be flagged.
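As a sketch of that logic, the flag decision is an "or of ands". The types below are hypothetical stand-ins, not the actual GGG-RS implementation:

```rust
// Hypothetical stand-ins for the parsed TOML structures.
struct Observation; // one row of netCDF variable values
struct Filter;      // one variable range check

struct Group {
    filters: Vec<Filter>,
}

impl Filter {
    fn matches(&self, _obs: &Observation) -> bool {
        // The real program compares the filter_var value against
        // greater_than/less_than, honoring value_mode.
        true
    }
}

/// An observation is flagged if ANY group matches, and a group
/// matches only if ALL of its filters match.
fn should_flag(obs: &Observation, groups: &[Group]) -> bool {
    groups.iter().any(|g| g.filters.iter().all(|f| f.matches(obs)))
}
```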
The first example shows how you would define a TOML file that duplicates the filter we used in the quick filter example:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_rmsocl"
greater_than = 0.5
This defines a single check: if the value of the `o2_7885_rmsocl` variable in the netCDF file is >= 0.5, that observation will be flagged.
Now suppose that we wanted to flag observations only if `o2_7885_rmsocl >= 0.5` and `o2_7885_cl < 0.05`, perhaps to remove observations where our instrument was not tracking the sun well. To do this, we add a second filter to this group like so:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_rmsocl"
greater_than = 0.5
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
Now, because these both come under the same `[[groups]]` heading, observations will only be flagged if both conditions are true.
What if we wanted an "or", that is, to flag if either one of two (or more) conditions is true? That requires multiple groups:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_rmsocl"
greater_than = 0.5
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_sg"
greater_than = 0.1
Because each filter comes after its own `[[groups]]` header, they fall in separate groups. If either group has all of its filters return true, then the observation will be flagged. In this case, that means that an observation with `o2_7885_rmsocl >= 0.5` or `o2_7885_sg >= 0.1` will be flagged.
What if we wanted to flag an observation with an `o2_7885_sg` value too far from zero in either direction? We can do that with the `value_mode` key, like so:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_sg"
greater_than = 0.1
less_than = -0.1
value_mode = "outside"
This will cause an observation to be flagged if `o2_7885_sg <= -0.1` or `o2_7885_sg >= +0.1`. If we do not specify `value_mode`, the default is "inside", which in this case would flag if `-0.1 <= o2_7885_sg <= +0.1`.
Limiting to times
The TOML file allows you to specify that it should only apply to a specific time frame with the `[timespan]` section. This section allows three keys: `time_less_than`, `time_greater_than`, and `time_mode`.
For example, perhaps you wish to filter on continuum level only between two times when you know your instrument was not tracking the sun correctly. You could do so with:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
[timespan]
time_greater_than = "2025-01-01T00:00:00"
time_less_than = "2025-05-01T00:00:00"
Note that the times must be in UTC and given in the full "yyyy-mm-ddTHH:MM:SS" format; unlike the `quick` command line options, you cannot truncate these to just "yyyy-mm-dd" or "yyyy-mm-ddTHH:MM".
`time_mode`, similar to `value_mode` in the filters, allows you to flag only observations outside of the given time range, rather than inside it:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
[timespan]
time_greater_than = "2025-01-01T00:00:00"
time_less_than = "2025-05-01T00:00:00"
time_mode = "outside"
This will apply the filter to any data before 1 Jan 2025 and after 1 May 2025, whereas the previous example would apply to data between those two dates.
Changing the flag
Flag value
By default, when `add_nc_flags` applies a flag, it does so by adding 9000 to the value of the `flag` variable for that observation. This preserves any of the standard flags from variables defined in the `??_qc.dat` file. By TCCON convention, a 9 in the thousands place of the flag represents a generic "other" manual flag.
If you wish to use one of the other values to distinguish the nature of the problem, you can define a `[flags]` section:
[flags]
flag = 8
# The configuration must always include a [[groups]] entry with at least one filter.
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
This will add 8000 to the flag instead of 9000; i.e., it will put an 8 in the thousands place. The value of `flag` must be between 1 and 9, since it must fit into the thousands place. Some values have existing meanings. Currently these are defined in a JSON file bundled with the private netCDF writer. When the private netCDF writer is incorporated into GGG-RS, that definition file will be moved into this repository.
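For illustration, the arithmetic amounts to setting the thousands digit while leaving the other digits alone. This is a minimal sketch, not the actual implementation:

```rust
/// Set the thousands place of `flag` to `digit` (1-9), preserving the
/// hundreds-and-below digits (standard flags) and the ten-thousands
/// place (release flags).
fn set_thousands_place(flag: i32, digit: i32) -> i32 {
    assert!((1..=9).contains(&digit));
    flag / 10_000 * 10_000 + digit * 1_000 + flag % 1_000
}

fn main() {
    assert_eq!(set_thousands_place(23, 9), 9_023); // default behavior
    assert_eq!(set_thousands_place(23, 8), 8_023); // with flag = 8
}
```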
Behavior for existing flags
We can also adjust how `add_nc_flags` behaves if there is already a value in the thousands place. By default, it will error. We can change this by setting the `existing_flags` key in the `[flags]` section. For example, to keep existing values in the thousands place of the flag (which would be set by either the time periods defined in your `$GGGPATH/tccon/??_manual_flagging.dat` or a previous run of `add_nc_flags`):
[flags]
flag = 8
existing_flags = "skip"
# The configuration must always include a [[groups]] entry with at least one filter.
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
The possible (case insensitive) values for `existing_flags` are:
- `"error"` (default) - error if any of the observations to be flagged already have a non-zero value in the flag's thousands place.
- `"skip"` - if an observation to be flagged already has a value in the flag's thousands place, leave the existing value.
- `"skipeq"` - if the value in the thousands place is 0, it will be replaced; otherwise `add_nc_flags` will error unless the value matches what it would insert.
- `"overwrite"` - replace any existing value in the flag's thousands place.
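These semantics can be summarized in a small sketch (hypothetical code; the real program's types and error handling differ):

```rust
/// Resolve the new thousands-place digit given the `existing` digit and
/// the configured `new` digit; `Ok(None)` means leave the flag unchanged.
fn resolve_existing(mode: &str, existing: u32, new: u32) -> Result<Option<u32>, String> {
    match mode.to_lowercase().as_str() {
        "error" if existing != 0 => Err("flag already set in thousands place".into()),
        "error" => Ok(Some(new)),
        "skip" if existing != 0 => Ok(None), // keep the existing value
        "skip" => Ok(Some(new)),
        "skipeq" if existing == 0 || existing == new => Ok(Some(new)),
        "skipeq" => Err("existing flag differs from the configured value".into()),
        "overwrite" => Ok(Some(new)),
        other => Err(format!("unknown existing_flags value: {other}")),
    }
}
```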
Flag type
The default behavior, as mentioned above, is to modify the thousands place of the flag for observations to be flagged. `add_nc_flags` can also edit the ten thousands place, which is used for release flags. To do so, set `flag_type` to `"release"` in the `[flags]` section:
[flags]
flag = 8
flag_type = "release"
# The configuration must always include a [[groups]] entry with at least one filter.
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
Release flags are intended to be set by Caltech personnel based on input from the reviewers during data QA/QC. If you use `add_nc_flags` to set release flags in other circumstances, this can lead to significant confusion when trying to establish the provenance of certain flags. Please do not use `flag_type = "release"` unless you have received specific guidance to do so!
write_public_netcdf
Purpose
`write_public_netcdf` converts the private (a.k.a. engineering) netCDF files into the smaller, more user-friendly files distributed to most TCCON users. This includes:
- limiting the variables to the most useful,
- removing `flag != 0` data,
- optionally limiting the data in the file based on the desired data latency, and
- expanding the averaging kernels and prior profiles to be one-per-spectrum.
If you previously used the Python netCDF writer, you may be used to it defaulting to respecting a data latency (a.k.a. release lag) defined in a site information JSON file. This version of the netCDF writer defaults to no data latency; that is, it assumes that you want to include all data from the given private file in the new public file. See the examples below for how to apply a data latency to withhold the newest data from the public file.
Examples
The simplest use will convert the `PRIVATE_NC_FILE` into a public format file. This assumes that the `PRIVATE_NC_FILE` filename begins with the two-character site ID for your site. The public file will be in the same directory as the private file, and its name will reflect the site ID and the date range of the `flag == 0` data:
$GGGPATH/bin/write_public_netcdf PRIVATE_NC_FILE
To avoid renaming the public file to match the dates of `flag == 0` data, use the `--no-rename-by-dates` flag. This will replace "private" in the extension with "public", so if `PRIVATE_NC_FILE` was `pa_ggg_benchmark.private.qc.nc`, the public file would be named `pa_ggg_benchmark.public.qc.nc`:
$GGGPATH/bin/write_public_netcdf --no-rename-by-dates PRIVATE_NC_FILE
Both of the above examples will use the standard TCCON configuration to determine which variables to copy. To use the extended TCCON configuration (which will include gases from the secondary detector), add the `--extended` flag:
$GGGPATH/bin/write_public_netcdf --extended PRIVATE_NC_FILE
If you need to customize which variables are copied, you must create your own configuration TOML file and pass it to the `--config` option:
$GGGPATH/bin/write_public_netcdf --config CUSTOM_CONFIG.toml PRIVATE_NC_FILE
For information on the configuration file format, see its section of this book.
To withhold the newest data from the public file, you can use the `--data-latency-date` or `--data-latency-days` options to specify, respectively, a specific date after which to withhold data or a number of days in the past from today.
$GGGPATH/bin/write_public_netcdf --data-latency-date 2025-01-01 PRIVATE_NC_FILE
$GGGPATH/bin/write_public_netcdf --data-latency-days 120 PRIVATE_NC_FILE
The first one will withhold data with a ZPD time after midnight UTC on 1 Jan 2025 from the public file. The second will withhold data with a ZPD time after midnight UTC 120 days ago from the public file: if run on 1 May 2025 (UTC), this would also have 1 Jan 2025 as the cutoff date.
Use in TCCON standard processing
Individual TCCON sites should not need to use this program under normal circumstances. This program will be run at Caltech on the concatenated and quality controlled private netCDF files, and the resulting public netCDF files will be made available through tccondata.org. This function is provided as part of GGG-RS for sites that have, for example, low latency or custom products delivered to specific users as non-standard TCCON data, but wish to provide the data in the user-friendly public format instead of the much more intimidating private file format. Presently, you will need to follow the instructions on the TCCON wiki to generate a concatenated and quality controlled private file, then run this program on the resulting file. Note that access permission is required for this wiki page to track who is generating GGG public files.
Use in EM27/SUN standard processing
As there is not yet an equivalent of the TCCON data pipeline at Caltech for EM27/SUN data processed with GGG, operators will likely want to use this program to generate public files of their data for upload to whatever data repository they host from. Presently, you will need to follow the instructions on the TCCON wiki to generate a concatenated and quality controlled private file, then run this program on the resulting file. Note that access permission is required for this wiki page to track who is generating GGG public files.
Configuration
The public netCDF writer must strike a balance between being strict enough to ensure that the variables required for standard TCCON usage are included in normal operation and flexible enough to allow non-standard usage. To that end, the writer by default requires that the standard TCCON variables be present, but can be configured to change the required variables.
The configuration file uses the TOML format, and can be broadly broken down into five sections:
- auxiliary variables,
- computed variables,
- Xgas variable sets,
- Xgas discovery, and
- default settings.
Auxiliary variables
Auxiliary variables are those which are not directly related to one of the target Xgases but which provide useful information about the observations. Common examples are time, latitude, longitude, solar zenith angle, etc.
These are defined in the `aux` section of the TOML file as an array of tables. The simplest way to define an auxiliary variable to copy is to give the name of the private variable in the netCDF file and what value to use as the long name:
[[aux]]
private_name = "solzen"
long_name = "solar zenith angle"
This will copy the variable `solzen` from the private netCDF file along with all its attributes except `standard_name` and `precision`, add the `long_name` attribute, and put the variable's data (subsetting to `flag == 0` data) into the public file as `solzen`. Note that the `long_name` value should follow the CF conventions meaning. We prefer `long_name` over `standard_name` because the available standard names do not adequately describe remotely sensed quantities.
If instead you wanted to rename the variable in the public file, you can add the `public_name` field:
[[aux]]
private_name = "solzen"
public_name = "solar_zenith_angle"
long_name = "solar zenith angle"
This would rename the variable to `solar_zenith_angle` in the public file, but otherwise behave identically to the above.
You can also control the attributes copied through two more fields, `attr_overrides` and `attr_to_remove`. `attr_overrides` is a TOML table of attribute names and values that will be added to the public variable. If an attribute is listed in the private file with the same name as an override, the override value in the config takes precedence. `attr_to_remove` is an array of attribute names to skip copying if present. (If one of these attributes is not present in the private file, nothing happens.) An example:
[[aux]]
private_name = "day"
long_name = "day of year"
attr_overrides = {units = "Julian day", description = "1-based day of year"}
attr_to_remove = ["vmin", "vmax"]
This will add or replace the attributes `units` and `description` in the public file with those given here, and ensure that the `vmin` and `vmax` attributes are not copied. Take note that specifying `attr_to_remove` overrides the default list of `standard_name` and `precision`; this can be useful if you want to retain those (you can do so by specifying `attr_to_remove = []`), but if you want to exclude them, you must add them to your list.
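For example, to keep `standard_name` and `precision` on a copied variable, you can supply an empty removal list. This is a minimal sketch reusing the `solzen` variable from above:

```toml
[[aux]]
private_name = "solzen"
long_name = "solar zenith angle"
# An empty list overrides the default removal of standard_name and precision.
attr_to_remove = []
```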
Finally, by default any auxiliary variable listed here must be found in the private netCDF file, or the public writer stops with an error. To change this behavior so that a variable is optional, add the `required = false` field to an aux variable:
[[aux]]
private_name = "day"
long_name = "day of year"
required = false
Each auxiliary variable to copy will have its own `[[aux]]` section, for example:
[[aux]]
private_name = "time"
long_name = "zero path difference UTC time"
[[aux]]
private_name = "year"
long_name = "year"
[[aux]]
private_name = "day"
long_name = "day of year"
attr_overrides = {units = "Julian day", description = "1-based day of year"}
[[aux]]
private_name = "solzen"
long_name = "solar zenith angle"
By default, any of the standard TCCON auxiliary variables not listed will be added. See the Defaults section for how to modify that behavior.
Computed variables
Computed variables are similar to auxiliary variables in that they are not directly associated with a single Xgas. Unlike auxiliary variables, these cannot simply be copied from the private netCDF file. Instead, they must be computed from one or more private variables. Because of that, there is a specific set of these variables pre-defined by the public writer. Currently only one computed variable type exists, "prior_source". You can specify it in the configuration as follows:
[[computed]]
type = "prior_source"
By default, this creates a public variable named "apriori_data_source". You can change this with the `public_name` field, e.g.:
[[computed]]
type = "prior_source"
public_name = "geos_source_set"
Xgas discovery
Usually we do not want to specify every single Xgas to copy; instead, we want the writer to scan the private file to identify Xgases and copy everything that matches. This both saves a lot of tedious typing in the configuration and minimizes the possibility of copy-paste errors.
Rules
The first part of the discovery section is a list of rules for how to find Xgas variables. These come in two variants:
- Suffix rules: these look for variables that start with an Xgas-like pattern and end in the given suffix. The full regex is `^x([a-z][a-z0-9]*)_{suffix}$`, where `{suffix}` is the provided suffix. Note that the suffix is passed through `regex::escape` to ensure that any special characters are escaped; it will only be treated as a literal.
- Regex rules: these allow you to specify a regular expression to match variable names. The regex must include a named capture group with the name "gas" that extracts the physical gas abbreviation (i.e., the `gas` value in an Xgas entry). This looks like `(?<gas>...)`, where the `...` is the regular subexpression that matches that part of the string.
By default, the configuration will add a single regex rule that matches the pattern `^x(?<gas>[a-z][a-z0-9]*)$`. You can disable this by setting `xgas_rules = false` in the `[defaults]` section of the config. This rule is designed to match basic Xgas variables, e.g., "xch4", "xn2o", etc.
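If that default rule does not suit your files, disabling it is a one-line change. A minimal sketch (any other keys in `[defaults]` are unaffected):

```toml
[defaults]
xgas_rules = false
```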
An example of a regular expression rule that uses the default ways to infer its ancillary variables is:
[[discovery.rule]]
regex = '^column_average_(?<gas>\w+)$'
Two things to note are:
- The regular expression is inside single quotes; this is how TOML specifies literal strings, and it is the most convenient way to write regexes that include backslashes. (Otherwise TOML itself will interpret them as escape characters.)
- The regex includes `^` and `$` to anchor the start and end of the pattern. In most cases, you will probably want to do so as well to avoid matching arbitrary parts of variable names.
An example of a suffix rule that also indicates that variables matching this rule should not include averaging kernels or the traceability scale is:
[[discovery.rule]]
suffix = "mir"
ak = { type = "omit" }
traceability_scale = { type = "omit" }
Note that the suffix rule contains a "suffix" key, while the regular expression rule has a "regex" key - this is how they are distinguished. Also note that rules are checked in order, and a variable is added following the first rule that matches. This means that if a variable matches multiple rules, then its ancillary variables will be set up following the first rule that matched.
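To illustrate the ordering, consider this hypothetical pair of rules. A variable such as "xch4_mir" matches both patterns, but is handled by the first (suffix) rule because rules are checked in order:

```toml
# Checked first: mid-IR products, with no AKs copied.
[[discovery.rule]]
suffix = "mir"
ak = { type = "omit" }

# Checked second: any other suffixed Xgas variable.
[[discovery.rule]]
regex = '^x(?<gas>[a-z][a-z0-9]*)_\w+$'
```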
Attributes
Discovery rules can specify the fields `xgas_attr_overrides`, `xgas_error_attr_overrides`, `prior_profile_attr_overrides`, `prior_xgas_attr_overrides`, and `ak_attr_overrides` to set attributes on their respective variables.
These should be used for attributes that will be the same for all the variables of that type created by this rule.
For example, to add a cautionary note about experimental data to our previous mid-IR discovery rule:
[[discovery.rule]]
suffix = "mir"
xgas_attr_overrides = { note = "Experimental data, use with caution!" }
ak = { type = "omit" }
traceability_scale = { type = "omit" }
Ancillary variables
The rules also include default settings for the prior profile, prior column average, averaging kernel (and its slant Xgas bins), and the traceability scale. These can be specified the same way as described in the Xgases ancillary subsection, and the defaults are the same as well. However, only the `inferred` and `omit` types may be used, as `specified` does not make sense when a rule may apply to more than one Xgas.
Exclusions
The second part of the discovery section consists of lists of gases or variables to exclude. The first option is `excluded_xgas_variables`. If a variable's private file name matches one of the names in that list, it will not be copied, even if it matches one of the rules. The other option is `excluded_gases`, which matches not the variable name but the physical gas.
The easiest way to explain this is to consider the standard TCCON configuration:
[discovery]
excluded_xgas_variables = ["xo2"]
excluded_gases = ["th2o", "fco2", "zco2"]
`excluded_xgas_variables` specifically excludes the "xo2" variable; this would match the default regex rule meant to capture Xgases measured on the primary detector, but we do not want to include it because it is not useful for data users. However, O2 measured on a silicon detector may be useful, so we do not want to exclude all O2 variables. `excluded_gases` lists three gases that we want to exclude no matter which detector they are retrieved from. "fco2" and "zco2" are diagnostic windows (for channelling and zero-level offset, respectively) and so would otherwise be included once for each detector. "th2o" is temperature-sensitive water, which is generally confusing for the average user, so we want to ensure that it is also excluded from every detector.
Explicitly specified Xgases
This section allows you to list specific Xgas variables to copy, along with some or all of the ancillary variables needed to properly interpret them. Usually, you will not specify each Xgas by hand, but instead will use the discovery capability of the writer to automatically find each Xgas to copy. However, variables explicitly listed in this section take precedence over those found by the discovery rules. This leads to two cases where you might specify an Xgas in this section:
- The Xgas does not follow the common naming convention, thus making it difficult to discover with simple rules, or difficult to map from the Xgas variable to the ancillary variable.
- The Xgas needs a different way of handling one of its ancillary variables than the default discovery provides.
Each Xgas has the following options:
- `xgas` (required): the variable name.
- `gas` (required): the physical gas name, e.g., "co2" for all the various CO2 variables (regular, `wco2`, and `lco2`). This is used to match up to, e.g., the priors, which do not distinguish between the different spectral windows.
- `gas_long` (optional): the full name of the gas instead of its abbreviation, e.g., "carbon dioxide" for CO2. If not given, then the configuration will try to find the `gas` value in its `[gas_long_names]` section and use that, falling back on the `gas` value if the gas is not defined in the gas long names section.
- `xgas_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private Xgas variable.
- `xgas_error_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private Xgas error variable.
- `prior_profile` (optional): how to copy the a priori profile.
- `prior_profile_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private prior profile variable.
- `prior_xgas` (optional): how to copy the a priori column average.
- `prior_xgas_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private prior Xgas variable.
- `ak` (optional): how to copy the averaging kernels.
- `ak_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private AK variable.
- `slant_bin` (optional): how to find the slant Xgas bin variable needed to expand the AKs.
- `traceability_scale` (optional): how to find the variable containing the WMO or analogous scale for this data.
- `required` (optional): `true` by default; set it to `false` if you want to copy this Xgas if present, without it being an error if it is missing.
`prior_profile`, `prior_xgas`, `ak`, `slant_bin`, and `traceability_scale` can all be defined following the syntax in the ancillary variable specifications. Note that `slant_bin` is a special case in that it will only be used if the AKs are to be copied, but cannot be omitted in that case.
To illustrate the two main use cases for this section, here is an excerpt from the standard TCCON configuration:
[[xgas]]
xgas = "xluft"
gas = "luft"
gas_long = "dry air"
prior_profile = { type = "omit" }
prior_xgas = { type = "omit" }
ak = { type = "omit" }
[[xgas]]
xgas = "xco2_x2019"
gas = "co2"
prior_profile = { type = "specified", only_if_first = true, private_name = "prior_1co2", public_name = "prior_co2" }
prior_xgas = { type = "specified", only_if_first = true, private_name = "prior_xco2_x2019", public_name = "prior_xco2" }
ak = { type = "specified", only_if_first = true, private_name = "ak_xco2" }
slant_bin = { type = "specified", private_name = "ak_slant_xco2_bin" }
[[xgas]]
xgas = "xwco2_x2019"
gas = "co2"
prior_profile = { type = "specified", only_if_first = true, private_name = "prior_1co2", public_name = "prior_co2" }
prior_xgas = { type = "specified", only_if_first = true, private_name = "prior_xwco2_x2019", public_name = "prior_xco2" }
ak = { type = "specified", only_if_first = true, private_name = "ak_xwco2" }
slant_bin = { type = "specified", private_name = "ak_slant_xwco2_bin" }
First we have `xluft`. This variable would be discovered by the default rule; however, that rule will require prior information and AKs. The prior information is not useful for Xluft, so we want to avoid copying it to reduce the number of extraneous variables, and there are no AKs for Xluft. Thus we specify "omit" for each of these to tell the writer not to look for them. We do not have to tell it to omit `slant_bin`, because omitting the AKs implicitly skips that, and `traceability_scale` can be left as normal because there is an "aicf_xluft_scale" variable in the private netCDF files.
Second, we have `xco2_x2019` and `xwco2_x2019`. (We have omitted the x2007 variables and `xlco2_x2019` from the above example for brevity.) These would not be discovered by the default rule. Further, the mapping to their prior and AK variables is unique: all the CO2 Xgas variables can share the prior profiles and column averages, and each "flavor" of CO2 (regular, wCO2, or lCO2) can use the same AKs whether it is on the X2007 or X2019 scale. Thus, we not only define that these variables need to be copied, but also that we want to rename the prior variables to just "prior_co2" and "prior_xco2" and only copy them the first time we find them. We also ensure that the AKs and slant bins point to the correct variables.
If we wanted to set the `note` attribute of `xluft`, we could do that like so:
[[xgas]]
xgas = "xluft"
gas = "luft"
gas_long = "dry air"
prior_profile = { type = "omit" }
prior_xgas = { type = "omit" }
ak = { type = "omit" }
xgas_attr_overrides = { note = "This is a diagnostic quantity" }
Not all attributes can be overridden; some are set internally by the netCDF writer to ensure consistency. If an attribute is not getting the value you expect, first check the logging output from the netCDF writer for a warning that a particular attribute cannot be set.
Ancillary variable specifications
The ancillary variables (prior profile, prior Xgas, AK, slant bin, and traceability scale) can be defined as one of the following three types:
- `inferred`: indicates that this ancillary variable must be included and should not conflict with any other variable. The private and public variable names will be inferred from the Xgas and gas names. This type has two options:
  - `only_if_first`: a boolean (`false` by default) that, when set to `true`, will skip copying the relevant variable if a variable with the same public name is already in the public file. Note that the writer does not check that the existing variable's data are equal to what would be written for the new variable!
  - `required`: a boolean (`true` by default) that, when set to `false`, allows this variable to be missing from the private file. This is intended for Xgas discovery rules more than explicit Xgas definitions.
- `specified`: allows you to specify exactly which variable to copy with the `private_name` field. You can also give the `public_name` field to indicate what the variable name in the output file should be; if that is omitted, then the public variable will have the same name as the private variable. It is an error if the public variable already exists. This type also allows the `only_if_first` field, which behaves as it does for the `inferred` type.
- `omit`: indicates that this variable should not be copied.
In the following example, the prior Xgas field shows the use of the `inferred` options, the prior profile field shows the use of the `specified` options, and the AK field the use of `omit`.
[[xgas]]
xgas = "xhcl"
gas = "hcl"
prior_xgas = { type = "inferred", only_if_first = true, required = false }
prior_profile = { type = "specified", only_if_first = true, private_name = "prior_1hcl", public_name = "prior_hcl" }
ak = { type = "omit" }
Ancillary variable name inference
The writer uses the following rules when asked to infer ancillary variable names. In these, `{xgas_var}` indicates the Xgas variable name and `{gas}` the physical gas name.
- `prior_profile`: looks for a private variable named `prior_1{gas}` and writes to a variable named `prior_{gas}`.
- `prior_xgas`: looks for a private variable named `prior_{xgas_var}` and writes to the same variable.
- `ak`: looks for a private variable named `ak_{xgas_var}` and writes to the same variable.
- `slant_bin`: looks for a private variable named `ak_slant_{xgas_var}_bin`. This is not written; it is only used to expand the AKs to one-per-spectrum.
- `traceability_scale`: looks for a private variable named `aicf_{xgas_var}_scale`. The result is always written to the `wmo_or_analogous_scale` attribute of the Xgas variable; that cannot be altered by this configuration.
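For example, for an explicit entry with `xgas = "xwco2"` and `gas = "co2"`, inference would look for `prior_1co2` (written to the public file as `prior_co2`), `prior_xwco2`, `ak_xwco2`, and `ak_slant_xwco2_bin`, and would write the value of `aicf_xwco2_scale` to the `wmo_or_analogous_scale` attribute of `xwco2`.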
Gas proper names
Ideally, all Xgases should include their proper name in the `long_name` attribute, rather than just the abbreviation. This section allows you to map the formula (e.g., "co2") to the proper name (e.g., "carbon dioxide"), e.g.:
[gas_long_names]
co2 = "carbon dioxide"
ch4 = "methane"
co = "carbon monoxide"
Note that the keys are the gases, not Xgases. A default list is included if not turned off in the `[defaults]` section. See the source code for `DEFAULT_GAS_LONG_NAMES` for the current list. You can override any of those without turning off the defaults; e.g., setting `h2o = "dihydrogen monoxide"` in this section will replace the default of "water". Of course, when explicitly defining an Xgas to copy, you can write in the proper name as the `gas_long` value.
The `[gas_long_names]` section is most useful for automatically discovered Xgases, but it can also be useful when defining multiple Xgas variables that refer to the same physical gas, as the standard TCCON configuration does with CO2.
Global attributes
Which global attributes (i.e., attributes from the root group of the netCDF file) are copied to the public file is defined by the `[global_attributes]` section. This section contains two lists of strings:
- `must_copy` lists the names of attributes that must be available in the private netCDF file, or an error is raised.
- `copy_if_present` lists the names of attributes to copy if available in the private netCDF file, but not to raise an error for if missing.
An abbreviated example from the TCCON standard configuration is:
[global_attributes]
must_copy = [
"source",
"description",
"file_creation",
]
copy_if_present = [
"long_name",
"location",
]
At present, there is no way to manipulate attributes' values during the copying process, nor to add arbitrary attributes. In general, attributes should be added to the private netCDF file, then copied to the public file; this ensures that attributes are consistent between the two files. However, in the future we may add the ability to define some special cases.

The `history` attribute is a special case: it will always be created or appended to following the CF conventions, no matter what the configuration says. To avoid conflicts with this built-in behavior, do not specify `history` as an attribute to copy in the configuration file.
Including other configurations
The public netCDF writer configuration can extend other configuration files. The intended use is to define a base configuration with a set of variables that should always be included, then extend that for different specialized use cases. The TCCON standard configurations use this to reuse the standard configuration (including all of the auxiliary variables, normal computed variables, and Xgas values from the InGaAs detector) in the extended configuration (which adds the InSb or Si Xgas values).
To use another configuration file, use the `include` key:

```toml
include = ["base_config.toml"]
```
This is a list, so you can include multiple files:

```toml
include = ["base_config.toml", "extra_aux.toml"]
```
Currently, when relative paths are given as the values for `include`, as done here, they are interpreted as relative to the current working directory. However, you should not rely on that behavior; the intention is to adjust this so that relative paths are interpreted as relative to the configuration file in which they appear. If you have a global set of configuration files that use `include`, it is best for now to use absolute paths in their respective `include` keys.
How configurations are combined
Internally, `write_public_netcdf` uses the `figment` crate to combine the configurations. Specifically, it uses the "adjoin" conflict resolution strategy. This means that lists (like the explicit Xgas definitions) from each configuration will be concatenated, and scalar values will be taken from the first configuration that defines them. (The order in which the configurations are parsed is defined next.)
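As a sketch of how adjoin plays out, consider two hypothetical configuration files; the `[[xgas]]` keys follow the format shown earlier, but the file names and the combined result described in the comments are invented for illustration:

```toml
# first.toml (parsed first)
[[xgas]]
xgas = "xco2"
gas = "co2"

# second.toml (parsed second)
[[xgas]]
xgas = "xch4"
gas = "ch4"

# Combined configuration: the [[xgas]] list contains xco2 and then xch4,
# concatenated in parse order. If both files defined the same scalar key,
# the value from first.toml (the first to define it) would win.
```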
Order of inclusion
The `include` key is recursive, so if `file1.toml` includes `file2.toml`, and `file2.toml` includes `file3.toml`, then `file1.toml` will include the combination of `file2.toml` and `file3.toml`.
When files include more than one other file, the algorithm does a "depth-first" ordering. That is, if our top-level configuration has:

```toml
# top.toml
include = ["middle1.toml", "middle2.toml"]
```

`middle1.toml` has:

```toml
# middle1.toml
include = ["bottom1a.toml", "bottom1b.toml"]
```

and `middle2.toml` has:

```toml
# middle2.toml
include = ["bottom2a.toml", "bottom2b.toml"]
```
then the order in which the files are added to the configuration is:

1. `top.toml`
2. `middle1.toml`
3. `bottom1a.toml`
4. `bottom1b.toml`
5. `middle2.toml`
6. `bottom2a.toml`
7. `bottom2b.toml`
In other words, coupled with the "adjoin" behavior described above, this gives the precedence you would expect: settings in `top.toml` take precedence, then all settings in `middle1.toml` (whether they are defined in `middle1.toml` itself or one of its included files), and then `middle2.toml` (and its included files) comes last.
Although you can create complex hierarchies of configurations as shown in this example, doing so is generally not a good idea. The more complicated you try to make the set of included files, the more likely you are to end up with unexpected results - duplicated variables, wrong attributes, etc. If you find yourself using more than one layer of inclusion, you may be better off simply creating one large configuration file with the necessary parts of the other files copied into it.
Defaults
Unlike other sections, the `[defaults]` section does not define variables to copy; instead, it modifies how the other sections are filled in. If this section is omitted, each of the other sections will fall back to reasonable default values when not given. The following boolean options are available to change that behavior:
- `disable_all`: setting this to `true` will ensure that no defaults are added in any section.
- `aux_vars`: setting this to `false` will prevent the TCCON standard auxiliary variables from being added in the `aux` section.
- `gas_long_names`: setting this to `false` will prevent the standard mapping of chemical formulae to proper gas names from being added to `[gas_long_names]`.
- `xgas_rules`: setting this to `false` will prevent the standard list of patterns to match when looking for Xgas variables from being added to `[discovery.rules]`.
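For example, a configuration that keeps the standard auxiliary variables but supplies its own gas names and Xgas discovery rules might set (an illustrative combination):

```toml
[defaults]
gas_long_names = false
xgas_rules = false
```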
Debugging tips
Configuration parsing errors
If you are trying to use a custom configuration and the writer gives an error that it cannot "deserialize" the file, that means there is either a TOML syntax error or another typo. To narrow down where the problem is, incrementally comment out parts of the configuration file and run the writer with the `--check-config-only` flag. When the writer can parse the configuration, it will print an internal representation of it to the screen. Thus, once that starts working, whatever section you most recently commented out is the likely culprit.
Unexpected output (such as missing or duplicated variables)
If variables seem to be missing from the output, the same variable is written twice, or you hit other problems when using a custom configuration, first run the writer with the `--check-config-only` flag and carefully examine the printed parsed version of the configuration. This can help you check whether the configuration is being interpreted as you intended, especially when using the include feature.
Checking on Xgas discovery
If variables are not being copied correctly, increase the verbosity of `write_public_netcdf` by adding `-v` or `-vv` to the command line. The first will activate debug output, which includes a lot of information about Xgas discovery. `-vv` will also activate trace-level logging, which outputs even more information about the configuration as the program reads it.
Site metadata file
`write_public_netcdf` can take the data latency/release lag from a TOML file that specifies site metadata. There is an example for the standard TCCON sites in the repo. This file must have top-level keys that match sites' two-character IDs, with each key containing a number of metadata values. For example:
```toml
[pa]
long_name = "parkfalls01"
release_lag = 120
location = "Park Falls, Wisconsin, USA"
contact = "Paul Wennberg <wennberg@gps.caltech.edu>"
data_revision = "R1"
data_doi = "10.14291/tccon.ggg2014.parkfalls01.R1"
data_reference = "Wennberg, P. O., C. Roehl, D. Wunch, G. C. Toon, J.-F. Blavier, R. Washenfelder, G. Keppel-Aleks, N. Allen, J. Ayers. 2017. TCCON data from Park Falls, Wisconsin, USA, Release GGG2014R1. TCCON data archive, hosted by CaltechDATA, California Institute of Technology, Pasadena, CA, U.S.A. http://doi.org/10.14291/tccon.ggg2014.parkfalls01.R1"
site_reference = "Washenfelder, R. A., G. C. Toon, J.-F. L. Blavier, Z. Yang, N. T. Allen, P. O. Wennberg, S. A. Vay, D. M. Matross, and B. C. Daube (2006), Carbon dioxide column abundances at the Wisconsin Tall Tower site, Journal of Geophysical Research, 111(D22), 1-11, doi:10.1029/2006JD007154. Available from: https://www.agu.org/pubs/crossref/2006/2006JD007154.shtml"

[oc]
long_name = "lamont01"
release_lag = 120
location = "Lamont, Oklahoma, USA"
contact = "Paul Wennberg <wennberg@gps.caltech.edu>"
data_revision = "R1"
data_doi = "10.14291/tccon.ggg2014.lamont01.R1/1255070"
data_reference = "Wennberg, P. O., D. Wunch, C. Roehl, J.-F. Blavier, G. C. Toon, N. Allen, P. Dowell, K. Teske, C. Martin, J. Martin. 2017. TCCON data from Lamont, Oklahoma, USA, Release GGG2014R1. TCCON data archive, hosted by CaltechDATA, California Institute of Technology, Pasadena, CA, U.S.A. https://doi.org/10.14291/tccon.ggg2014.lamont01.R1/1255070"
site_reference = ""
```
Although the public netCDF writer only uses `release_lag`, each site must contain the following keys for this file to be valid:

- `long_name`: the site's human-readable location name followed by a two-digit number indicating which instrument at that site this is.
- `release_lag`: an integer >= 0 specifying how many days after acquisition data should be kept private. TCCON sites are not permitted to set this above 366, as data delivery to the public archive within one year is a network requirement.
- `location`: a description of the location where the instrument resides. This is usually "city, state/province/country", but can include things such as the institution if desired.
- `contact`: the name and email address, formatted as `NAME <EMAIL>`, of the person users should contact with questions or concerns about this site's data.
- `data_revision`: an "R" followed by a number >= 0 indicating which iteration of GGG2020 reprocessing this data represents. This should be incremented whenever previously public data is reprocessed to fix an error.
The following keys may be provided, but are not required:

- `data_doi`: a digital object identifier that points to the public data for this site. This should be included if possible; it is optional only so that public files can be created before the first time a DOI is minted.
- `data_reference`: a reference to a persistent repository where the data can be downloaded. For TCCON sites, this will be CaltechData. For other instruments, this may vary for now.
- `site_reference`: a reference to a publication describing the site location itself (as opposed to the data).
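Putting the required keys together, a minimal valid entry might look like the following; the site ID and all values here are invented for illustration:

```toml
[xx]
long_name = "example01"
release_lag = 30
location = "Example City, Country"
contact = "Jane Doe <jdoe@example.edu>"
data_revision = "R0"
```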
TCCON sites can find the most up-to-date versions of their values for this metadata at https://tccondata.org/metadata/siteinfo/. Other users should do their best to ensure that the above conventions are followed.