Overview
GGG-RS is an extension to the GGG retrieval that provides updated versions of several of the post-processing programs. The long-term intention is for it also to become a library of functions useful for working with GGG-related files.
Installation
At present, installing GGG-RS requires that you be able to build it from source. This requires, at a minimum, a Rust toolchain. If you wish to compile the programs that work with netCDF files, you will also need either
- one of the `micromamba`, `mamba`, or `conda` package managers, or
- the `cmake` build tool.
These are necessary to install or build the netCDF and HDF5 C libraries. Detailed instructions and installation options are provided in the README.
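If you use one of the conda-family package managers, the setup might look roughly like the following sketch. The environment and package names here are illustrative assumptions; the README is the authoritative reference:

```sh
# Create an environment providing the netCDF and HDF5 C libraries
# (package names are assumptions; check the README for the exact ones).
micromamba create -n gggrs-build -c conda-forge libnetcdf hdf5

# Then, with the environment activated, build from the repository root.
micromamba activate gggrs-build
cargo build --release
```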
Documentation
This book primarily focuses on the command line programs provided by GGG-RS. As the library is made available, the APIs will be documented through docs.rs (for the Rust library) and readthedocs.io (for the Python library).
Each command line program will provide help when given the `-h` or `--help` flags. That help should be your first resource to understand how the programs work, as it will be the most up to date. The chapters in this book go into more detail about advanced usage of the programs. If you find something in this book that is out of date or unclear, please open an issue or pull request with the `documentation` tag.
Contributing
The best documentation is the result of collaboration between users and developers with different perspectives. As developers, it is easy for us to assume users have certain background knowledge that, in reality, we only have because we are so immersed in this codebase. In general, if you see something that is unclear, missing, or out of date in this book, please open an issue or, even better, a pull request on the GGG-RS repo, with the "documentation" tag.
Contribution criteria
Requests to update the documentation will be evaluated on several criteria, primarily:
- accuracy (does the change correctly reflect the behavior of the program?),
- clarity (is the change easy to understand and as concise as possible?), and
- scope (does the change add information that should be included in that location, or is it better suited to another source?)
When opening an issue, providing as much information as possible will help it be resolved quickly, as will responding promptly when asked for more input. Likewise, when opening a pull request, providing a clear description of what was changed and why will help us review it efficiently. If you are providing a pull request, please verify that the edits render correctly by following the instructions in Building the book, below.
We reserve the right to turn down requests for changes if, in our opinion, they make the documentation worse, or if the requestor does not provide sufficient information to make the needed change clear. Well explained and respectful requests will usually be accepted.
Common types of questions
Below are some common questions, along with details on what information to provide when asking for an update, to help us resolve the problem efficiently.
The information I need is not where I expected
When opening an issue, be clear about what information you are trying to find, where you expected to find it in the documentation, and why you were looking there. Understanding how you expect the information to be organized helps us examine if there might be other ways we need to connect different parts of the documentation. Generally, the best fix for this problem is to identify where a link between parts of the documentation will help guide people to the correct page. Other solutions may be appropriate in more complicated cases.
A program is not included in the book
First, please check that it is one of the GGG-RS programs. This means it will have a folder under the `src/bin` subdirectory of the repository. The GGG-RS programs will coexist with regular GGG and EGI-RS programs in `$GGGPATH/bin/`, so just because a program exists there does not mean it will be documented here.
If a program really is missing, then either open an issue or create a pull request that adds it. If creating a pull request, please match the structure of the existing programs' documentation.
I do not understand what the documentation is trying to explain
When you encounter something that is not clear, please first try to figure it out yourself by following any examples and testing things out. If that section of the documentation links to external resources (e.g., the TOML format), please review those resources as well.
If something truly is unclear, then open an issue and do your best to explain what you were trying to accomplish and what you found difficult to understand. Explaining the overall task you were trying to accomplish, as well as the part of the documentation that you found unclear, will help us identify if this is an XY problem, where the reason it was unclear is because there is a better solution to your task than the one you were trying to use.
Building the book
If you want to edit this book, it lives in the `book` subdirectory of the GGG-RS repo. The source is in Markdown, and can be edited with any plain text editor (e.g., Notepad, TextEdit, vim, nano) or code editor (e.g., VSCode). It is built with `mdbook` with the additional `mdbook-admonish` preprocessor.
To check that the book renders correctly when making edits:

- Install both `mdbook` and `mdbook-admonish` following the instructions in their repositories.
- In the `book` subdirectory of the GGG-RS repository (which you will have cloned to your computer), run `mdbook serve`.
- Copy the "localhost" link it prints and paste it into any web browser.

You will see the rendered local copy of the book, and it will automatically update each time you make changes.
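In practice, the steps above might look like this (a sketch; see the `mdbook` and `mdbook-admonish` repositories for the authoritative installation instructions):

```sh
cargo install mdbook mdbook-admonish  # install both tools from crates.io
cd book                               # from the root of your GGG-RS clone
mdbook serve                          # prints a localhost URL to open in a browser
```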
Please do this before submitting any pull requests, as it will slow things down significantly if we have to iterate multiple times with you to ensure the book builds correctly.
Setup helper programs
change_ggg_files
list_spectra
Post processing programs
Post processing programs are used after completing GFIT runs on all windows. These programs perform a combination of collating and averaging data from the different windows and applying the necessary post hoc corrections to produce the best quality data.
We are currently in a transitional period, where the post processing programs are still provided in a mix of languages. Some remain the original Fortran versions from GGG2020, others have been replaced with Rust versions, and the private netCDF writer remains written in Python. The intention is to transition away from the Python netCDF writer in GGG2020.2, but to retain a mix of Fortran and Rust programs at that time. Whether all post processing programs transition to Rust depends on whether there is a need for more flexibility in all programs.
EM27/SUN users
Those who use GGG to process EM27/SUN data must be aware that EGI (the wrapper program to streamline processing of EM27/SUN data with GGG) is also in a transitional phase. The original EGI does not use GGG-RS programs; instead, it patches some of the existing GGG Fortran programs to work with EM27/SUN data and works around some limitations of the Fortran post processing code by swapping out some of the configuration files on disk. This works, but is inconvenient when you need to process both TCCON and EM27/SUN data.
A full rewrite of EGI, EGI-RS, is in progress. This is intended to be easier to maintain and more modular, allowing smaller parts of the GGG workflow to be run independently with EGI-RS automation as needed. EGI-RS does use the programs provided by GGG-RS, and in fact relies on several of them to simplify switching between TCCON and EM27/SUN configurations. Throughout this section, the EM27/SUN standard use sections refer to EGI-RS use. Those still using the original EGI should be aware that while the role of each program in the EM27/SUN post processing is the same as that of its original Fortran predecessor, the specifics of how it is run may differ.
collate_tccon_results
Purpose
`collate_tccon_results` combines output from the various `.col` files in a GGG run directory with ancillary data from the runlog and `.ray` file into a single file. It also computes retrieved quantities from the `.col` files if needed. Which `.col` files are read is determined by a `multiggg.sh` file, which is expected to contain calls to `gfit` (one per line) for each window processed. An example of the first few lines of a `multiggg.sh` file is:
/home/user/ggg/bin/gfit luft_6146.pa_ggg_benchmark.ggg>/dev/null
/home/user/ggg/bin/gfit hf_4038.pa_ggg_benchmark.ggg>/dev/null
/home/user/ggg/bin/gfit h2o_4565.pa_ggg_benchmark.ggg>/dev/null
/home/user/ggg/bin/gfit h2o_4570.pa_ggg_benchmark.ggg>/dev/null
Unlike the standard `collate_results`, `collate_tccon_results` does not rely on the ZPD times of spectra to determine whether successive spectra in the runlog (from different detectors) should have their outputs grouped into a single line in the output file. Instead, it uses the spectrum names.
Examples
The most common way to run this is from inside a GGG run directory (i.e., a directory containing the `multiggg.sh`, `.ggg`, `.mav`, and `.ray` files created by `gsetup`). In that case, you will call it with a single positional argument: `v` to create a file containing vertical column densities, or `t` to create one containing VMR scale factors. The output file will have the same name as the runlog pointed to in the `.ggg` and `.col` files, with the extension `.vsw` or `.tsw`:
$GGGPATH/bin/collate_tccon_results v
If you need to run this program from outside of a GGG run directory, you can use the `--multiggg-file` option to point to the `multiggg.sh` file to read windows from. In this case, the output will be written to the same directory as the `multiggg.sh` file:
$GGGPATH/bin/collate_tccon_results --multiggg-file /data/ggg/xx20250101_20250301/multiggg.sh
This program relies on being able to determine a "primary" detector in order to know which spectra represent a new observation. If you have a nonstandard setup that does not use "a" as the character in the spectrum name to represent the InGaAs detector, you can use the `--primary-detector` option to specify a different character. For example, if it should look for spectra with "g" as the detector indicator:
$GGGPATH/bin/collate_tccon_results --primary-detector g v
Use in TCCON standard processing
Most users will use this as part of running the `post_processing.sh` script to create the initial `.vsw` and `.tsw` files. However, in any case, it will always be the first program run after GFIT, as the other post processing programs need a post-processed file as input (i.e., they do not read from the `.col` files).
Use in EM27/SUN standard processing
`collate_tccon_results` is used identically when processing EM27/SUN data as when processing TCCON data. Unlike the Fortran version of `collate_results`, it does not need to be adapted to account for the shorter time between successive observations.
apply_tccon_airmass_correction
Purpose
`apply_tccon_airmass_correction` does two things:
- Converts gas column densities (in molecules per area) to column averages by dividing by the O2 column and multiplying by the mean O2 atmospheric dry mole fraction.
- Applies a solar zenith angle-dependent correction to those column averages that require it, as defined by a configuration file.
This can operate either on individual windows' column densities or on column densities calculated by averaging together all the windows for a given gas. The former is considered to be more accurate, as it allows for different airmass dependence per window (which depends on the spectroscopy in that window), but the latter is preserved for backwards compatibility.
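In equation form, the first step computes the column average as

$$ X_\mathrm{gas} = f_{\mathrm{O}_2} \cdot \frac{\mathrm{column}_\mathrm{gas}}{\mathrm{column}_{\mathrm{O}_2}} $$

where the columns are vertical column densities and \\( f_{\mathrm{O}_2} \approx 0.2095 \\) is the mean dry-air mole fraction of O2 used in the TCCON literature (quoted here for illustration; the program's internal constant is authoritative).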
Examples
This program requires two arguments: a path to a file defining the airmass corrections and a path to either a `.vsw` file (created by `collate_tccon_results`) or a `.vav` file (created by `average_results`):
$GGGPATH/bin/apply_tccon_airmass_correction CORRECTION_FILE VSW_OR_VAV_FILE
The `CORRECTION_FILE` will usually be one of those supplied with GGG, in the `$GGGPATH/tccon` subdirectory. See the configuration section for the details of this file's format if you need to modify one or create your own.
Use in TCCON standard processing
For TCCON standard processing, the `CORRECTION_FILE` must be `$GGGPATH/tccon/corrections_airmass_preavg.dat`, as these are the airmass correction factors derived for the required TCCON data version. It must be run on the `.vsw` file output by `collate_tccon_results`, as the TCCON standard processing uses per-window airmass corrections. This is automatically configured in the `post_processing.sh` file; therefore, standard users should rely on the `post_processing.sh` script to run the required post processing steps in the correct order.
Use in EM27/SUN standard processing
As of GGG2020, EM27/SUNs still use per-gas airmass corrections, rather than per-window corrections. Therefore, this program must be run on the `.vav` file output by `average_results`, using the EM27/SUN-specific airmass corrections included with EGI-RS. If you are using EGI-RS correctly, it will automatically create a `post_processing.sh` file with the correct post processing order for an EM27/SUN, so normal users should rely on that.
Airmass correction file format
For backwards compatibility, this file is in the typical GGG input file format: a line specifying the "shape" of the file, one or more header rows, then tabular data. These files come in two forms. The first defines airmass corrections for each window, and includes not only the magnitude of the correction but also two additional parameters that adjust the form of the equation used for the correction.
Per-window format
The GGG2020 TCCON standard file in this format is as follows:
15 5
2017-02-16 GCT
2015-08-11 DW
2019-01-14 JLL
2020-12-04 JLL: extrapolated to Xluft = 0.999
2021-02-22 JLL: fit for mid-trop PT = 310 K
2021-07-15 JLL: uncertainties added from 2-sigma std dev of bootstrapped PT = 310 K fits
Contains airmass-dependent correction factors to be applied to the
column-averaged mole fractions. These are determined offline from the
symmetric component of the diurnal variation using derive_airmass_correction.
The ADCF_Err should be the 1-sigma standard deviations which represent day-to-
day variability. This vastly overestimates the uncertainty in the average
value, however the standard error underestimates the uncertainty.
g and p are the zero-SZA and exponent in the ADCF form.
Gas ADCF ADCF_Err g p
"xco2_6220" -0.00903 0.00025 15 4
"xco2_6339" -0.00512 0.00025 45 5
"xlco2_4852" 0.00008 0.00018 -45 1
"xwco2_6073" -0.00235 0.00016 -45 1
"xwco2_6500" -0.00970 0.00026 45 5
"xch4_5938" -0.00971 0.00046 25 4
"xch4_6002" -0.00602 0.00053 -5 2
"xch4_6076" -0.00594 0.00044 15 3
"xn2o_4395" 0.00523 0.00054 -5 2
"xn2o_4430" 0.00426 0.00042 13 3
"xn2o_4719" -0.00267 0.00056 -15 2
"xco_4233" 0.00000 0.00000 13 3
"xco_4290" 0.00000 0.00000 13 3
"xluft_6146" 0.00053 0.00017 -45 1
The components are as follows:
- The first line specifies the number of header lines and the number of data columns. This must be two integers separated by whitespace. The number of header lines includes this line and the column headers.
- The next `nhead-2` lines (line numbers 2 to 14 in this case) are free format; these are skipped by the program. You can see in the example that these are used to record the history of the file and notes about the content of the file.
- The last header line (line number 15 in this case) gives the column names; it must include the five columns shown here.
A common error is to add lines to the header without updating the number of header lines on the first line. If you get an error running `apply_tccon_airmass_correction` after editing the correction file's header, double check that you also updated the number of header lines!
The data are as follows:
- "Gas" is the Xgas window name that the correction defined on this line applies to.
It must be a string that matches a non-error column in the input
.vsw
file with "x" prepended. As this is read in as list directed format data, it is recommended to quote the strings. - "ADCF" is the airmass dependent correction factor, it determines the magnitude of the airmass correction.
- "ADCF_Err" is the uncertainty on the ADCF.
- "g" and "p" are parameters in the airmass correction equation.
Deriving the correction parameters is a complicated process. For details, along with the definition of the airmass correction equation, please see section 8.1 of the GGG2020 paper.
Per-gas format
The second format of the airmass correction file is as follows:
13 3
2017-02-16 GCT
2015-08-11 DW
Contains airmass-dependent and airmass-independent (in situ)
correction factors to be applied to the column-averaged mole fractions.
The former (ADCF) is determined offline from the symmetric component
of the diurnal variation using derive_airmass_correction.
The ADCF_Err are the 1-sigma standard deviations which represent day-to-
day variability. This vastly overestimates the uncertainty in the average
value, however the standard error underestimates the uncertainty.
The latter (AICF) is determined offline by comparisons with in situ profiles.
AICF_Err (uncertainties) are 1-sigma standard deviations from the best fit.
Gas ADCF ADCF_Err
"xco2" -0.0049 0.0009
"xch4" -0.0045 0.0005
"xn2o" 0.0133 0.0001
"xco" 0.0000 0.0001
"xh2o" -0.0000 0.0001
"xluft" 0.0027 0.0005
This is a simplified version of the per-window format above. As above, the first line defines the number of header lines and data columns. This file must have three data columns: "Gas", "ADCF", and "ADCF_Err". These have the same meanings as in the per-window format. The "g" and "p" columns can be omitted, as shown here.
apply_tccon_insitu_correction
Purpose
`apply_tccon_insitu_correction` has a single purpose: to apply a scalar scale factor, as a divisor, to specific column-average quantities. This is typically used to ensure that these quantities are tied to the same metrological scale as comparable in situ data.
Examples
This program requires two arguments: a path to a file defining the scaling corrections and a path to a `.vav.ada` file, created by `apply_tccon_airmass_correction`:
$GGGPATH/bin/apply_tccon_insitu_correction CORRECTION_FILE VAV_ADA_FILE
The `CORRECTION_FILE` will usually be one of those supplied with GGG, in the `$GGGPATH/tccon` subdirectory. See the configuration section for the details of this file's format if you need to modify one or create your own.
Use in TCCON standard processing
For TCCON standard processing, the `CORRECTION_FILE` must be `$GGGPATH/tccon/corrections_insitu_postavg.dat`, as these are the in situ correction factors derived for the required TCCON data version. It must be run on the `.vav.ada` file output by `apply_tccon_airmass_correction`. This is automatically configured in the `post_processing.sh` file; therefore, standard users should rely on the `post_processing.sh` script to run the required post processing steps in the correct order.
Use in EM27/SUN standard processing
EM27/SUNs have their own in situ correction factors. These correction factors are provided with EGI-RS and automatically added to the `$GGGPATH/tccon` directory when running the `em27-init` program included with EGI-RS. If you are using EGI-RS correctly, it will automatically create a `post_processing.sh` file with the correct post processing order for an EM27/SUN, so normal users should rely on that.
In situ correction file format
For backwards compatibility, this file is in the typical GGG input file format: a line specifying the "shape" of the file, one or more header rows, then tabular data. The standard TCCON GGG2020 file is shown here as an example:
19 4
2017-02-16 GCT
2015-08-11 DW
2021-07-26 JLL
This file contains airmass-independent correction factors (AICFs) determined
offline by comparison against in situ data. CO2, wCO2, lCO2, CH4, and H2O use AICFs
determined as the weighted mean ratio of TCCON to in situ Xgas values. N2O
uses the ratio of TCCON to surface-derived in situ XN2O fit to a mid-tropospheric
potential temperature of 310 K, ensuring that it is fixed to the same temperature
as the ADCFs. CO, H2O, and Luft are not corrected.
For CO2, wCO2, lCO2, CH4, H2O, and CO the AICF_Err (uncertainties) are 2-sigma standard
deviations of bootstrapped mean ratios. For N2O, the error equals the fit vs. potential
temperature multiplied by twice the standard deviation of potential temperatures
experienced by the TCCON network.
The WMO_Scale column gives the WMO scale from the in situ data that the scale
factor ties to. There must be something in this column and must be quoted;
use "N/A" for gases with no WMO scaling. NB: although CO has no WMO scale (because it
is not scaled), the uncertainty was determined from the measurements on the WMO X2014A scale.
Gas AICF AICF_Err WMO_Scale
"xco2" 1.0101 0.0005 "WMO CO2 X2007"
"xwco2" 1.0008 0.0005 "WMO CO2 X2007"
"xlco2" 1.0014 0.0007 "WMO CO2 X2007"
"xch4" 1.0031 0.0014 "WMO CH4 X2004"
"xn2o" 0.9821 0.0098 "NOAA 2006A"
"xco" 1.0000 0.0526 "N/A"
"xh2o" 0.9883 0.0157 "ARM Radiosondes (Lamont+Darwin)"
"xluft" 1.0000 0.0000 "N/A"
The components are as follows:
- The first line specifies the number of header lines and the number of data columns. This must be two integers separated by whitespace. The number of header lines includes this line and the column headers.
- The next `nhead-2` lines (line numbers 2 to 18 in this case) are free format; these are skipped by the program. You can see in the example that these are used to record the history of the file and notes about the content of the file.
- The last header line (line number 19 in this case) gives the column names; it must include the four columns shown here.
A common error is to add lines to the header without updating the number of header lines on the first line. If you get an error running `apply_tccon_insitu_correction` after editing the correction file's header, double check that you also updated the number of header lines!
The data are as follows:
- "Gas" refers to the column in the
.vav.ada
file that the correction applies to (along with the associated error, i.e., "xco2" will apply to both the "xco2" and "xco2_error" columns). - "AICF" is the scaling factor; the Xgas and error will be divided by this.
- "AICF_Err" is the uncertainty on the scaling factor.
- "WMO_Scale" is the metrological scale or other reference to which the scale factor ties the These must be quoted strings.
add_nc_flags
Purpose
`add_nc_flags` is a supplemental program that can be used to add extra flags to a private netCDF file. It is intended to complement the default flagging done by `write_netcdf`, which reflects both the permitted value ranges defined in your site's `??_qc.dat` and `??_manual_flagging.dat` files (found under `$GGGPATH/tccon`). In particular, it is useful if you need to include more complex logic, such as only flagging based on a combination of variables.
Examples
Quick flagging
The `quick` subcommand allows you to specify the filter criteria based on a single variable via the command line. This first example will flag any data where the residual in the O2 window is above 0.5, and will modify the existing netCDF file:
$GGGPATH/bin/add_nc_flags quick \
--in-place \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--nc-file PRIVATE_NC_FILE
Note that we have separated the command into multiple lines solely for readability; you can write this as a single line.
If instead you want to create a copy rather than modify the existing netCDF file, use the `--output` flag:
$GGGPATH/bin/add_nc_flags quick \
--output NEW_NC_FILE \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--nc-file PRIVATE_NC_FILE
This will not create `NEW_NC_FILE` if no data needed to be flagged. If you want to enforce that a new file is always created, add the `--always-copy` flag:
$GGGPATH/bin/add_nc_flags quick \
--output NEW_NC_FILE \
--always-copy \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--nc-file PRIVATE_NC_FILE
If you want to limit the flags to a specific time period, use `--time-less-than` and `--time-greater-than`. Note that the values must be given in UTC:
$GGGPATH/bin/add_nc_flags quick \
--output NEW_NC_FILE \
--filter-var o2_7885_rmsocl \
--greater-than 0.5 \
--time-greater-than 2025-04-01T12:00 \
--time-less-than 2025-05-01 \
--nc-file PRIVATE_NC_FILE
This will only apply flags to data with a ZPD time greater than or equal to 12:00Z on 1 Apr 2025 and less than or equal to 00:00Z on 1 May 2025. Note that in the less-than argument we omit the hour and minute.
There are many more options; see the command line help for a full list.
TOML-based flagging
For more complicated flagging, you can define your flagging criteria in a TOML file. You can create an example file with the `toml-template` subcommand:
$GGGPATH/bin/add_nc_flags toml-template example.toml
This will create `example.toml` in the current directory. Once you have defined your filters, you apply this file with the `toml` subcommand. As with the `quick` subcommand, flags can be applied to the existing file:
$GGGPATH/bin/add_nc_flags toml TOML_FILE --in-place --nc-file PRIVATE_NC_FILE
or to a copy of it:
$GGGPATH/bin/add_nc_flags toml TOML_FILE --output NEW_NC_FILE --nc-file PRIVATE_NC_FILE
For details on the TOML file settings, see the following section.
Use in TCCON standard processing
This program is not used by default in TCCON post processing (that is, it will not be included in the `post_processing.sh` script).
Users are welcome to use it separately to flag out data with known problems from the private netCDF files before uploading to Caltech.
Use in EM27/SUN standard processing
EGI-RS will include a line in the `post_processing.sh` script to run this program on the private netCDF file.
The intention is for users to add extra data checks to deal with EM27/SUN-specific issues that may affect the data.
Additionally, EGI-RS may add certain required filters in the future as the use of GGG for EM27/SUN retrievals matures.
`add_nc_flags` TOML filter configuration
Filters and groups
The TOML files used to define flags for `add_nc_flags` use the terms "group" and "filter" as follows:
- A "filter" defines an allowed range for a single variable. It may specify an upper or lower limit, an allowed range, or an excluded range.
- A "group" consists of one or more filters.
An observation in the netCDF file will be flagged if any of the groups defined in the TOML file indicate that it should be. A group will indicate an observation should be flagged if all of the filters in that group indicate it should be flagged.
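As a sketch of that logic, the flag decision is an "or of ands". The types below are hypothetical stand-ins, not the actual GGG-RS implementation:

```rust
// Hypothetical stand-ins for the parsed TOML structures.
struct Observation; // one row of netCDF variable values
struct Filter;      // one variable range check

struct Group {
    filters: Vec<Filter>,
}

impl Filter {
    fn matches(&self, _obs: &Observation) -> bool {
        // The real program compares the filter_var value against
        // greater_than/less_than, honoring value_mode.
        true
    }
}

/// An observation is flagged if ANY group matches, and a group
/// matches only if ALL of its filters match.
fn should_flag(obs: &Observation, groups: &[Group]) -> bool {
    groups.iter().any(|g| g.filters.iter().all(|f| f.matches(obs)))
}
```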
The first example shows how you would define a TOML file that duplicates the filter we used in the quick filter example:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_rmsocl"
greater_than = 0.5
This defines a single check: if the value of the `o2_7885_rmsocl` variable in the netCDF file is >= 0.5, that observation will be flagged.
Now suppose that we wanted to flag observations only if `o2_7885_rmsocl >= 0.5` and `o2_7885_cl < 0.05`, perhaps to remove observations where our instrument was not tracking the sun well. To do this, we add a second filter to this group like so:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_rmsocl"
greater_than = 0.5
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
Now, because these both come under the same `[[groups]]` heading, observations will only be flagged if both conditions are true.
What if we wanted an "or", that is, to flag if either one of two (or more) conditions is true? That requires multiple groups:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_rmsocl"
greater_than = 0.5
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_sg"
greater_than = 0.1
Because each filter comes after its own `[[groups]]` header, they fall in separate groups. If either group has all of its filters return true, then the observation will be flagged. In this case, that means that an observation with `o2_7885_rmsocl >= 0.5` or `o2_7885_sg >= 0.1` will be flagged.
What if we wanted to flag an observation with an `o2_7885_sg` value too far from zero in either direction? We can do that with the `value_mode` key, like so:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_sg"
greater_than = 0.1
less_than = -0.1
value_mode = "outside"
This will cause an observation to be flagged if `o2_7885_sg <= -0.1` or `o2_7885_sg >= +0.1`. If we do not specify `value_mode`, the default is "inside", which in this case would flag if `-0.1 <= o2_7885_sg <= +0.1`.
Limiting to times
The TOML file allows you to specify that it should only apply to a specific time frame with the `[timespan]` section. This section allows three keys: `time_less_than`, `time_greater_than`, and `time_mode`.
For example, perhaps you wish to filter on continuum level only between two times when you know your instrument was not tracking the sun correctly. You could do so with:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
[timespan]
time_greater_than = "2025-01-01T00:00:00"
time_less_than = "2025-05-01T00:00:00"
Note that the times must be in UTC and given in the full "yyyy-mm-ddTHH:MM:SS" format; unlike the `quick` command line options, you cannot truncate these to just "yyyy-mm-dd" or "yyyy-mm-ddTHH:MM".
`time_mode`, similar to `value_mode` in the filters, allows you to flag only observations outside of the given time range, rather than inside it:
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
[timespan]
time_greater_than = "2025-01-01T00:00:00"
time_less_than = "2025-05-01T00:00:00"
time_mode = "outside"
This will apply the filter to any data before 1 Jan 2025 and after 1 May 2025, whereas the previous example would apply to data between those two dates.
Changing the flag
Flag value
By default, when `add_nc_flags` applies a flag, it does so by adding 9000 to the value of the `flag` variable for that observation. This preserves any of the standard flags from variables defined in the `??_qc.dat` file. By TCCON convention, a 9 in the thousands place of the flag represents a generic "other" manual flag.
If you wish to use one of the other values to distinguish the nature of the problem, you can define a `[flags]` section:
[flags]
flag = 8
# The configuration must always include a [[groups]] entry with at least one filter.
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
This will add 8000 to the flag instead of 9000; i.e., it will put an 8 in the thousands place. The value of `flag` must be between 1 and 9, since it must fit into the thousands place. Some values have existing meanings. Currently these are defined in a JSON file bundled with the private netCDF writer. When the private netCDF writer is incorporated into GGG-RS, that definition file will be moved into this repository.
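For illustration, the arithmetic amounts to setting the thousands digit while leaving the other digits alone. This is a minimal sketch, not the actual implementation:

```rust
/// Set the thousands place of `flag` to `digit` (1-9), preserving the
/// hundreds-and-below digits (standard flags) and the ten-thousands
/// place (release flags).
fn set_thousands_place(flag: i32, digit: i32) -> i32 {
    assert!((1..=9).contains(&digit));
    flag / 10_000 * 10_000 + digit * 1_000 + flag % 1_000
}

fn main() {
    assert_eq!(set_thousands_place(23, 9), 9_023); // default behavior
    assert_eq!(set_thousands_place(23, 8), 8_023); // with flag = 8
}
```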
Behavior for existing flags
We can also adjust how `add_nc_flags` behaves if there is already a value in the thousands place. By default, it will error. We can change this by setting the `existing_flags` key in the `[flags]` section. For example, to keep existing values in the thousands place of the flag (which would be set by either the time periods defined in your `$GGGPATH/tccon/??_manual_flagging.dat` or a previous run of `add_nc_flags`):
[flags]
flag = 8
existing_flags = "skip"
# The configuration must always include a [[groups]] entry with at least one filter.
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
The possible (case insensitive) values for `existing_flags` are:
- `"error"` (default) - error if any of the observations to be flagged already have a non-zero value in the flag's thousands place.
- `"skip"` - if an observation to be flagged already has a value in the flag's thousands place, leave the existing value.
- `"skipeq"` - if the value in the thousands place is 0, it will be replaced; otherwise `add_nc_flags` will error unless the value matches what it would insert.
- `"overwrite"` - replace any existing value in the flag's thousands place.
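These semantics can be summarized in a small sketch (hypothetical code; the real program's types and error handling differ):

```rust
/// Resolve the new thousands-place digit given the `existing` digit and
/// the configured `new` digit; `Ok(None)` means leave the flag unchanged.
fn resolve_existing(mode: &str, existing: u32, new: u32) -> Result<Option<u32>, String> {
    match mode.to_lowercase().as_str() {
        "error" if existing != 0 => Err("flag already set in thousands place".into()),
        "error" => Ok(Some(new)),
        "skip" if existing != 0 => Ok(None), // keep the existing value
        "skip" => Ok(Some(new)),
        "skipeq" if existing == 0 || existing == new => Ok(Some(new)),
        "skipeq" => Err("existing flag differs from the configured value".into()),
        "overwrite" => Ok(Some(new)),
        other => Err(format!("unknown existing_flags value: {other}")),
    }
}
```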
Flag type
The default behavior, as mentioned above, is to modify the thousands place of the flag for observations to be flagged. `add_nc_flags` can also edit the ten thousands place, which is used for release flags. To do so, set `flag_type` to `"release"` in the `[flags]` section:
[flags]
flag = 8
flag_type = "release"
# The configuration must always include a [[groups]] entry with at least one filter.
[[groups]]
[[groups.filters]]
filter_var = "o2_7885_cl"
less_than = 0.05
Release flags are intended to be set by Caltech personnel based on input from the reviewers during data QA/QC. If you use `add_nc_flags` to set release flags in other circumstances, this can lead to significant confusion when trying to establish the provenance of certain flags. Please do not use `flag_type = "release"` unless you have received specific guidance to do so!
write_public_netcdf
Purpose
`write_public_netcdf` converts the private (a.k.a. engineering) netCDF files into the smaller, more user-friendly files distributed to most TCCON users. This includes:
- limiting the variables to the most useful,
- removing `flag != 0` data,
- optionally limiting the data in the file based on the desired data latency, and
- expanding the averaging kernels and prior profiles to be one-per-spectrum.
If you previously used the Python netCDF writer, you may be used to it defaulting to respecting a data latency (a.k.a. release lag) defined in a site information JSON file. This version of the netCDF writer defaults to no data latency; that is, it assumes that you want to include all data from the given private file in the new public file. See the examples below for how to apply a data latency to withhold the newest data from the public file.
Examples
The simplest use will convert the `PRIVATE_NC_FILE` into a public format file. This assumes that the `PRIVATE_NC_FILE` filename begins with the two-character site ID for your site. The public file will be in the same directory as the private file, and its name will reflect the site ID and the date range of the `flag == 0` data:
$GGGPATH/bin/write_public_netcdf PRIVATE_NC_FILE
To avoid renaming the public file to match the dates of `flag == 0` data, use the `--no-rename-by-dates` flag. This will replace "private" in the extension with "public", so if `PRIVATE_NC_FILE` was `pa_ggg_benchmark.private.qc.nc`, the public file would be named `pa_ggg_benchmark.public.qc.nc`:
$GGGPATH/bin/write_public_netcdf --no-rename-by-dates PRIVATE_NC_FILE
Both of the above examples will use the standard TCCON configuration to determine which variables to copy. To use the extended TCCON configuration (which will include gases from the secondary detector), add the `--extended` flag:
$GGGPATH/bin/write_public_netcdf --extended PRIVATE_NC_FILE
If you need to customize which variables are copied, you must create your own configuration TOML file and pass it to the `--config` option:
$GGGPATH/bin/write_public_netcdf --config CUSTOM_CONFIG.toml PRIVATE_NC_FILE
For information on the configuration file format, see its section of this book.
To withhold the newest data from the public file, you can use the `--data-latency-date` or `--data-latency-days` options to specify, respectively, a specific date after which to withhold data or a number of days in the past from today.
$GGGPATH/bin/write_public_netcdf --data-latency-date 2025-01-01 PRIVATE_NC_FILE
$GGGPATH/bin/write_public_netcdf --data-latency-days 120 PRIVATE_NC_FILE
The first one will withhold data with a ZPD time after midnight UTC on 1 Jan 2025 from the public file. The second will withhold data with a ZPD time after midnight UTC 120 days ago from the public file: if run on 1 May 2025 (UTC), this would also have 1 Jan 2025 as the cutoff date.
Use in TCCON standard processing
Individual TCCON sites should not need to use this program under normal circumstances. This program will be run at Caltech on the concatenated and quality controlled private netCDF files, and the resulting public netCDF files will be made available through tccondata.org. This function is provided as part of GGG-RS for sites that have, for example, low latency or custom products delivered to specific users as non-standard TCCON data, but wish to provide the data in the user-friendly public format instead of the much more intimidating private file format. Presently, you will need to follow the instructions on the TCCON wiki to generate a concatenated and quality controlled private file, then run this program on the resulting file. Note that access permission is required for this wiki page to track who is generating GGG public files.
Use in EM27/SUN standard processing
As there is not yet an equivalent of the TCCON data pipeline at Caltech for EM27/SUN data processed with GGG, operators will likely want to use this program to generate public files of their data for upload to whatever data repository they host from. Presently, you will need to follow the instructions on the TCCON wiki to generate a concatenated and quality controlled private file, then run this program on the resulting file. Note that access permission is required for this wiki page to track who is generating GGG public files.
Configuration
The public netCDF writer must strike a balance between being strict enough to ensure that the variables required for standard TCCON usage are included in normal operation and flexible enough to allow non-standard usage. To that end, the writer by default requires that the standard TCCON variables be present, but can be configured to change the required variables.
The configuration file uses the TOML format, and can be broadly broken down into five sections:
- auxiliary variables,
- computed variables,
- Xgas variable sets,
- Xgas discovery, and
- default settings.
Auxiliary variables
Auxiliary variables are those which are not directly related to one of the target Xgases but which provide useful information about the observations. Common examples are time, latitude, longitude, solar zenith angle, etc.
These are defined in the `aux` section of the TOML file as an array of tables. The simplest way to define an auxiliary variable to copy is to give the name of the private variable in the netCDF file and what value to use as the long name:
[[aux]]
private_name = "solzen"
long_name = "solar zenith angle"
This will copy the variable `solzen` from the private netCDF file along with all its attributes except `standard_name` and `precision`, add the `long_name` attribute, and put the variable's data (subsetting to `flag == 0` data) into the public file as `solzen`. Note that the `long_name` value should follow the CF conventions meaning. We prefer `long_name` over `standard_name` because the available standard names do not adequately describe remotely sensed quantities.
If instead you wanted to rename the variable in the public file, you can add the `public_name` field:
[[aux]]
private_name = "solzen"
public_name = "solar_zenith_angle"
long_name = "solar zenith angle"
This would rename the variable to `solar_zenith_angle` in the public file, but otherwise behave identically to the above.
You can also control the attributes copied through two more fields, `attr_overrides` and `attr_to_remove`. `attr_overrides` is a TOML table of attribute names and values that will be added to the public variable. If an attribute is listed in the private file with the same name as an override, the override value in the config takes precedence. `attr_to_remove` is an array of attribute names to skip copying if present. (If one of these attributes is not present in the private file, nothing happens.) An example:
[[aux]]
private_name = "day"
long_name = "day of year"
attr_overrides = {units = "Julian day", description = "1-based day of year"}
attr_to_remove = ["vmin", "vmax"]
This will add or replace the attributes `units` and `description` in the public file with those given here, and ensure that the `vmin` and `vmax` attributes are not copied. Take note that specifying `attr_to_remove` overrides the default list of `standard_name` and `precision`; this can be useful if you want to retain those (you can do so by specifying `attr_to_remove = []`), but if you want to exclude them, you must add them to your list.
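For example, to keep `standard_name` and `precision` on a copied variable, you can supply an empty removal list. This is a minimal sketch reusing the `solzen` variable from above:

```toml
[[aux]]
private_name = "solzen"
long_name = "solar zenith angle"
# An empty list overrides the default removal of standard_name and precision.
attr_to_remove = []
```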
Finally, by default any auxiliary variable listed here must be found in the private netCDF file, or the public writer stops with an error. To change this behavior so that a variable is optional, add the `required = false` field to an aux variable:
[[aux]]
private_name = "day"
long_name = "day of year"
required = false
Each auxiliary variable to copy will have its own `[[aux]]` section, for example:
[[aux]]
private_name = "time"
long_name = "zero path difference UTC time"
[[aux]]
private_name = "year"
long_name = "year"
[[aux]]
private_name = "day"
long_name = "day of year"
attr_overrides = {units = "Julian day", description = "1-based day of year"}
[[aux]]
private_name = "solzen"
long_name = "solar zenith angle"
By default, any of the standard TCCON auxiliary variables not listed will be added. See the Defaults section for how to modify that behavior.
Computed variables
Computed variables are similar to auxiliary variables in that they are not directly associated with a single Xgas. Unlike auxiliary variables, these cannot simply be copied from the private netCDF file. Instead, they must be computed from one or more private variables. Because of that, there is a specific set of these variables pre-defined by the public writer. Currently only one computed variable type exists, "prior_source". You can specify it in the configuration as follows:
[[computed]]
type = "prior_source"
By default, this creates a public variable named "apriori_data_source". You can change this with the `public_name` field, e.g.:
[[computed]]
type = "prior_source"
public_name = "geos_source_set"
Xgas discovery
Usually we do not want to specify every single Xgas to copy; instead, we want the writer to scan the private file to identify Xgases and copy everything that matches. This both saves a lot of tedious typing in the configuration and minimizes the possibility of copy-paste errors.
Rules
The first part of the discovery section is a list of rules for how to find Xgas variables. These come in two variants:
- Suffix rules: these look for variables that start with an Xgas-like pattern and end in the given suffix. The full regex is `^x([a-z][a-z0-9]*)_{suffix}$`, where `{suffix}` is the provided suffix. Note that the suffix is passed through `regex::escape` to ensure that any special characters are escaped; it will only be treated as a literal.
- Regex rules: these allow you to specify a regular expression to match variable names. The regex must include a named capture group with the name "gas" that extracts the physical gas abbreviation (i.e., the `gas` value in an Xgas entry). This looks like `(?<gas>...)`, where the `...` is the regular subexpression that matches that part of the string.
By default, the configuration will add a single regex rule that matches the pattern `^x(?<gas>[a-z][a-z0-9]*)$`. You can disable this by setting `xgas_rules = false` in the `[defaults]` section of the config. This rule is designed to match basic Xgas variables, e.g., "xch4", "xn2o", etc.
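If that default rule does not suit your files, disabling it is a one-line change. A minimal sketch (any other keys in `[defaults]` are unaffected):

```toml
[defaults]
xgas_rules = false
```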
An example of a regular expression rule that uses the default ways to infer its ancillary variables is:
[[discovery.rule]]
regex = '^column_average_(?<gas>\w+)$'
Two things to note are:
- The regular expression is inside single quotes; this is how TOML specifies literal strings, and it is the most convenient way to write regexes that include backslashes. (Otherwise TOML itself will interpret them as escape characters.)
- The regex includes `^` and `$` to anchor the start and end of the pattern. In most cases, you will probably want to do so as well to avoid matching arbitrary parts of variable names.
An example of a suffix rule that also indicates that variables matching this rule should not include averaging kernels or the traceability scale is:
[[discovery.rule]]
suffix = "mir"
ak = { type = "omit" }
traceability_scale = { type = "omit" }
Note that the suffix rule contains a "suffix" key, while the regular expression rule has a "regex" key - this is how they are distinguished. Also note that rules are checked in order, and a variable is added following the first rule that matches. This means that if a variable matches multiple rules, then its ancillary variables will be set up following the first rule that matched.
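To illustrate the ordering, consider this hypothetical pair of rules. A variable such as "xch4_mir" matches both patterns, but is handled by the first (suffix) rule because rules are checked in order:

```toml
# Checked first: mid-IR products, with no AKs copied.
[[discovery.rule]]
suffix = "mir"
ak = { type = "omit" }

# Checked second: any other suffixed Xgas variable.
[[discovery.rule]]
regex = '^x(?<gas>[a-z][a-z0-9]*)_\w+$'
```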
Attributes
Discovery rules can specify the fields `xgas_attr_overrides`, `xgas_error_attr_overrides`, `prior_profile_attr_overrides`, `prior_xgas_attr_overrides`, and `ak_attr_overrides` to set attributes on their respective variables.
These should be used for attributes that will be the same for all the variables of that type created by this rule.
For example, to add a cautionary note about experimental data to our previous mid-IR discovery rule:
[[discovery.rule]]
suffix = "mir"
xgas_attr_overrides = { note = "Experimental data, use with caution!" }
ak = { type = "omit" }
traceability_scale = { type = "omit" }
Ancillary variables
The rules also include default settings for the prior profile, prior column average, averaging kernel (and its slant Xgas bins), and the traceability scale. These can be specified the same way as described in the Xgases ancillary subsection, and the defaults are the same as well. However, only the `inferred` and `omit` types may be used, as `specified` does not make sense when a rule may apply to more than one Xgas.
Exclusions
The second part of the discovery section consists of lists of gases or variables to exclude. The first option is `excluded_xgas_variables`. If a variable's private file name matches one of the names in that list, it will not be copied, even if it matches one of the rules. The other option is `excluded_gases`, which matches not the variable name but the physical gas.
The easiest way to explain this is to consider the standard TCCON configuration:
[discovery]
excluded_xgas_variables = ["xo2"]
excluded_gases = ["th2o", "fco2", "zco2"]
`excluded_xgas_variables` specifically excludes the "xo2" variable; this would match the default regex rule meant to capture Xgases measured on the primary detector, but we do not want to include it because it is not useful for data users. However, O2 measured on a silicon detector may be useful, so we do not want to exclude all O2 variables. `excluded_gases` lists three gases that we want to exclude no matter which detector they are retrieved from. "fco2" and "zco2" are diagnostic windows (for channelling and zero-level offset, respectively) and so would otherwise be included once for each detector. "th2o" is temperature-sensitive water, which is generally confusing for the average user, so we want to ensure that it is also excluded from every detector.
Explicitly specified Xgases
This section allows you to list specific Xgas variables to copy, along with some or all of the ancillary variables needed to properly interpret them. Usually, you will not specify each Xgas by hand, but instead will use the discovery capability of the writer to automatically find each Xgas to copy. However, variables explicitly listed in this section take precedence over those found by the discovery rules. This leads to two cases where you might specify an Xgas in this section:
- The Xgas does not follow the common naming convention, thus making it difficult to discover with simple rules, or difficult to map from the Xgas variable to the ancillary variable.
- The Xgas needs a different way of handling one of its ancillary variables than the default discovery provides.
Each Xgas has the following options:
- `xgas` (required): the variable name.
- `gas` (required): the physical gas name, e.g., "co2" for all the various CO2 variables (regular, `wco2`, and `lco2`). This is used to match up to, e.g., the priors, which do not distinguish between the different spectral windows.
- `gas_long` (optional): the full name of the gas instead of its abbreviation, e.g., "carbon dioxide" for CO2. If not given, then the configuration will try to find the `gas` value in its `[gas_long_names]` section and use that, falling back on the `gas` value if the gas is not defined in the gas long names section.
- `xgas_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private Xgas variable.
- `xgas_error_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private Xgas error variable.
- `prior_profile` (optional): how to copy the a priori profile.
- `prior_profile_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private prior profile variable.
- `prior_xgas` (optional): how to copy the a priori column average.
- `prior_xgas_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private prior Xgas variable.
- `ak` (optional): how to copy the averaging kernels.
- `ak_attr_overrides` (optional): a table of attribute values that can override existing attribute values on the private AK variable.
- `slant_bin` (optional): how to find the slant Xgas bin variable needed to expand the AKs.
- `traceability_scale` (optional): how to find the variable containing the WMO or analogous scale for this data.
- `required` (optional): `true` by default; set it to `false` if you want to copy this Xgas if present, without it being an error if it is missing.
`prior_profile`, `prior_xgas`, `ak`, `slant_bin`, and `traceability_scale` can all be defined following the syntax in the ancillary variable specifications. Note that `slant_bin` is a special case in that it will only be used if the AKs are to be copied, but cannot be omitted in that case.
To illustrate the two main use cases for this section, here is an excerpt from the standard TCCON configuration:
[[xgas]]
xgas = "xluft"
gas = "luft"
gas_long = "dry air"
prior_profile = { type = "omit" }
prior_xgas = { type = "omit" }
ak = { type = "omit" }
[[xgas]]
xgas = "xco2_x2019"
gas = "co2"
prior_profile = { type = "specified", only_if_first = true, private_name = "prior_1co2", public_name = "prior_co2" }
prior_xgas = { type = "specified", only_if_first = true, private_name = "prior_xco2_x2019", public_name = "prior_xco2" }
ak = { type = "specified", only_if_first = true, private_name = "ak_xco2" }
slant_bin = { type = "specified", private_name = "ak_slant_xco2_bin" }
[[xgas]]
xgas = "xwco2_x2019"
gas = "co2"
prior_profile = { type = "specified", only_if_first = true, private_name = "prior_1co2", public_name = "prior_co2" }
prior_xgas = { type = "specified", only_if_first = true, private_name = "prior_xwco2_x2019", public_name = "prior_xco2" }
ak = { type = "specified", only_if_first = true, private_name = "ak_xwco2" }
slant_bin = { type = "specified", private_name = "ak_slant_xwco2_bin" }
First we have `xluft`. This variable would be discovered by the default rule; however, that rule will require prior information and AKs. The prior information is not useful for Xluft, so we want to avoid copying it to reduce the number of extraneous variables, and there are no AKs for Xluft. Thus we specify "omit" for each of these to tell the writer not to look for them. We do not have to tell it to omit `slant_bin`, because omitting the AKs implicitly skips that, and `traceability_scale` can be left as normal because there is an "aicf_xluft_scale" variable in the private netCDF files.
Second, we have `xco2_x2019` and `xwco2_x2019`. (We have omitted the x2007 variables and `xlco2_x2019` from the above example for brevity.) These would not be discovered by the default rule. Further, the mapping to their prior and AK variables is unique: all the CO2 Xgas variables can share the prior profiles and column averages, and each "flavor" of CO2 (regular, wCO2, or lCO2) can use the same AKs whether it is on the X2007 or X2019 scale. Thus, we not only define that these variables need to be copied, but also that we want to rename the prior variables to just "prior_co2" and "prior_xco2" and only copy them the first time we find them. We also ensure that the AKs and slant bins point to the correct variables.
If we wanted to set the `note` attribute of `xluft`, we could do that like so:
[[xgas]]
xgas = "xluft"
gas = "luft"
gas_long = "dry air"
prior_profile = { type = "omit" }
prior_xgas = { type = "omit" }
ak = { type = "omit" }
xgas_attr_overrides = { note = "This is a diagnostic quantity" }
Not all attributes can be overridden; some are set internally by the netCDF writer to ensure consistency. If an attribute is not getting the value you expect, first check the logging output from the netCDF writer for a warning that a particular attribute cannot be set.
Ancillary variable specifications
The ancillary variables (prior profile, prior Xgas, AK, slant bin, and traceability scale) can be defined as one of the following three types:
- `inferred`: indicates that this ancillary variable must be included and should not conflict with any other variable. The private and public variable names will be inferred from the Xgas and gas names. This type has two options:
  - `only_if_first`: a boolean (`false` by default) that, when set to `true`, will skip copying the relevant variable if a variable with the same public name is already in the public file. Note that the writer does not check that the existing variable's data are equal to what would be written for the new variable!
  - `required`: a boolean (`true` by default) that, when set to `false`, allows this variable to be missing from the private file. This is intended for Xgas discovery rules more than explicit Xgas definitions.
- `specified`: allows you to specify exactly which variable to copy with the `private_name` field. You can also give the `public_name` field to indicate what the variable name in the output file should be; if that is omitted, then the public variable will have the same name as the private variable. It is an error if the public variable already exists. This type also allows the `only_if_first` field, which behaves as it does for the `inferred` type.
- `omit`: indicates that this variable should not be copied.
In the following example, the prior Xgas field shows the use of the `inferred` options, the prior profile field shows the use of the `specified` options, and the AK field the use of `omit`.
[[xgas]]
xgas = "xhcl"
gas = "hcl"
prior_xgas = { type = "inferred", only_if_first = true, required = false }
prior_profile = { type = "specified", only_if_first = true, private_name = "prior_1hcl", public_name = "prior_hcl" }
ak = { type = "omit" }
Ancillary variable name inference
The writer uses the following rules when asked to infer ancillary variable names. In these, `{xgas_var}` indicates the Xgas variable name and `{gas}` the physical gas name.
- `prior_profile`: looks for a private variable named `prior_1{gas}` and writes to a variable named `prior_{gas}`.
- `prior_xgas`: looks for a private variable named `prior_{xgas_var}` and writes to the same variable.
- `ak`: looks for a private variable named `ak_{xgas_var}` and writes to the same variable.
- `slant_bin`: looks for a private variable named `ak_slant_{xgas_var}_bin`. This is not written; it is only used to expand the AKs to one-per-spectrum.
- `traceability_scale`: looks for a private variable named `aicf_{xgas_var}_scale`. The result is always written to the `wmo_or_analogous_scale` attribute of the Xgas variable; that cannot be altered by this configuration.
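For example, for an explicit entry with `xgas = "xwco2"` and `gas = "co2"`, inference would look for `prior_1co2` (written to the public file as `prior_co2`), `prior_xwco2`, `ak_xwco2`, and `ak_slant_xwco2_bin`, and would write the value of `aicf_xwco2_scale` to the `wmo_or_analogous_scale` attribute of `xwco2`.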
Gas proper names
Ideally, all Xgases should include their proper name in the `long_name` attribute, rather than just the abbreviation. This section allows you to map the formula (e.g., "co2") to the proper name (e.g., "carbon dioxide"), e.g.:
[gas_long_names]
co2 = "carbon dioxide"
ch4 = "methane"
co = "carbon monoxide"
Note that the keys are the gases, not Xgases. A default list is included if not turned off in the `[defaults]` section. See the source code for `DEFAULT_GAS_LONG_NAMES` for the current list. You can override any of those without turning off the defaults; e.g., setting `h2o = "dihydrogen monoxide"` in this section will replace the default of "water". Of course, when explicitly defining an Xgas to copy, you can write in the proper name as the `gas_long` value.
The `[gas_long_names]` section is most useful for automatically discovered Xgases, but it can also be useful when defining multiple Xgas variables that refer to the same physical gas, as the standard TCCON configuration does with CO2.
Global attributes
Which global attributes (i.e., attributes from the root group of the netCDF file) are copied to the public file is defined by the `[global_attributes]` section. This section contains two lists of strings:
- `must_copy` lists the names of attributes that must be available in the private netCDF file, or an error is raised.
- `copy_if_present` lists the names of attributes to copy if available in the private netCDF file, but not to raise an error for if missing.
An abbreviated example from the TCCON standard configuration is:
[global_attributes]
must_copy = [
"source",
"description",
"file_creation",
]
copy_if_present = [
"long_name",
"location",
]
At present, there is no way to manipulate attributes' values during the copying process, nor to add arbitrary attributes. In general, attributes should be added to the private netCDF file, then copied to the public file; this ensures that attributes are consistent between the two files. However, in the future we may add the ability to define some special cases.

The `history` attribute is a special case: it will always be created or appended to following the CF conventions, no matter what the configuration says. To avoid conflicts with this built-in behavior, do not specify `history` as an attribute to copy in the configuration file.
Including other configurations
The public netCDF writer configuration can extend other configuration files. The intended use is to define a base configuration with a set of variables that should always be included, then extend that for different specialized use cases. The TCCON standard configurations use this to reuse the standard configuration (including all of the auxiliary variables, normal computed variables, and Xgas values from the InGaAs detector) in the extended configuration (which adds the InSb or Si Xgas values).
To use another configuration file, use the `include` key:

```toml
include = ["base_config.toml"]
```
This is a list, so you can include multiple files:

```toml
include = ["base_config.toml", "extra_aux.toml"]
```
Currently, when relative paths are given as the values for `include`, as done here, they are interpreted as relative to the current working directory. However, you should not rely on that behavior; the intention is to adjust this so that relative paths are interpreted as relative to the configuration file in which they appear. If you have a global set of configuration files that use `include`, it is best for now to use absolute paths in their respective `include` keys.
How configurations are combined
Internally, `write_public_netcdf` uses the `figment` crate to combine the configurations. Specifically, it uses the "adjoin" conflict resolution strategy. This means that lists (like the explicit Xgas definitions) from each configuration will be concatenated, and scalar values will be taken from the first configuration that defines them. (The order in which the configurations are parsed is defined next.)
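As a sketch of how adjoin plays out, consider two hypothetical configuration files; the `[[xgas]]` keys follow the format shown earlier, but the file names and the combined result described in the comments are invented for illustration:

```toml
# first.toml (parsed first)
[[xgas]]
xgas = "xco2"
gas = "co2"

# second.toml (parsed second)
[[xgas]]
xgas = "xch4"
gas = "ch4"

# Combined configuration: the [[xgas]] list contains xco2 and then xch4,
# concatenated in parse order. If both files defined the same scalar key,
# the value from first.toml (the first to define it) would win.
```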
Order of inclusion
The `include` key is recursive, so if `file1.toml` includes `file2.toml`, and `file2.toml` includes `file3.toml`, then `file1.toml` will include the combination of `file2.toml` and `file3.toml`.
When files include more than one other file, the algorithm does a "depth-first" ordering. That is, if our top-level configuration has:

```toml
# top.toml
include = ["middle1.toml", "middle2.toml"]
```

`middle1.toml` has:

```toml
# middle1.toml
include = ["bottom1a.toml", "bottom1b.toml"]
```

and `middle2.toml` has:

```toml
# middle2.toml
include = ["bottom2a.toml", "bottom2b.toml"]
```
then the order in which the files are added to the configuration is:

1. `top.toml`
2. `middle1.toml`
3. `bottom1a.toml`
4. `bottom1b.toml`
5. `middle2.toml`
6. `bottom2a.toml`
7. `bottom2b.toml`
In other words, coupled with the "adjoin" behavior described above, this gives the precedence you would expect: settings in `top.toml` take precedence, then all settings in `middle1.toml` (whether they are defined in `middle1.toml` itself or one of its included files), and then `middle2.toml` (and its included files) comes last.
Although you can create complex hierarchies of configurations as shown in this example, doing so is generally not a good idea. The more complicated you try to make the set of included files, the more likely you are to end up with unexpected results - duplicated variables, wrong attributes, etc. If you find yourself using more than one layer of inclusion, you may be better off simply creating one large configuration file with the necessary parts of the other files copied into it.
Defaults
Unlike other sections, the `[defaults]` section does not define variables to copy; instead, it modifies how the other sections are filled in. If this section is omitted, each of the other sections will fall back to reasonable default values when not given. The following boolean options are available to change that behavior:
- `disable_all`: setting this to `true` will ensure that no defaults are added in any section.
- `aux_vars`: setting this to `false` will prevent the TCCON standard auxiliary variables from being added in the `aux` section.
- `gas_long_names`: setting this to `false` will prevent the standard mapping of chemical formulae to proper gas names from being added to `[gas_long_names]`.
- `xgas_rules`: setting this to `false` will prevent the standard list of patterns to match when looking for Xgas variables from being added to `[discovery.rules]`.
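For example, a configuration that keeps the standard auxiliary variables but supplies its own gas names and Xgas discovery rules might set (an illustrative combination):

```toml
[defaults]
gas_long_names = false
xgas_rules = false
```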
Debugging tips
Configuration parsing errors
If you are trying to use a custom configuration and the writer gives an error that it cannot "deserialize" the file, that means there is either a TOML syntax error or another typo. To narrow down where the problem is, incrementally comment out parts of the configuration file and run the writer with the `--check-config-only` flag. When the writer can parse the configuration, it will print an internal representation of it to the screen. Thus, once that starts working, whatever section you most recently commented out is the likely culprit.
Unexpected output (such as missing or duplicated variables)
If variables seem to be missing from the output, the same variable is written twice, or you hit other problems when using a custom configuration, first run the writer with the `--check-config-only` flag and carefully examine the printed parsed version of the configuration. This can help you check whether the configuration is being interpreted as you intended, especially when using the include feature.
Checking on Xgas discovery
If variables are not being copied correctly, increase the verbosity of `write_public_netcdf` by adding `-v` or `-vv` to the command line. The first will activate debug output, which includes a lot of information about Xgas discovery. `-vv` will also activate trace-level logging, which outputs even more information about the configuration as the program reads it.
Site metadata file
`write_public_netcdf` can take the data latency/release lag from a TOML file that specifies site metadata. There is an example for the standard TCCON sites in the repo. This file must have top-level keys that match sites' two-character IDs, with each key containing a number of metadata values. For example:
```toml
[pa]
long_name = "parkfalls01"
release_lag = 120
location = "Park Falls, Wisconsin, USA"
contact = "Paul Wennberg <wennberg@gps.caltech.edu>"
data_revision = "R1"
data_doi = "10.14291/tccon.ggg2014.parkfalls01.R1"
data_reference = "Wennberg, P. O., C. Roehl, D. Wunch, G. C. Toon, J.-F. Blavier, R. Washenfelder, G. Keppel-Aleks, N. Allen, J. Ayers. 2017. TCCON data from Park Falls, Wisconsin, USA, Release GGG2014R1. TCCON data archive, hosted by CaltechDATA, California Institute of Technology, Pasadena, CA, U.S.A. http://doi.org/10.14291/tccon.ggg2014.parkfalls01.R1"
site_reference = "Washenfelder, R. A., G. C. Toon, J.-F. L. Blavier, Z. Yang, N. T. Allen, P. O. Wennberg, S. A. Vay, D. M. Matross, and B. C. Daube (2006), Carbon dioxide column abundances at the Wisconsin Tall Tower site, Journal of Geophysical Research, 111(D22), 1-11, doi:10.1029/2006JD007154. Available from: https://www.agu.org/pubs/crossref/2006/2006JD007154.shtml"

[oc]
long_name = "lamont01"
release_lag = 120
location = "Lamont, Oklahoma, USA"
contact = "Paul Wennberg <wennberg@gps.caltech.edu>"
data_revision = "R1"
data_doi = "10.14291/tccon.ggg2014.lamont01.R1/1255070"
data_reference = "Wennberg, P. O., D. Wunch, C. Roehl, J.-F. Blavier, G. C. Toon, N. Allen, P. Dowell, K. Teske, C. Martin, J. Martin. 2017. TCCON data from Lamont, Oklahoma, USA, Release GGG2014R1. TCCON data archive, hosted by CaltechDATA, California Institute of Technology, Pasadena, CA, U.S.A. https://doi.org/10.14291/tccon.ggg2014.lamont01.R1/1255070"
site_reference = ""
```
Although the public netCDF writer only uses `release_lag`, each site must contain the following keys for this file to be valid:

- `long_name`: the site's human-readable location name followed by a two-digit number indicating which instrument at that site this is.
- `release_lag`: an integer >= 0 specifying how many days after acquisition data should be kept private. TCCON sites are not permitted to set this above 366, as data delivery to the public archive within one year is a network requirement.
- `location`: a description of the location where the instrument resides. This is usually "city, state/province/country", but can include things such as the institution if desired.
- `contact`: the name and email address, formatted as `NAME <EMAIL>`, of the person users should contact with questions or concerns about this site's data.
- `data_revision`: an "R" followed by a number >= 0 indicating which iteration of GGG2020 reprocessing this data represents. This should be incremented whenever previously public data is reprocessed to fix an error.
The following keys may be provided, but are not required:

- `data_doi`: a digital object identifier that points to the public data for this site. This should be included if possible; it is optional only so that public files can be created before the first time a DOI is minted.
- `data_reference`: a reference to a persistent repository where the data can be downloaded. For TCCON sites, this will be CaltechData. For other instruments, this may vary for now.
- `site_reference`: a reference to a publication describing the site location itself (as opposed to the data).
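Putting the required keys together, a minimal valid entry might look like the following; the site ID and all values here are invented for illustration:

```toml
[xx]
long_name = "example01"
release_lag = 30
location = "Example City, Country"
contact = "Jane Doe <jdoe@example.edu>"
data_revision = "R0"
```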
TCCON sites can find the most up-to-date versions of their values for this metadata at https://tccondata.org/metadata/siteinfo/. Other users should do their best to ensure that the above conventions are followed.