
How to create DMR++ sidecar files?

Let’s begin by understanding what DMRs are. A DMR is an XML file that describes the contents of a file stored in a format traditionally used for scientific data, such as NetCDF, HDF4, HDF5, or CSV. For example, DMRs hold information about variable names, types, sizes, attributes, dimensions, and so on, enabling any user or API to understand what is inside a file without opening it. This separation of metadata from data is what enables “lazy reads / evaluations”, and has been a core principle of the OPeNDAP protocols since their inception.

DMRs follow a schema used by Hyrax (OPeNDAP, Inc.’s data server) and THREDDS (Unidata’s OPeNDAP server) to describe metadata in the DAP4 protocol. DMRs thus replace the .DDS and .DAS of the DAP2 metadata representation. To inspect the DMR of an OPeNDAP data URL as XML, append “.dmr.xml” to the end of the data URL and paste it into a browser.

For example, you can use the data URL below to inspect the DMR of the test HDF5 dataset SimpleGroup:


http://test.opendap.org:8080/opendap/dap4/SimpleGroup.nc4.h5.dmr.xml
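
If you prefer the command line, the same DMR can be fetched with curl (assuming curl is available on your system):

curl -s http://test.opendap.org:8080/opendap/dap4/SimpleGroup.nc4.h5.dmr.xml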

  

What about the “++” in DMR++? That part is unique to the Hyrax data server developed and maintained by OPeNDAP, Inc. The “++” element of DMR++ refers to “extra” information such as chunk references, byte offsets, and compression information, which enables client APIs to retrieve subsets of data efficiently (see Figure 1). In an abstract sense, the DMR++ is a map of where the bytes live within a single file, so the Hyrax data server (or any other software that can parse the DMR++) knows exactly which chunk to access when retrieving a data subset, without first “breaking the file apart” into individual chunks (i.e., transforming the file into a Zarr store), thus preserving the integrity of the original file.

DMR++ files are useful for scalable, “serverless” cloud computing, as they can significantly speed up scientific analysis workflows and data exploration, thus accelerating scientific discovery. With a DMR++, a NetCDF4, HDF4, or even HDF5 file can become an ARCO (Analysis Ready and Cloud Optimized) dataset without being broken apart into its individual constituents (chunks), and independent of the Level of Processing typical of hierarchical datasets (e.g. Level 2, Level 3, or Level 4). Typically, DMR++ sidecar files are located in the same bucket as the original file they describe, although that is not required.

Figure 1. Diagram showing multiple HDF5 datasets within an S3 bucket, and their respective DMR++ sidecar files.

At the moment, DMR++ can only be generated for HDF4, HDF5, and NetCDF4 datasets, which are widely used by NASA and across the Geosciences. If your organization would like to support development of DMR++ for a broader range of formats, contact us at Work with Us. We would love to hear about your data needs, and those of your users.

Below we provide a simple workflow to produce DMR++ for PACE data, stored in NetCDF4. There are many access points for this dataset, one of which is OPeNDAP.

Requirements

  • A Docker instance running in the background.
  • This example uses macOS (although a Linux OS also works).
  • NetCDF4, HDF5, or HDF4 data files on a local filesystem (the data may also be on an S3 bucket). Here, the data will be located in `$prefix/OPeNDAP/DATA/`, where $prefix refers to the home directory; the dataset is NetCDF-4. A minimal sketch of this layout is shown below.
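
For reference, here is a minimal sketch of the local layout assumed in this walkthrough (the PACE file name is the example used in the steps below):

mkdir -p ~/OPeNDAP/DATA
# place your NetCDF-4 / HDF5 / HDF4 files here, e.g.:
# ~/OPeNDAP/DATA/PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc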

A Quick Note About DMR++ and NetCDF

  • DMR++ can only be created for NetCDF-4 files (NetCDF Enhanced), because these use the HDF5 library as their backend, and get_dmrpp was developed for HDF5 datasets thanks to generous funding from NASA.
  • DMR++ cannot be created for NetCDF-3 (NetCDF Classic).
  • An .nc file may be either NetCDF-4 or NetCDF-3. If you are not sure which version your dataset is, you can run ncdump -h followed by the filename, as shown below.
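
For example (assuming the netCDF command-line utilities are installed locally, and using the PACE file from this walkthrough as a stand-in for your own file):

ncdump -k PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc   # prints just the file kind, e.g. "netCDF-4" vs. "classic"
ncdump -h PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc   # prints the header (metadata) only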

Steps to Generate DMR++ sidecar files

  1. In a terminal window, pull the latest official release of Hyrax from OPeNDAP’s Docker Hub.

docker pull opendap/hyrax:latest
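
Optionally, confirm that the image is now available locally:

docker images opendap/hyrax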
  

  2. Run a Hyrax instance, specifying where your data is. In this example, the data is located in the folder $prefix/OPeNDAP/DATA.

docker run -d -h hyrax -p 8080:8080 \
--platform linux/amd64 \
--volume ~/OPeNDAP/DATA:/usr/share/hyrax \
--name=hyrax opendap/hyrax:latest
  

where ~ in the path refers to the home directory on macOS.

If you are running this tutorial on Linux, omit the platform definition. NOTE: At this point, you are running the Hyrax data server and can view the OPeNDAP landing page in your local browser by going to: http://localhost:8080/opendap/
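
You can also check from the terminal that the server is responding (a simple sanity check; the first line of the response should report an HTTP 200 status):

curl -sI http://localhost:8080/opendap/ | head -n 1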

  3. Activate the container’s bash shell

docker exec -it hyrax bash
  


Check the location of the files. Within the container’s shell, the data are located in /usr/share/hyrax. You can thus run:


ls -l /usr/share/hyrax
  

  4. Consider the single file PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc located in the data directory. You can now create a DMR++ associated with this single file by running the following command (from the root directory, i.e. after cd /, so that the relative path below resolves):


get_dmrpp -b `pwd` -o PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc.dmrpp usr/share/hyrax/PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc
  

After ~10 ms, there will be a new file named PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc.dmrpp. This is the DMR++ file.
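
To take a quick look at the result, you can print the first few lines of the new file (it is plain XML):

head -n 20 PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc.dmrpp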

WARNING: The script get_dmrpp assumes the file is either HDF5 or NetCDF4. If it is NOT any of these formats, it will throw an error.

  5. EXTRA: If you have a collection of files, you can easily write a shell script that creates a DMR++ sidecar file for each one of them. Simply do:

touch dmrpp_pace.sh
  

Add the following content


#!/bin/bash

# Data directory (as mounted inside the Hyrax container)
directory="/usr/share/hyrax"

# Loop through each file in the directory; this script is meant to be run from /
for file in "$directory"/*; do
    # Process regular files only, and skip any previously generated .dmrpp files
    if [ -f "$file" ] && [[ "$file" != *.dmrpp ]]; then
        # Construct the output file name
        output_file="${file}.dmrpp"

        # Execute get_dmrpp on the file
        get_dmrpp -b "$(pwd)" -o "$output_file" "$file"
    fi
done
  

Make the dmrpp_pace.sh file executable:


chmod +x dmrpp_pace.sh
  

Finally, execute the shell script:


./dmrpp_pace.sh
  

After ~N/2 seconds (where N is the number of files), there should be a .dmrpp file associated with each NetCDF4 dataset.
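
You can confirm this from the container’s shell by listing the generated sidecar files:

ls -l /usr/share/hyrax/*.dmrpp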

Figure 2. Example DMR++ file associated with the file PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc. The ++ element (which, together with the DMR, makes this a DMR++) for the variable named ‘Lat’ is contained within lines 27–34. Each variable array possesses a similar <dmrpp:chunks> </dmrpp:chunks> XML element.

NOTE that the shell script was executed from the root directory (within the Hyrax bash shell, do cd /), and all the .dmrpp files were placed in the same directory as their associated dataset.

What is the difference between a DMR and a DMR++?

Most of the content of the .dmrpp file is the DMR (the metadata representation of the file), which the Hyrax data server can create on the fly. The extra parts added to the DMR, i.e. the ++, are the elements defined via <dmrpp:chunks ...> </dmrpp:chunks>. These are the chunk references. For example, in Figure 2, the code block associated with the variable ‘Lat’ shows information about each chunk: its size in bytes, its offset, compression, and so on.
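
If you want to peek at just the “++” portion of a generated file, one simple way (using the PACE example from the steps above; the -A 8 merely prints a few lines of context after each match, so adjust as needed) is:

grep -A 8 "<dmrpp:chunks" /usr/share/hyrax/PACE_OCI.20250101.L3m.DAY.CHL.V3_0.chlor_a.4km.NRT.nc.dmrpp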