
Access Cloud and Aerosol Lidar data from NASA’s Calipso

Subset-driven, parallel access to cloud and aerosol lidar data from Calipso via OPeNDAP and Pydap, enabling efficient remote workflows. Credits: NASA Langley/Roman Kowch
About the Data
The dataset accessed in this tutorial is freely available and is the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) Lidar Level 2 1km resolution Cloud Layer data product. The cloud layer products consist of a sequence of column descriptors, each associated with a variable number of cloud layer descriptors that provide information necessary to study the many roles played by clouds and aerosols in Earth’s climate and weather. Source: NASA Earthdata.

Requirements

  • Earthdata login (EDL) credentials.
  • Concept Collection ID or DOI for the relevant data product.
  • Python >= 3.11.
  • Mamba-forge (or conda-forge) installed on the machine.
  • Familiarity with Jupyter notebooks and Jupyter Lab.

Optional:

  • Store all EDL credentials in a .netrc file.
  • Basic knowledge of conda environment installation.
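As a quick check before starting, the sketch below looks for an Earthdata Login entry in a .netrc file using only Python's standard library. The helper name `has_edl_credentials` is made up for this illustration and is not part of earthaccess; the sketch assumes the entry is stored under the EDL host `urs.earthdata.nasa.gov`.

```python
import netrc
from pathlib import Path

EDL_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host


def has_edl_credentials(netrc_path=None):
    """Return True if the .netrc file has an entry that applies to the EDL host.

    Hypothetical helper for illustration only. A `default` entry in the
    file also counts, since netrc falls back to it for unknown hosts.
    """
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    if not path.is_file():
        return False
    try:
        entry = netrc.netrc(str(path)).authenticators(EDL_HOST)
    except netrc.NetrcParseError:
        # A malformed file is treated the same as a missing entry
        return False
    return entry is not None
```

Calling `has_edl_credentials()` with no argument inspects `~/.netrc`. If it returns False, the interactive login described later in this tutorial is the safer choice.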

Objectives

To download springtime Cloud and Aerosol data, covering the years 2013 through 2023, from NASA’s Calipso Level 2 Collection in the area around Isla de Guadalupe in Mexico. The spatial and temporal range is defined by the following parameters:

  • Time range: springtime data (03/01 – 05/31) for each year from 2013 through 2023.
  • Spatial range: -121 < longitude < -115, and 26.5 < latitude < 31
  • Daytime data only.

The Data Challenge: Data stored in HDF4

All of our data of interest has Level 2 processing, meaning each file covers the geographic locations along the satellite's track (i.e., swath data). This adds a layer of complication, since each remote file must be subset with its own slice. In addition, all the data in this collection is stored in HDF4, a pre-cloud file format that is not optimized for cloud access. With OPeNDAP, the access pattern remains unaltered, independent of the remote source file format. But access to HDF4 data can be slower than access to the more modern HDF5, even though all the data is in the cloud, since pre-cloud formats like HDF4 were optimized for random access and for maximum write speed.

To accomplish the goals above, the tutorial will demonstrate how to:

  • Authenticate (via earthaccess).
  • Search for all available NASA OPeNDAP URLs for a specific NASA collection. The search will further filter by time range.
  • Subset with OPeNDAP, by variable name and spatial / temporal range.

Install required python dependencies

In a terminal shell, use mamba or conda (with the conda-forge channel) to install all the dependencies required to run this tutorial, then activate the environment and launch an interactive Jupyter notebook in a browser.

Terminal
$ mamba create -n opendap_env -c conda-forge python=3.12 ipython jupyterlab earthaccess netCDF4 xarray
$ mamba activate opendap_env
$ pip install git+https://github.com/pydap/pydap.git
$ jupyter lab

Once in the jupyter notebook environment, import in the first cell all necessary methods that will be used to stream remote data into a local file:

Python
import xarray as xr
import datetime as dt
import earthaccess
import numpy as np

# import pydap-specific tools
from pydap.client import get_cmr_urls, open_url
from pydap.client import to_netcdf as dap_to_netcdf

The needed parameter to search for all cloud and aerosol lidar data from Calipso that is available through OPeNDAP is

Concept Collection ID = C3463063995-LARC_CLOUD


Data from the above collection is a Level 2 data product, meaning SWATH data, and all remote files span different longitudes and latitudes. In this case, the necessary first step is to filter the search for all relevant data URLs by a bounding box. Later on, a further subset by coordinate values will be done by OPeNDAP.

Below are the required parameters to search for all OPeNDAP URLs using PyDAP's get_cmr_urls:

Python
# Concept Collection ID of the Calipso Level 2 1 km Cloud Layer product
Calipso_L2_ccid = "C3463063995-LARC_CLOUD"
bbox = [-121, 26.5, -115, 31]  # [west, south, east, north]

# Springtime (March 1 to May 31) for each year from 2013 through 2023
time_ranges = [[dt.datetime(year, 3, 1), dt.datetime(year, 5, 31)] for year in range(2013, 2024)]

# Search parameters shared by every time range
args = {
    "ccid": Calipso_L2_ccid,
    "bounding_box": bbox,
    "limit": 1000,
}
# One CMR search per time range, flattened into a single list of OPeNDAP URLs
cmr_urls = [url for time_range in time_ranges for url in get_cmr_urls(**args, time_range=time_range)]

What does specifying a bounding box defining an area of interest to the CMR search do?

The CMR search filters using the bounding box, returning all the OPeNDAP URLs with data that intersects the bounding box. To ONLY get data within the bounding box, we will have to do some more work as described below.
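The difference between a granule that intersects the bounding box and the data that lies within it can be illustrated with a toy ground track. This is a minimal sketch in plain Python; the track coordinates and the helper names (`in_bbox`, `granule_intersects`, `points_within`) are invented for illustration and do not come from CMR or PyDAP.

```python
def in_bbox(lon, lat, bbox):
    """True if the point lies inside [west, south, east, north] (edges included)."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def granule_intersects(track, bbox):
    """What a bounding-box search answers: does ANY point fall in the box?"""
    return any(in_bbox(lon, lat, bbox) for lon, lat in track)


def points_within(track, bbox):
    """What we actually want: ONLY the points that fall in the box."""
    return [(lon, lat) for lon, lat in track if in_bbox(lon, lat, bbox)]


bbox = [-121, 26.5, -115, 31]  # [west, south, east, north], as in this tutorial

# Made-up (longitude, latitude) samples along one satellite pass
track = [(-110.0, 60.0), (-114.0, 40.0), (-116.5, 30.0),
         (-117.0, 28.0), (-118.0, 24.0), (-121.0, 10.0)]

print(granule_intersects(track, bbox))  # True: the granule would be returned
print(points_within(track, bbox))       # [(-116.5, 30.0), (-117.0, 28.0)]
```

In this toy case the search would return the whole granule, although only two of its six samples are of interest.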

How to download only the data within the bounding box?

With OPeNDAP, it is a two-stage process.

  1. Download coordinate data ONLY from each granule, to identify the slices needed to download only the data within the bounding box.
  2. Use the identified slice from each OPeNDAP URL, to stream data into a local file for analysis. PyDAP enables this.

This workflow is demonstrated below. But before any data can be downloaded, we must authenticate via EDL.

EDL Authentication with earthaccess and OPeNDAP

There are various ways to authenticate with NASA, and here we will use earthaccess to retrieve a session object containing all required credentials to access data.

When using earthaccess to "login", you need to define a strategy and you have two options:

  • If you already have a .netrc file with your EDL credentials stored on your machine, set strategy="netrc"
  • If you DO NOT have a .netrc file with your EDL credentials, or you are not sure, set strategy="interactive" instead
Python
from earthaccess.exceptions import LoginStrategyUnavailable
try:
    auth = earthaccess.login(strategy="netrc", persist=True) 
except LoginStrategyUnavailable:
    # you will be prompted to add your EDL credentials
    auth = earthaccess.login(strategy="interactive", persist=True) 

# Get a session that carries the EDL token authorization.
# This will be used to download serialized binary data from OPeNDAP
my_session = auth.get_session()

Use OPeNDAP to subset data by coordinate values and variable names

Subset by variable names

This requires access to ONLY the metadata of the remote file.

Below we use PyDAP to download only the OPeNDAP DAP4 metadata. PyDAP will create a Python representation of the dataset, including all variable names and their dimensions, along with all metadata attributes associated with each variable. We will use this information to identify the variables of interest.

Python
pyds = open_url(cmr_urls[0], protocol="dap4", session=my_session)
pyds.tree()
Jupyter Cell Output
.CAL_LID_L2_01kmCLay-Standard-V5-00.2013-03-01T20-43-09ZD.hdf
├──Lidar_Surface_Detection
│  ├──Surface_Top_Altitude_532
│  ├──Surface_Base_Altitude_532
│  ├──Surface_Integrated_Attenuated_Backscatter_532
│  ├──Surface_532_Integrated_Depolarization_Ratio
│  ├──Surface_532_Integrated_Attenuated_Color_Ratio
│  ├──Surface_Detection_Flags_532
│  ├──Surface_Overlying_Integrated_Attenuated_Backscatter_532
│  ├──Surface_Scaled_RMS_Background_532
│  ├──Surface_Peak_Signal_532
│  ├──Surface_Detections_333m_532
│  ├──Surface_Top_Altitude_1064
│  ├──Surface_Base_Altitude_1064
│  ├──Surface_Integrated_Attenuated_Backscatter_1064
│  ├──Surface_1064_Integrated_Depolarization_Ratio
│  ├──Surface_1064_Integrated_Attenuated_Color_Ratio
│  ├──Surface_Detection_Flags_1064
│  ├──Surface_Overlying_Integrated_Attenuated_Backscatter_1064
│  ├──Surface_Scaled_RMS_Background_1064
│  ├──Surface_Peak_Signal_1064
│  └──Surface_Detections_333m_1064
├──Ocean_Derived_Column_Optical_Depth
│  ├──ODCOD_Effective_Optical_Depth_532
│  ├──ODCOD_Effective_Optical_Depth_532_Uncertainty
│  ├──ODCOD_QC_Flag_532
│  ├──ODCOD_Surface_Wind_Speeds_10m
│  └──ODCOD_Surface_Wind_Speed_Correction
├──Lidar_Data_Altitudes
├──Profile_ID
├──Latitude
├──Longitude
├──Profile_Time
├──Profile_UTC_Time
├──Day_Night_Flag
├──Off_Nadir_Angle
├──Solar_Zenith_Angle
├──Solar_Azimuth_Angle
├──Scattering_Angle
├──Spacecraft_Position
├──Parallel_Column_Reflectance_532
├──Parallel_Column_Reflectance_Uncertainty_532
├──Perpendicular_Column_Reflectance_532
├──Perpendicular_Column_Reflectance_Uncertainty_532
├──Column_Integrated_Attenuated_Backscatter_532
├──Column_IAB_Cumulative_Probability
├──Column_Particulate_Optical_Depth_Above_Opaque_Water_Cloud_532
├──Column_Particulate_Optical_Depth_Above_Opaque_Water_Cloud_Uncertainty_532
├──Tropopause_Height
├──Tropopause_Temperature
├──IGBP_Surface_Type
├──Snow_Ice_Surface_Type
├──DEM_Surface_Elevation
├──Minimum_Laser_Energy_532
├──Low_Energy_Mitigation_Column_QC_Flag
├──Number_Layers_Found
├──Scene_Flag
├──Low_Energy_Mitigation_Feature_QC_Flag
├──Layer_Top_Altitude
├──Layer_Base_Altitude
├──Layer_Top_Pressure
├──Midlayer_Pressure
├──Layer_Base_Pressure
├──Layer_Top_Temperature
├──Layer_Centroid_Temperature
├──Midlayer_Temperature
├──Layer_Base_Temperature
├──Opacity_Flag
├──Attenuated_Scattering_Ratio_Statistics_532
├──Attenuated_Backscatter_Statistics_532
├──Integrated_Attenuated_Backscatter_532
├──Integrated_Attenuated_Backscatter_Uncertainty_532
├──Attenuated_Backscatter_Statistics_1064
├──Integrated_Attenuated_Backscatter_1064
├──Integrated_Attenuated_Backscatter_Uncertainty_1064
├──Volume_Depolarization_Ratio_Statistics
├──Integrated_Volume_Depolarization_Ratio
├──Integrated_Volume_Depolarization_Ratio_Uncertainty
├──Attenuated_Total_Color_Ratio_Statistics
├──Integrated_Attenuated_Total_Color_Ratio
├──Integrated_Attenuated_Total_Color_Ratio_Uncertainty
├──Overlying_Integrated_Attenuated_Backscatter_532
├──Layer_IAB_QA_Factor
├──Feature_Classification_Flags
├──CAD_Score
├──Initial_CAD_Score
├──metadata.Product_ID
├──metadata.Date_Time_at_Granule_Start
├──metadata.Date_Time_at_Granule_End
├──metadata.Date_Time_of_Production
├──metadata.Number_of_Good_Profiles
├──metadata.Number_of_Bad_Profiles
├──metadata.Initial_Subsatellite_Latitude
├──metadata.Initial_Subsatellite_Longitude
├──metadata.Final_Subsatellite_Latitude
├──metadata.Final_Subsatellite_Longitude
├──metadata.Orbit_Number_at_Granule_Start
├──metadata.Orbit_Number_at_Granule_End
├──metadata.Orbit_Number_Change_Time
├──metadata.Path_Number_at_Granule_Start
├──metadata.Path_Number_at_Granule_End
├──metadata.Path_Number_Change_Time
├──metadata.Lidar_L1_Production_Date_Time
├──metadata.Number_of_Single_Shot_Records_in_File
├──metadata.Number_of_Average_Records_in_File
├──metadata.Number_of_Features_Found
├──metadata.Number_of_Cloud_Features_Found
├──metadata.Number_of_Aerosol_Features_Found
├──metadata.Number_of_Indeterminate_Features_Found
├──metadata.Ocean_Fresnel_Reflection_Coefficient_532
├──metadata.MERRA2_Wind_Uncertainty
├──metadata.AMSR_Wind_Correction_Uncertainty
├──metadata.Lidar_Data_Altitudes
├──metadata.GEOS_Version
├──metadata.GMAO_Files_Used
├──metadata.Classifier_Coefficients_Version_Number
├──metadata.Classifier_Coefficients_Version_Date
└──metadata.Production_Script

Stage 1

For swath data, coordinate arrays such as Latitude and Longitude are NOT the dimensions of the dataset. The first step before downloading is identifying the dimensions associated with the coordinate data. We can do that with the PyDAP dataset object already created above.

We want to download the following variables, identified by their fully qualified names:

  • /Latitude
  • /Longitude
  • /Day_Night_Flag

Before downloading, we need to identify any dimension that is also an array in the dataset. (There can be dimensions that are only named, meaning they exist only in the metadata and have no data associated with them in the file.) The PyDAP dataset holds all the information needed to identify any such variable in the remote file.

Python
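# Collect every dimension used by the three variables of interest
DIMS = list(set(pyds['Latitude'].dims + pyds['Longitude'].dims + pyds['Day_Night_Flag'].dims))
# Keep only the dimensions whose name also appears as a variable in the parent group of DIMS[1]
dims = [dim for dim in DIMS if dim.split("/")[-1] in pyds[("/").join(DIMS[1].split('/')[:-1])].variables()] 
print("Dimensions that are also arrays: ", dims)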
Jupyter cell
  Dimensions that are also arrays:  []

With this confirmation, we do not need any extra variables related to any coordinate data.
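The idea behind this check can be shown without any remote access. The sketch below describes a dataset as a plain dictionary that maps each variable to its dimensions, then reports the dimensions that are themselves variables. Both layouts and the helper `dims_that_are_arrays` are hypothetical and only illustrate the concept; they are not PyDAP's data model.

```python
def dims_that_are_arrays(variable_dims, wanted):
    """Return the dimensions of the wanted variables that are also variables.

    variable_dims maps a variable path to the tuple of its dimension paths.
    """
    used = {dim for name in wanted for dim in variable_dims[name]}
    return sorted(dim for dim in used if dim in variable_dims)


# Swath-like layout (made up): the dimension is only named, it holds no data
swath = {
    "/Latitude": ("/Record_Number",),
    "/Longitude": ("/Record_Number",),
    "/Day_Night_Flag": ("/Record_Number",),
}

# Gridded layout (made up): /lat and /lon are dimensions AND coordinate arrays
gridded = {
    "/lat": ("/lat",),
    "/lon": ("/lon",),
    "/sst": ("/lat", "/lon"),
}

print(dims_that_are_arrays(swath, ["/Latitude", "/Longitude", "/Day_Night_Flag"]))  # []
print(dims_that_are_arrays(gridded, ["/sst"]))  # ['/lat', '/lon']
```

In the gridded case the two coordinate arrays would have to be requested together with the data variable; in the swath case nothing extra is needed.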

TIP when working with level 2 data
Our CMR query returned all remote files with any data intersecting our bounding box, which could be a single data point. To reduce the size of any irrelevant data that gets downloaded, we first need to identify the spatial subsets of data per file. A good first order approximation to identify data within the bounding box is to first download Longitude data per file, and use it to identify the coordinate subsets, per file. This approach helps greatly reduce the amount of unnecessary data downloaded (a more strict approach would be to download both latitude and longitude, but in the vast majority of cases downloading Longitude is enough). Below we follow this approach, in addition to downloading a metadata flag that identifies if data covers nighttime or daytime.

Python
# Download coordinate data into local directory
dap_to_netcdf(cmr_urls, session=my_session,
              keep_variables=["/Longitude", "/Day_Night_Flag"],
              output_path=output_path)  # <--------- you need to define your own output_path
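Before moving on, it can help to confirm that one local file exists for every URL. The sketch below assumes the naming convention used by the next code cell of this tutorial (the granule name with its extension replaced by `.nc4`); the helper names `local_nc4_name` and `missing_downloads` and the example host are hypothetical.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlparse


def local_nc4_name(url):
    """Granule basename taken from the URL, with its extension replaced by .nc4."""
    basename = posixpath.basename(urlparse(url).path)
    stem, _extension = posixpath.splitext(basename)
    return f"{stem}.nc4"


def missing_downloads(urls, output_dir):
    """Return the URLs whose expected local file is not present in output_dir."""
    output_dir = Path(output_dir)
    return [url for url in urls if not (output_dir / local_nc4_name(url)).is_file()]


# Hypothetical URL built around the granule name shown in the tree listing above
example_url = ("https://opendap.example.org/granules/"
               "CAL_LID_L2_01kmCLay-Standard-V5-00.2013-03-01T20-43-09ZD.hdf")
print(local_nc4_name(example_url))
# CAL_LID_L2_01kmCLay-Standard-V5-00.2013-03-01T20-43-09ZD.nc4
```

An empty list from `missing_downloads(cmr_urls, output_path)` would indicate that every granule has a matching local file, under the naming assumption stated above.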

Spatial subset of data, specified by dimension slices for each granule

Python
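# Get data from Bounding Box
minLon, maxLon = bbox[0], bbox[2]

slices = []
final_urls = []
for url in cmr_urls:
    # local file written in Stage 1: granule name with ".hdf" replaced by ".nc4"
    # (output_path must end with a path separator)
    filename = output_path + f"{url.split('/')[-1][:-4]}.nc4"
    dt1 = xr.open_datatree(filename).load()
    # indices of the records whose longitude falls within the bounding box
    longitude = dt1['/Longitude']
    mask = (longitude >= minLon) & (longitude <= maxLon)
    idx = np.nonzero(mask.values)[0]
    if idx.size == 0:
        continue  # no record of this granule lies within the longitude range
    # Day_Night_Flag: 0 = day, 1 = night
    is_night = dt1['Day_Night_Flag'].isel(Record_Number=slice(idx[0], idx[-1])) == 1
    # keep the granule only if none of the selected records is flagged as night
    if not is_night.values.any():
        final_urls.append(url)
        slices.append({"/Record_Number": (idx[0], idx[-1])})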

print(f"\nOnly {len(final_urls)} out of the {len(cmr_urls)} remote files satisfy our Daylight Criteria!\n")
print("Sample subsetting slices:")
slices[:4]
Jupyter Cell Output

Only 268 out of the 550 remote files satisfy our Daylight Criteria!

Sample subsetting slices:

[{'/Record_Number': (12882, 15472)},
 {'/Record_Number': (14230, 16373)},
 {'/Record_Number': (11783, 14565)},
 {'/Record_Number': (13218, 15585)}]

Figure. Swath coordinate data for a remote file, identifying the subset of data of interest. Note that the code above does not download Latitude data. It was downloaded for this one remote file to visually confirm that the slice selects the correct data.
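The slice calculation, and the effect of using longitude alone, can be reproduced on a handful of made-up records. This is a minimal sketch in plain Python: the coordinates are invented and `bounding_slice` is a hypothetical helper, not the tutorial's NumPy code.

```python
def bounding_slice(values, lower, upper):
    """Return (first, last) indices of the values within [lower, upper], or None."""
    hits = [i for i, value in enumerate(values) if lower <= value <= upper]
    if not hits:
        return None
    return hits[0], hits[-1]


# Made-up swath records, ordered along the track
longitude = [-110.0, -113.0, -115.5, -117.0, -119.0, -120.5, -122.0]
latitude = [40.0, 35.0, 30.5, 28.0, 27.0, 25.0, 20.0]

first, last = bounding_slice(longitude, -121, -115)
print(first, last)  # 2 5

# A Python slice excludes its end index, so slice(first, last) stops at last - 1
print(list(range(first, last)))      # [2, 3, 4]
print(list(range(first, last + 1)))  # [2, 3, 4, 5]

# Records selected by longitude alone that fall outside the latitude range
outside_lat = [i for i in range(first, last + 1) if not 26.5 <= latitude[i] <= 31]
print(outside_lat)  # [5]
```

The longitude-only selection keeps one record south of the bounding box here, which is the price of the first-order approximation. A stricter selection would combine both coordinates, at the cost of downloading Latitude in Stage 1.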

Finally we clean the downloaded data to avoid filename collisions (NOTE replace output_path with your own!)

Terminal
$ cd output_path
$ rm CAL_LID_L2*.nc4
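The same cleanup can be done from within the notebook. The sketch below uses only the standard library; `remove_matching` is a hypothetical helper, and the default pattern mirrors the shell command above.

```python
from pathlib import Path


def remove_matching(output_dir, pattern="CAL_LID_L2*.nc4"):
    """Delete the files in output_dir that match pattern; return their names."""
    removed = []
    for path in sorted(Path(output_dir).glob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path.name)
    return removed
```

Calling `remove_matching(output_path)` returns the names of the deleted files, which makes it easy to confirm that only Stage 1 output was removed.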

Stage 2

Now we stream ONLY the data of interest, applying subsets by Variable Names and Spatial subsetting using the slices variable we just calculated, to each remote granule that meets our daylight criteria.

Python
# Define variables to download
# Will download a total of 34 variables!
keep_variables = [
    "/Lidar_Surface_Detection", # <----- ALL variables inside the group
    "/Ocean_Derived_Column_Optical_Depth", # <----- ALL variables inside the group
    "/Lidar_Data_Altitudes", "/Profile_ID", "/Latitude", "/Longitude", 
    "/Profile_Time", "/Profile_UTC_Time", "/Day_Night_Flag", "/Tropopause_Height", 
    "/Tropopause_Temperature",
]


# Stream the data with PyDAP data from the remote granules
# that meet the criteria, applying the subsetting slice to
# each of them, and to all variables BEFORE downloading
# (data-proximate subsetting!)

dap_to_netcdf(final_urls, session=my_session, 
              keep_variables = keep_variables,
              dim_slices= slices,
              output_path=output_path)
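The total of 34 variables can be checked against the tree listing printed earlier. The sketch below expands the two group names into their member variables and counts the result. It only mirrors that listing: `expand_groups` is a hypothetical helper and says nothing about how PyDAP resolves groups internally.

```python
# Name templates for the two lidar wavelengths, copied from the tree listing
SURFACE_TEMPLATES = [
    "Surface_Top_Altitude_{w}",
    "Surface_Base_Altitude_{w}",
    "Surface_Integrated_Attenuated_Backscatter_{w}",
    "Surface_{w}_Integrated_Depolarization_Ratio",
    "Surface_{w}_Integrated_Attenuated_Color_Ratio",
    "Surface_Detection_Flags_{w}",
    "Surface_Overlying_Integrated_Attenuated_Backscatter_{w}",
    "Surface_Scaled_RMS_Background_{w}",
    "Surface_Peak_Signal_{w}",
    "Surface_Detections_333m_{w}",
]

GROUPS = {
    "/Lidar_Surface_Detection": [
        template.format(w=w) for w in (532, 1064) for template in SURFACE_TEMPLATES
    ],
    "/Ocean_Derived_Column_Optical_Depth": [
        "ODCOD_Effective_Optical_Depth_532",
        "ODCOD_Effective_Optical_Depth_532_Uncertainty",
        "ODCOD_QC_Flag_532",
        "ODCOD_Surface_Wind_Speeds_10m",
        "ODCOD_Surface_Wind_Speed_Correction",
    ],
}

keep_variables = [
    "/Lidar_Surface_Detection",
    "/Ocean_Derived_Column_Optical_Depth",
    "/Lidar_Data_Altitudes", "/Profile_ID", "/Latitude", "/Longitude",
    "/Profile_Time", "/Profile_UTC_Time", "/Day_Night_Flag", "/Tropopause_Height",
    "/Tropopause_Temperature",
]


def expand_groups(names, groups):
    """Replace every group name with the full paths of its member variables."""
    expanded = []
    for name in names:
        if name in groups:
            expanded.extend(f"{name}/{child}" for child in groups[name])
        else:
            expanded.append(name)
    return expanded


expanded = expand_groups(keep_variables, GROUPS)
print(len(expanded))  # 34
```

The count breaks down as 20 variables in the surface detection group, 5 in the ocean-derived optical depth group, and 9 root-level variables.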


References

Getzewich, B. (2025). CALIPSO Lidar Level 2 1 km Cloud Layer, V5-00. NASA Langley Atmospheric Science Data Center Distributed Active Archive Center. https://doi.org/10.5067/CALIOP/CALIPSO/CAL_LID_L2_01KMCLAY-STANDARD-V5-00

Cite this Tutorial

Citation
Jimenez-Urias, M. A. (2026). Access Cloud and Aerosol Lidar (Swath) Data From CALIPSO Via OPeNDAP. Zenodo. https://doi.org/10.5281/zenodo.19477128
BibTeX
@misc{jimenez_urias_2026_19477128,
  author       = {Jimenez-Urias, Miguel Angel},
  title        = {Access Cloud and Aerosol Lidar (Swath) Data From CALIPSO Via OPeNDAP},
  month        = apr,
  year         = 2026,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.19477128},
  url          = {https://doi.org/10.5281/zenodo.19477128},
}