Downloading global CMIP data

This notebook demonstrates how to download global CMIP6 files based on your criteria (variables, models, experiments, etc.). It only covers downloading the raw CMIP6 files, not formatting them for ELM (that functionality does not yet exist).

dapper uses a Pangeo-hosted CMIP repository because we found ESGF difficult to work with due to the transience and intermittent availability of its nodes. The Pangeo archive standardizes everything into a quickly searchable and downloadable collection, but it is not a perfect mirror of all the data available across ESGF. If you're not finding what you need here, you may have to look in ESGF. Note that Google Earth Engine also hosts a downscaled set of CMIP6 models/variables, but it includes only a limited set of variables (not everything needed for ELM runs), so we do not provide functionality for sampling it.

Searching and downloading from the Pangeo archive does not require an account, so unlike ERA5-Land data that needs a Google Earth Engine account, this should work straight out of the box.

Similar to working with ERA5-Land Hourly data, here we will specify a params dictionary and then send our request. Let's take a closer look at these params; a fully specified example dictionary follows the table below.

| Key | Definition | Examples |
| --- | --- | --- |
| models | Climate models (or "sources") that produced the simulation data, each with unique physics, resolution, and configurations. | CESM2, IPSL-CM6A-LR, CanESM5, MPI-ESM1-2-HR |
| variables | Climate variables simulated by the models, including atmospheric, oceanic, and land-surface data. | pr, tas, psl, ua |
| experiment | Predefined scenarios that specify the forcing conditions used in climate simulations. | historical, ssp245, ssp370, ssp585, piControl |
| table | Frequency and domain of the model output data. | Amon, day, Omon, Lmon |
| ensemble | Identifier specifying the realization, initialization, physics, and forcing configuration of the model run. | r1i1p1f1, r2i1p1f1, r1i2p1f2 |
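
For reference, here is a sketch of a params dictionary with every key specified. The particular models chosen here are only illustrative, and the cell further below shows the query we will actually run.

# Illustrative only: a fully specified params dictionary.
# Swap in whichever models, variables, experiment, table, and
# ensemble member match the data you actually need.
params_full = {
    "models": ["CESM2", "IPSL-CM6A-LR"],
    "variables": ["pr", "tas"],
    "experiment": "historical",
    "table": ["Amon"],
    "ensemble": "r1i1p1f1",
}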

You do not need to specify all of these. For example, if you're not sure which models you want, just leave that key out and the query will return every model that matches your other criteria. Let's try it out.

from pathlib import Path
from dapper.met import cmip_utils as cu

# We will leave model selections out for now
params = {
    "variables": ["pr", "tas"],
    "experiment": "historical",
    "table": ["Amon"],
    "ensemble": "r1i1p1f1",
}

available = cu.find_available_data(params)

print(available)
    activity_id       institution_id          source_id experiment_id  \
0          CMIP            NOAA-GFDL           GFDL-CM4    historical   
1          CMIP            NOAA-GFDL           GFDL-CM4    historical   
2          CMIP                 IPSL       IPSL-CM6A-LR    historical   
3          CMIP                 IPSL       IPSL-CM6A-LR    historical   
4          CMIP            NASA-GISS        GISS-E2-1-G    historical   
..          ...                  ...                ...           ...   
103        CMIP                 IPSL  IPSL-CM6A-LR-INCA    historical   
104        CMIP                KIOST          KIOST-ESM    historical   
105        CMIP                KIOST          KIOST-ESM    historical   
106        CMIP  EC-Earth-Consortium      EC-Earth3-Veg    historical   
107        CMIP  EC-Earth-Consortium      EC-Earth3-Veg    historical   

    member_id table_id variable_id grid_label  \
0    r1i1p1f1     Amon          pr        gr1   
1    r1i1p1f1     Amon         tas        gr1   
2    r1i1p1f1     Amon          pr         gr   
3    r1i1p1f1     Amon         tas         gr   
4    r1i1p1f1     Amon         tas         gn   
..        ...      ...         ...        ...   
103  r1i1p1f1     Amon         tas         gr   
104  r1i1p1f1     Amon         tas        gr1   
105  r1i1p1f1     Amon          pr        gr1   
106  r1i1p1f1     Amon          pr         gr   
107  r1i1p1f1     Amon         tas         gr   

                                                zstore  dcpp_init_year  \
0    gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...             NaN   
1    gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/histo...             NaN   
2    gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...             NaN   
3    gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR/histor...             NaN   
4    gs://cmip6/CMIP6/CMIP/NASA-GISS/GISS-E2-1-G/hi...             NaN   
..                                                 ...             ...   
103  gs://cmip6/CMIP6/CMIP/IPSL/IPSL-CM6A-LR-INCA/h...             NaN   
104  gs://cmip6/CMIP6/CMIP/KIOST/KIOST-ESM/historic...             NaN   
105  gs://cmip6/CMIP6/CMIP/KIOST/KIOST-ESM/historic...             NaN   
106  gs://cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-E...             NaN   
107  gs://cmip6/CMIP6/CMIP/EC-Earth-Consortium/EC-E...             NaN   

      version  
0    20180701  
1    20180701  
2    20180803  
3    20180803  
4    20180827  
..        ...  
103  20210216  
104  20210601  
105  20210928  
106  20211207  
107  20211207  

[108 rows x 11 columns]

Now we see a table where each row corresponds to a dataset. Note that each variable gets its own row, even when it comes from the same model, table, experiment, and ensemble. Let's say you only want pr and tas from 5 models instead of the full catalog. We will do this by collecting the index values of available for the rows we want to keep. Here, we'll keep the first 5 models that provide both variables, which gives us 10 rows.

💡 Note: You only need to do this step if you want to downselect from your returned query.

df = available.copy()

# Find models that provide both 'pr' and 'tas', and keep the first five of them
grouped = df.groupby("source_id")
keep = []
count = 0
for model, g in grouped:
    if "tas" in g["variable_id"].values and "pr" in g["variable_id"].values:
        keep.extend(g.index.tolist())
        count += 1
    if count > 4:
        break

# 'keep' holds index labels, so select the rows with .loc
df_export = df.loc[keep]
print(df_export)  # 5 models x 2 variables = 10 rows
   activity_id institution_id       source_id experiment_id member_id  \
54        CMIP   CSIRO-ARCCSS      ACCESS-CM2    historical  r1i1p1f1   
55        CMIP   CSIRO-ARCCSS      ACCESS-CM2    historical  r1i1p1f1   
59        CMIP          CSIRO   ACCESS-ESM1-5    historical  r1i1p1f1   
60        CMIP          CSIRO   ACCESS-ESM1-5    historical  r1i1p1f1   
78        CMIP            AWI   AWI-CM-1-1-MR    historical  r1i1p1f1   
87        CMIP            AWI   AWI-CM-1-1-MR    historical  r1i1p1f1   
72        CMIP            AWI  AWI-ESM-1-1-LR    historical  r1i1p1f1   
73        CMIP            AWI  AWI-ESM-1-1-LR    historical  r1i1p1f1   
6         CMIP            BCC     BCC-CSM2-MR    historical  r1i1p1f1   
7         CMIP            BCC     BCC-CSM2-MR    historical  r1i1p1f1   

   table_id variable_id grid_label  \
54     Amon         tas         gn   
55     Amon          pr         gn   
59     Amon          pr         gn   
60     Amon         tas         gn   
78     Amon          pr         gn   
87     Amon         tas         gn   
72     Amon         tas         gn   
73     Amon          pr         gn   
6      Amon         tas         gn   
7      Amon          pr         gn   

                                               zstore  dcpp_init_year  \
54  gs://cmip6/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/...             NaN   
55  gs://cmip6/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/...             NaN   
59  gs://cmip6/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/hist...             NaN   
60  gs://cmip6/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/hist...             NaN   
78  gs://cmip6/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/histor...             NaN   
87  gs://cmip6/CMIP6/CMIP/AWI/AWI-CM-1-1-MR/histor...             NaN   
72  gs://cmip6/CMIP6/CMIP/AWI/AWI-ESM-1-1-LR/histo...             NaN   
73  gs://cmip6/CMIP6/CMIP/AWI/AWI-ESM-1-1-LR/histo...             NaN   
6   gs://cmip6/CMIP6/CMIP/BCC/BCC-CSM2-MR/historic...             NaN   
7   gs://cmip6/CMIP6/CMIP/BCC/BCC-CSM2-MR/historic...             NaN   

     version  
54  20191108  
55  20191108  
59  20191115  
60  20191115  
78  20200511  
87  20200720  
72  20200212  
73  20200212  
6   20181126  
7   20181126  
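
If you prefer a vectorized approach, the loop above can be condensed with pandas set logic. The sketch below is an equivalent alternative, not part of dapper; has_both, first_five, and df_export_alt are names invented here for illustration.

# Hypothetical alternative to the loop above: flag models whose available
# variables include both 'pr' and 'tas', then keep the first five of them.
has_both = df.groupby("source_id")["variable_id"].apply(
    lambda v: {"pr", "tas"}.issubset(set(v))
)
first_five = has_both[has_both].index[:5]
df_export_alt = df[df["source_id"].isin(first_five)]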

Now that we have the set of models we want to download, let's download them! Note that these files are on the order of 100-500 MB apiece, so if you're just following along with this example, you may want to halt early or shrink df_export even further.

# Choose an output folder for the downloaded NetCDF files
# (relative paths are interpreted from your current working directory)
CMIP_OUT = Path(r'X:\Research\NGEE Arctic\CMIP output\dapper_tutorial')  # Change this to a directory on your machine
CMIP_OUT.mkdir(parents=True, exist_ok=True)

cu.download_pangeo(df_export, CMIP_OUT)

And if we look in CMIP_OUT, we should see all the files.

# Quick sanity check: show a few of the downloaded files
if not CMIP_OUT.exists():
    raise FileNotFoundError(
        f"Output directory not found: {CMIP_OUT.resolve()}\n"
        "Run the download cell above first, or update `path_out` to your chosen location."
    )

files = sorted([p for p in CMIP_OUT.rglob("*") if p.is_file()])

print(f"Downloaded {len(files)} files into: {CMIP_OUT.resolve()}")
for p in files[:20]:
    print(" -", p.relative_to(CMIP_OUT))
Downloaded 10 files into: X:\Research\NGEE Arctic\CMIP output\dapper_tutorial
 - pr_ACCESS-CM2_historical_r1i1p1f1.nc
 - pr_ACCESS-ESM1-5_historical_r1i1p1f1.nc
 - pr_AWI-CM-1-1-MR_historical_r1i1p1f1.nc
 - pr_AWI-ESM-1-1-LR_historical_r1i1p1f1.nc
 - pr_BCC-CSM2-MR_historical_r1i1p1f1.nc
 - tas_ACCESS-CM2_historical_r1i1p1f1.nc
 - tas_ACCESS-ESM1-5_historical_r1i1p1f1.nc
 - tas_AWI-CM-1-1-MR_historical_r1i1p1f1.nc
 - tas_AWI-ESM-1-1-LR_historical_r1i1p1f1.nc
 - tas_BCC-CSM2-MR_historical_r1i1p1f1.nc

What next?

dapper does not yet have an Adapter() for CMIP data, but it’s coming! You can use these “raw” downloads for analysis until then.
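
For example, you can open any of the downloaded NetCDF files with xarray and start exploring right away. This is just a quick sketch: it assumes xarray is installed and that each file stores its data under the CMIP variable_id (here, tas); the filename is taken from the listing above.

import xarray as xr

# Open one of the downloaded files and take a quick look
ds = xr.open_dataset(CMIP_OUT / "tas_ACCESS-CM2_historical_r1i1p1f1.nc")
print(ds)

# Simple unweighted mean of the tas field across all dimensions
print(float(ds["tas"].mean()))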