dapper.met package¶
Subpackages¶
Submodules¶
dapper.met.cmip_utils module¶
CMIP6 utilities (Pangeo / intake-esm)
Design goals:
- Separate search/listing of available datasets from sampling.
- Cache the catalog so repeated searches don’t re-parse the JSON.
- Support fast “dataset-first” sampling: open each remote zarr once, then compute means for many AOIs.
- Typical workflow:

    col = open_cmip6_catalog()
    df_all = search_cmip6(params, col=col)
    df_use = dedupe_latest(df_all)
    df_use = filter_complete(df_use, required_vars=params["variables"])
    df_use = df_use[df_use["source_id"].isin([…])]  # optional hard filter
    out = sample_bbox_means_for_aois(df_use, aois={…}, out_csv=…)
- dapper.met.cmip_utils.bounds_from_geojson(path)[source]¶
- Return type:
Tuple[Tuple[float, float], Tuple[float, float]]
- Parameters:
path (str | Path)
- Read a polygon GeoJSON/shapefile and return:
lat_bounds = (lat_min, lat_max)
lon_bounds = (lon_min, lon_max)
Requires geopandas.
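A minimal sketch of what this helper does, assuming it relies on geopandas’ total_bounds (the actual implementation may differ):

    import geopandas as gpd

    def bounds_sketch(path):
        # Read any vector format geopandas supports (GeoJSON, shapefile, ...)
        gdf = gpd.read_file(path)
        # total_bounds returns (lon_min, lat_min, lon_max, lat_max)
        lon_min, lat_min, lon_max, lat_max = gdf.total_bounds
        return (lat_min, lat_max), (lon_min, lon_max)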
- dapper.met.cmip_utils.cftime_date(string_date, sample_cftime)[source]¶
Convert a YYYY-MM-DD string to the same cftime type as sample_cftime.
- Parameters:
string_date (str)
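The conversion idea, sketched under the assumption that sample_cftime is a cftime datetime instance (e.g. cftime.DatetimeNoLeap) whose type can be constructed from year/month/day:

    def cftime_date_sketch(string_date, sample_cftime):
        # Parse "YYYY-MM-DD" and rebuild it with the same calendar-aware type.
        year, month, day = (int(part) for part in string_date.split("-"))
        return type(sample_cftime)(year, month, day)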
- dapper.met.cmip_utils.dedupe_latest(df)[source]¶
Deduplicate by keeping the latest ‘version’ (if present) for each dataset key.
In the Pangeo CMIP6 catalog, duplicates often exist for the same (model, experiment, member, table, grid, variable) with different versions.
- Return type:
DataFrame
- Parameters:
df (DataFrame)
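A rough pandas sketch of the deduplication rule (the package’s exact key columns may differ):

    key_cols = ["source_id", "experiment_id", "member_id",
                "table_id", "grid_label", "variable_id"]
    # Keep the row with the greatest 'version' within each dataset key.
    df_latest = (df.sort_values("version")
                   .drop_duplicates(subset=key_cols, keep="last"))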
- dapper.met.cmip_utils.download_pangeo(df, dir_out, lat=None, lon=None, lat_bounds=None, lon_bounds=None, polygon_path=None)[source]¶
Download CMIP6 data from Pangeo to NetCDF, with optional spatial subsetting.
Note: only one of (lat/lon), (lat_bounds/lon_bounds), or (polygon_path) should be provided.
- Parameters:
df (DataFrame)
dir_out (str | Path)
lat (float | None)
lon (float | None)
lat_bounds (Tuple[float, float] | None)
lon_bounds (Tuple[float, float] | None)
polygon_path (str | Path | None)
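Illustrative call using a bounding box (coordinates and paths are placeholders); exactly one of the three spatial options is given:

    download_pangeo(
        df_use,                          # output of search_cmip6/dedupe_latest
        dir_out="cmip6_nc",
        lat_bounds=(35.0, 40.0),
        lon_bounds=(-110.0, -100.0),
    )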
- dapper.met.cmip_utils.extract_vars_from_files(files, start_date, end_date, path_out)[source]¶
Merge CMIP6 NetCDF files with mixed calendars using CFDatetimeCoder. Slow but robust.
- Parameters:
files (Iterable[str | Path])
start_date (str)
end_date (str)
path_out (str | Path)
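Illustrative call, assuming the NetCDF files downloaded above live in cmip6_nc/:

    from pathlib import Path

    extract_vars_from_files(
        files=sorted(Path("cmip6_nc").glob("*.nc")),
        start_date="2015-01-01",
        end_date="2100-12-31",
        path_out="cmip6_merged.nc",
    )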
- dapper.met.cmip_utils.filter_complete(df, required_vars, group_cols=None)[source]¶
Keep only dataset groups that contain all required variables.
- Return type:
DataFrame
- Parameters:
df (DataFrame)
required_vars (Sequence[str])
group_cols (Sequence[str] | None)
- By default, completeness is enforced per:
(source_id, experiment_id, member_id, table_id, grid_label)
This prevents “mixing” variables from different grids/members/experiments.
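A sketch of the completeness rule, assuming the intake-esm matches dataframe has a variable_id column: a group survives only if it contains every required variable.

    group_cols = ["source_id", "experiment_id", "member_id", "table_id", "grid_label"]
    required = set(params["variables"])
    # Broadcast a per-group boolean: does this group cover all required variables?
    complete = (df.groupby(group_cols)["variable_id"]
                  .transform(lambda s: required.issubset(set(s))))
    df_complete = df[complete]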
- dapper.met.cmip_utils.find_available_data(params, col=None)[source]¶
Backwards-compatible wrapper around the new search/filter approach.
Note
This is now fast because open_cmip6_catalog is cached.
It enforces completeness across all requested variables.
- Return type:
DataFrame
- Parameters:
params (dict)
- dapper.met.cmip_utils.open_cmip6_catalog(url='https://storage.googleapis.com/cmip6/pangeo-cmip6.json')[source]¶
Open (and cache) the Pangeo CMIP6 intake-esm catalog.
Important
intake-esm registers its plugin with intake, so the catalog is opened via intake.open_esm_datastore(…).
- Parameters:
url (str)
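A minimal sketch of a cached opener, assuming intake-esm is installed (the real function may cache differently):

    from functools import lru_cache
    import intake

    @lru_cache(maxsize=None)
    def open_catalog(url="https://storage.googleapis.com/cmip6/pangeo-cmip6.json"):
        # intake-esm provides the esm_datastore driver used here.
        return intake.open_esm_datastore(url)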
- dapper.met.cmip_utils.sample_bbox_means_for_aois(df, aois, out_csv=None, time_min=None, time_max=None, show_progress=True, *, out_dir=None, chunk_format='parquet', resume=True, return_df=True, fail_log=None, retries=3, retry_backoff=1.0)[source]¶
- Dataset-first sampler:
Loop over datasets (rows of df), opening each zarr store once.
For each dataset, compute a bbox-mean time series for every AOI.
- Crash-resilient mode (recommended):
Set out_dir=… to write ONE chunk file per dataset row as you go.
If resume=True, already-written chunks are skipped on rerun.
If return_df=True, the function returns the concatenation of all chunks in out_dir (including chunks from prior runs).
Notes
“dataset” here is a single (model, experiment, member, table, grid, variable) zarr store.
Parquet append to a single file is intentionally avoided; parquet is much happier as a directory of files.
- Parameters:
df (DataFrame) – output of search_cmip6/find_available_data (ideally deduped + filtered).
aois (Dict[str, Tuple[Tuple[float, float], Tuple[float, float]]]) – dict mapping aoi_id -> ((lat_min, lat_max), (lon_min, lon_max)).
out_csv (Union[str, Path, None]) – optional final combined CSV to write (built from in-memory output or by reading chunks).
time_min/time_max – optional ISO-like strings for time slicing.
show_progress (bool) – use tqdm if available.
out_dir (Union[str, Path, None]) – directory to write chunk outputs (recommended for long runs).
chunk_format (str) – ‘parquet’ (default) or ‘csv’. If the parquet engine is missing, it falls back to csv automatically.
resume (bool) – if True and out_dir provided, skip dataset rows that already have chunk output.
return_df (bool) – if True, return concatenated dataframe (reads chunks if out_dir provided).
fail_log (Union[str, Path, None]) – optional path to a log file for failures; failures are logged and the run continues.
retries (int) – number of attempts per dataset row on transient errors.
retry_backoff (float) – base backoff seconds (exponential) between retries.
time_min (str | None)
time_max (str | None)
- Return type:
DataFrame
- Returns:
Combined long-format dataframe, unless return_df=False (then returns empty dataframe).
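Illustrative crash-resilient run (AOI names, bounds, and paths are placeholders):

    aois = {
        # aoi_id -> ((lat_min, lat_max), (lon_min, lon_max))
        "upper_basin": ((38.0, 41.0), (-108.0, -105.0)),
        "lower_basin": ((33.0, 36.0), (-115.0, -111.0)),
    }
    ts = sample_bbox_means_for_aois(
        df_use,
        aois=aois,
        out_dir="cmip6_chunks",          # one chunk file per dataset row; resumable
        out_csv="cmip6_bbox_means.csv",  # optional combined CSV at the end
        time_min="2015-01-01",
        time_max="2100-12-31",
    )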
- dapper.met.cmip_utils.search_cmip6(params, col=None)[source]¶
Search the CMIP6 catalog and return the raw matches dataframe.
- params keys (all optional):
experiment: list[str] -> experiment_id
table: str | list[str] -> table_id
variables: list[str] -> variable_id
ensemble: str | list[str] -> member_id
models: list[str] -> source_id
grid: str | list[str] -> grid_label
- Return type:
DataFrame
- Returns:
pd.DataFrame of the intake-esm matches (metadata only).
- Parameters:
params (dict)
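Example params dict built from the keys above (identifiers are illustrative CMIP6 values):

    params = {
        "experiment": ["historical", "ssp245"],
        "table": "Amon",
        "variables": ["tas", "pr"],
        "ensemble": "r1i1p1f1",
        "models": ["CESM2", "EC-Earth3"],
        "grid": "gn",
    }
    col = open_cmip6_catalog()
    df_all = search_cmip6(params, col=col)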
dapper.met.exporter module¶
Meteorological data export pipelines.
- class dapper.met.exporter.Exporter(adapter, src_path, *, domain, out_dir=None, calendar='noleap', dtime_resolution_hrs=1, dtime_units='days', dformat='BYPASS', append_attrs=None, chunks=None, include_vars=None, exclude_vars=None)[source]¶
Bases: object
Source-agnostic meteorological exporter.
This class orchestrates a two-pass pipeline that ingests time-sharded CSVs for many sites/cells, preprocesses them via a pluggable adapter, and writes ELM-ready NetCDF outputs in two layouts:
"cellset" – one NetCDF per variable with dims ('DTIME', 'lat', 'lon') (global packing; sparse lat/lon axes are OK).
"sites" – one directory per site; each directory contains one NetCDF per variable with dims ('n', 'DTIME') where n=1 (per-site packing).
Exporter is source-agnostic: all dataset-specific logic (file discovery, unit conversions, renaming to ELM short names, etc.) lives in an adapter that implements the BaseAdapter interface (e.g., an ERA5Adapter). The exporter handles staging (CSV → per-site parquet), global DTIME axis creation, packing scans, chunking, and NetCDF I/O.
- Parameters:
adapter (BaseAdapter) – Implements: discover_files, normalize_locations, preprocess_shard, required_vars, and pack_params.
csv_directory (str or pathlib.Path) – Directory containing time-sharded CSV files for all sites/cells.
out_dir (str or pathlib.Path) – Destination directory for NetCDF outputs and temporary parquet shards.
df_loc (pandas.DataFrame) – Locations table with at least columns ["gid", "lat", "lon"]; optional "zone". The adapter's normalize_locations validates columns, adds "lon_0-360", fills/validates "zone", and sorts for stable site order.
id_col (str, optional) – Kept for backward compatibility (unused when "gid" is assumed).
calendar ({"noleap", "standard"}, default "noleap") – Calendar for the numeric DTIME coordinate; Feb 29 is filtered for "noleap".
dtime_resolution_hrs (int, default 1) – Target time resolution in hours for the DTIME axis.
dtime_units ({"days", "hours"}, default "days") – Units of the numeric DTIME coordinate (e.g., "days since YYYY-MM-DD HH:MM:SS").
domain (Domain)
dformat (str)
append_attrs (dict | None)
- dformat : {"BYPASS", "DATM_MODE"}, default "BYPASS"
Target ELM format selector passed through to the adapter.
- append_attrs : dict, optional
Extra global NetCDF attributes to include in every file. The exporter also adds export_mode ("cellset" or "sites") and pack_scope ("global" or "per-site").
- chunks : tuple[int, ...], optional
Explicit NetCDF chunk sizes.
- include_vars / exclude_vars : Iterable[str], optional
Allow-/block-lists of ELM short names applied after preprocess. Meta columns {"gid", "time", "LATIXY", "LONGXY", "zone"} are always kept.
Side Effects¶
Creates a temporary directory of per-site parquet shards under out_dir.
Writes NetCDF files to out_dir in the chosen layout.
Writes a zone_mappings.txt file either at the root (cellset) or inside each site directory (sites).
Notes
Packing: global packing for cellset; per-site packing for sites.
Required columns: CSV shards and df_loc both use "gid"; CSVs include the adapter's date/time column (renamed to "time" during preprocess).
Combined (lat/lon) layout: does not enforce regular grids; axes are the unique sorted lat/lon from df_loc (sparse OK).
- run(*, pack_scope=None, filename=None, overwrite=False)[source]¶
Run the MET export for this exporter’s Domain.
- The output layout is derived from Domain.mode:
sites: writes <run_dir>/<gid>/MET/{prefix_}{var}.nc and a per-site zone_mappings.txt (always zone=01, id=1).
cellset: writes <run_dir>/MET/{prefix_}{var}.nc and a single zone_mappings.txt covering all locations (zones taken from df_loc, default 1).
- Parameters:
pack_scope – Optional packing strategy override. Defaults to per-site for sites and global for cellset outputs.
filename (str | None) – Optional filename prefix for output NetCDF files. If provided, each variable is written to {filename}_{var}.nc.
overwrite (bool) – If True, clears existing MET outputs before writing.
- Return type:
None
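A hedged end-to-end sketch, assuming an adapter instance (e.g. an ERA5Adapter) and a Domain object have already been constructed elsewhere in dapper:

    exporter = Exporter(
        adapter,                      # a BaseAdapter implementation, e.g. ERA5Adapter
        "met_csv_shards/",            # src_path: directory of time-sharded CSVs
        domain=domain,                # Domain whose mode selects "sites" or "cellset"
        out_dir="elm_met_out/",
        calendar="noleap",
        dtime_resolution_hrs=1,
        dformat="BYPASS",
    )
    exporter.run(overwrite=True)      # writes MET NetCDFs and zone_mappings.txt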
dapper.met.temporal module¶
Temporal helpers used by Exporter and adapters. NetCDF I/O is handled in dapper.met.writers. This module is intentionally small.
- dapper.met.temporal.create_dtime(df, calendar='standard', dtime_units='days', dtime_resolution_hrs=1.0)[source]¶
Construct a numeric DTIME axis and align data onto it at an arbitrary cadence. Accepts fractional hours, e.g., 0.5 (30 min), 0.3 (18 min), 1.5 (90 min).
- Parameters:
calendar (str)
dtime_units (str)
dtime_resolution_hrs (float)
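Illustrative call for an hourly noleap axis (the input dataframe and the exact return value are assumptions; see the source for details):

    dtime_result = create_dtime(
        df_site,                   # hypothetical dataframe with a datetime column
        calendar="noleap",
        dtime_units="days",
        dtime_resolution_hrs=1.0,  # fractional values such as 0.5 are accepted
    )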
dapper.met.validation module¶
dapper module: met.validation.
- dapper.met.validation.make_quicklooks(exporter=None, *, write_directory=None, mode=None, vars=None, gids=None, out_dir=None, max_vars=9)[source]¶
Create per-site PNG quicklooks after an export has finished.
- Supports all modes:
NetCDF: “cellset”, “sites”
Raw: “raw-site-parquet”, “raw-site-csv”
- Parameters:
exporter (Exporter or None) – Optionally pass the Exporter instance you used for run(…). REQUIRED for ‘cellset’ (to map gids to lat/lon via the normalized domain geometry, i.e. exporter.domain_norm or exporter.df_loc_norm).
write_directory (path-like or None) – Where the export outputs live. If omitted and exporter is given, uses exporter.write_directory.
mode ({"cellset","sites","raw-site-parquet","raw-site-csv"} or None) – Export mode. If None, auto-detected by looking under write_directory.
vars (list[str] or None) – Variables to plot. For NetCDF modes use ELM short names; for raw modes use raw column names. If None, sensible defaults are used; if those aren’t present, first few numeric columns are chosen.
gids (list[str] or None) – Subset of GIDs to plot. If None, plot all available.
out_dir (path-like or None) – Destination for PNGs. Defaults to <write_directory>/quicklooks.
max_vars (int) – When vars is None and no defaults match, cap the number of auto-picked numeric columns to avoid huge figures.
- Return type:
None
Notes
NetCDF modes require netCDF4 installed.
For ‘cellset’, pass the same exporter you ran with so we can use its df_loc_norm to locate each gid on the lat/lon axes.
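Typical post-export call for a ‘sites’ run (variable names and GIDs are placeholders):

    make_quicklooks(
        exporter=exporter,             # required for 'cellset', optional otherwise
        mode="sites",
        vars=["TBOT", "PRECTmms"],     # ELM short names for NetCDF modes
        gids=["site_001", "site_002"],
    )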
dapper.met.writers module¶
dapper module: met.writers.
- dapper.met.writers.append_met_netcdf(*, path_nc, var_name, data, indexers)[source]¶
Append data to variable var_name using indexers to select the region.
Notes
data should be float; netCDF4 will pack using var attrs (scale_factor/add_offset).
You can pass fewer indexers than dims; unspecified dims default to slice(None).
- Parameters:
var_name (str)
indexers (dict[str, int | slice])
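Illustrative append into a per-site file (path, variable name, and data array are placeholders); unspecified dims default to slice(None):

    import numpy as np

    append_met_netcdf(
        path_nc="elm_met_out/site_001/MET/TBOT.nc",
        var_name="TBOT",
        data=np.asarray(tbot_series, dtype="float64"),  # hypothetical 1-D series
        indexers={"n": 0},             # DTIME left unspecified -> slice(None)
    )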
- dapper.met.writers.initialize_met_netcdf(*, path_nc, var_name, dims, dim_lengths, dtime_name, dtime_vals, dtime_units, calendar, coord_specs, add_offset, scale_factor, dtype='i2', fill_value=32767, chunks, write_pattern='by_site', append_attrs=None, var_attrs=None, nc_format='NETCDF4_CLASSIC', zlib=True, shuffle=True, complevel=1)[source]¶
- Create a NetCDF file with:
provided dims
numeric DTIME coord
site/grid coords from coord_specs
packed int var with add_offset/scale_factor
If chunks is None, uses _compute_auto_chunks(…) tuned to write_pattern.
- Parameters:
var_name (str)
dims (tuple[str, ...])
dim_lengths (dict[str, int])
dtime_name (str)
dtime_units (str)
calendar (str)
coord_specs (list[dict])
add_offset (float)
scale_factor (float)
chunks (tuple[int, ...] | None)
write_pattern (str)
append_attrs (dict | None)
var_attrs (dict | None)
nc_format (str)
zlib (bool)
shuffle (bool)
complevel (int)
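Illustrative ‘cellset’ initialization; dtime_vals and coord_specs are assumed to have been prepared earlier (coord_specs is a list of dicts whose schema is defined by this module and is not reproduced here):

    import numpy as np

    initialize_met_netcdf(
        path_nc="elm_met_out/MET/TBOT.nc",
        var_name="TBOT",
        dims=("DTIME", "lat", "lon"),
        dim_lengths={"DTIME": len(dtime_vals), "lat": 25, "lon": 40},
        dtime_name="DTIME",
        dtime_vals=np.asarray(dtime_vals),
        dtime_units="days since 2015-01-01 00:00:00",
        calendar="noleap",
        coord_specs=coord_specs,
        add_offset=250.0,              # unpacked = packed * scale_factor + add_offset
        scale_factor=0.01,
        chunks=None,                   # None -> auto chunks tuned to write_pattern
        write_pattern="by_site",
    )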
Module contents¶
dapper module: met.__init__.