dapper.met package¶
Subpackages¶
Submodules¶
dapper.met.cmip_utils module¶
CMIP6 utilities (Pangeo / intake-esm)
Design goals:
- Separate search/listing of available datasets from sampling.
- Cache the catalog so repeated searches don’t re-parse the JSON.
- Support fast “dataset-first” sampling: open each remote zarr once, then compute means for many AOIs.
- Typical workflow:

    col = open_cmip6_catalog()
    df_all = search_cmip6(params, col=col)
    df_use = dedupe_latest(df_all)
    df_use = filter_complete(df_use, required_vars=params["variables"])
    df_use = df_use[df_use["source_id"].isin([…])]  # optional hard filter
    out = sample_bbox_means_for_aois(df_use, aois={…}, out_csv=…)
- dapper.met.cmip_utils.bounds_from_geojson(path)[source]¶
- Return type:
Tuple[Tuple[float, float], Tuple[float, float]]
- Parameters:
path (str | Path)
- Read a polygon GeoJSON/shapefile and return:
lat_bounds = (lat_min, lat_max)
lon_bounds = (lon_min, lon_max)
Requires geopandas.
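A minimal sketch of what this helper does, assuming it relies on geopandas’ total_bounds (the actual implementation may differ):

    import geopandas as gpd

    def bounds_sketch(path):
        # Read any vector format geopandas supports (GeoJSON, shapefile, ...)
        gdf = gpd.read_file(path)
        # total_bounds returns (lon_min, lat_min, lon_max, lat_max)
        lon_min, lat_min, lon_max, lat_max = gdf.total_bounds
        return (lat_min, lat_max), (lon_min, lon_max)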
- dapper.met.cmip_utils.cftime_date(string_date, sample_cftime)[source]¶
Convert a YYYY-MM-DD string to the same cftime type as sample_cftime.
- Parameters:
string_date (str)
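The conversion idea, sketched under the assumption that sample_cftime is a cftime datetime instance (e.g. cftime.DatetimeNoLeap) whose type can be constructed from year/month/day:

    def cftime_date_sketch(string_date, sample_cftime):
        # Parse "YYYY-MM-DD" and rebuild it with the same calendar-aware type.
        year, month, day = (int(part) for part in string_date.split("-"))
        return type(sample_cftime)(year, month, day)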
- dapper.met.cmip_utils.dedupe_latest(df)[source]¶
Deduplicate by keeping the latest ‘version’ (if present) for each dataset key.
In the Pangeo CMIP6 catalog, duplicates often exist for the same (model, experiment, member, table, grid, variable) with different versions.
- Return type:
DataFrame
- Parameters:
df (DataFrame)
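A rough pandas sketch of the deduplication rule (the package’s exact key columns may differ):

    key_cols = ["source_id", "experiment_id", "member_id",
                "table_id", "grid_label", "variable_id"]
    # Keep the row with the greatest 'version' within each dataset key.
    df_latest = (df.sort_values("version")
                   .drop_duplicates(subset=key_cols, keep="last"))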
- dapper.met.cmip_utils.download_pangeo(df, dir_out, lat=None, lon=None, lat_bounds=None, lon_bounds=None, polygon_path=None)[source]¶
Download CMIP6 data from Pangeo to NetCDF, with optional spatial subsetting.
Note: only one of (lat/lon), (lat_bounds/lon_bounds), or (polygon_path) should be provided.
- Parameters:
df (DataFrame)
dir_out (str | Path)
lat (float | None)
lon (float | None)
lat_bounds (Tuple[float, float] | None)
lon_bounds (Tuple[float, float] | None)
polygon_path (str | Path | None)
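Illustrative call using a bounding box (coordinates and paths are placeholders); exactly one of the three spatial options is given:

    download_pangeo(
        df_use,                          # output of search_cmip6/dedupe_latest
        dir_out="cmip6_nc",
        lat_bounds=(35.0, 40.0),
        lon_bounds=(-110.0, -100.0),
    )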
- dapper.met.cmip_utils.extract_vars_from_files(files, start_date, end_date, path_out)[source]¶
Merge CMIP6 NetCDF files with mixed calendars using CFDatetimeCoder. Slow but robust.
- Parameters:
files (Iterable[str | Path])
start_date (str)
end_date (str)
path_out (str | Path)
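Illustrative call, assuming the NetCDF files downloaded above live in cmip6_nc/:

    from pathlib import Path

    extract_vars_from_files(
        files=sorted(Path("cmip6_nc").glob("*.nc")),
        start_date="2015-01-01",
        end_date="2100-12-31",
        path_out="cmip6_merged.nc",
    )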
- dapper.met.cmip_utils.filter_complete(df, required_vars, group_cols=None)[source]¶
Keep only dataset groups that contain all required variables.
- Return type:
DataFrame
- Parameters:
df (DataFrame)
required_vars (Sequence[str])
group_cols (Sequence[str] | None)
- By default, completeness is enforced per:
(source_id, experiment_id, member_id, table_id, grid_label)
This prevents “mixing” variables from different grids/members/experiments.
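A sketch of the completeness rule, assuming the intake-esm matches dataframe has a variable_id column: a group survives only if it contains every required variable.

    group_cols = ["source_id", "experiment_id", "member_id", "table_id", "grid_label"]
    required = set(params["variables"])
    # Broadcast a per-group boolean: does this group cover all required variables?
    complete = (df.groupby(group_cols)["variable_id"]
                  .transform(lambda s: required.issubset(set(s))))
    df_complete = df[complete]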
- dapper.met.cmip_utils.find_available_data(params, col=None)[source]¶
Backwards-compatible wrapper around the new search/filter approach.
Note
This is now fast because open_cmip6_catalog is cached.
It enforces completeness across all requested variables.
- Return type:
DataFrame
- Parameters:
params (dict)
- dapper.met.cmip_utils.open_cmip6_catalog(url='https://storage.googleapis.com/cmip6/pangeo-cmip6.json')[source]¶
Open (and cache) the Pangeo CMIP6 intake-esm catalog.
Important
intake-esm registers its plugin with intake, so the catalog is opened via intake.open_esm_datastore(…).
- Parameters:
url (str)
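A minimal sketch of a cached opener, assuming intake-esm is installed (the real function may cache differently):

    from functools import lru_cache
    import intake

    @lru_cache(maxsize=None)
    def open_catalog(url="https://storage.googleapis.com/cmip6/pangeo-cmip6.json"):
        # intake-esm provides the esm_datastore driver used here.
        return intake.open_esm_datastore(url)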
- dapper.met.cmip_utils.sample_bbox_means_for_aois(df, aois, out_csv=None, time_min=None, time_max=None, show_progress=True, *, out_dir=None, chunk_format='parquet', resume=True, return_df=True, fail_log=None, retries=3, retry_backoff=1.0)[source]¶
- Dataset-first sampler:
Loop over datasets (rows of df), opening each zarr store once.
For each dataset, compute a bbox-mean time series for every AOI.
- Crash-resilient mode (recommended):
Set out_dir=… to write ONE chunk file per dataset row as you go.
If resume=True, already-written chunks are skipped on rerun.
If return_df=True, the function returns the concatenation of all chunks in out_dir (including chunks from prior runs).
Notes
“dataset” here is a single (model, experiment, member, table, grid, variable) zarr store.
Parquet append to a single file is intentionally avoided; parquet is much happier as a directory of files.
- Parameters:
df (DataFrame) – output of search_cmip6/find_available_data (ideally deduped + filtered).
aois (Dict[str, Tuple[Tuple[float, float], Tuple[float, float]]]) – dict mapping aoi_id -> ((lat_min, lat_max), (lon_min, lon_max)).
out_csv (Union[str, Path, None]) – optional final combined CSV to write (built from in-memory output or by reading chunks).
time_min/time_max – optional ISO-like strings for time slicing.
show_progress (bool) – use tqdm if available.
out_dir (Union[str, Path, None]) – directory to write chunk outputs (recommended for long runs).
chunk_format (str) – ‘parquet’ (default) or ‘csv’. If the parquet engine is missing, it falls back to csv automatically.
resume (bool) – if True and out_dir provided, skip dataset rows that already have chunk output.
return_df (bool) – if True, return concatenated dataframe (reads chunks if out_dir provided).
fail_log (Union[str, Path, None]) – optional path to a log file for failures; failures are logged and the run continues.
retries (int) – number of attempts per dataset row on transient errors.
retry_backoff (float) – base backoff seconds (exponential) between retries.
time_min (str | None)
time_max (str | None)
- Return type:
DataFrame
- Returns:
Combined long-format dataframe, unless return_df=False (then returns empty dataframe).
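Illustrative crash-resilient run (AOI names, bounds, and paths are placeholders):

    aois = {
        # aoi_id -> ((lat_min, lat_max), (lon_min, lon_max))
        "upper_basin": ((38.0, 41.0), (-108.0, -105.0)),
        "lower_basin": ((33.0, 36.0), (-115.0, -111.0)),
    }
    ts = sample_bbox_means_for_aois(
        df_use,
        aois=aois,
        out_dir="cmip6_chunks",          # one chunk file per dataset row; resumable
        out_csv="cmip6_bbox_means.csv",  # optional combined CSV at the end
        time_min="2015-01-01",
        time_max="2100-12-31",
    )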
- dapper.met.cmip_utils.search_cmip6(params, col=None)[source]¶
Search the CMIP6 catalog and return the raw matches dataframe.
- params keys (all optional):
experiment: list[str] -> experiment_id
table: str | list[str] -> table_id
variables: list[str] -> variable_id
ensemble: str | list[str] -> member_id
models: list[str] -> source_id
grid: str | list[str] -> grid_label
- Return type:
DataFrame
- Returns:
pd.DataFrame of the intake-esm matches (metadata only).
- Parameters:
params (dict)
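Example params dict built from the keys above (identifiers are illustrative CMIP6 values):

    params = {
        "experiment": ["historical", "ssp245"],
        "table": "Amon",
        "variables": ["tas", "pr"],
        "ensemble": "r1i1p1f1",
        "models": ["CESM2", "EC-Earth3"],
        "grid": "gn",
    }
    col = open_cmip6_catalog()
    df_all = search_cmip6(params, col=col)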
dapper.met.exporter module¶
Meteorological data export pipelines.
- class dapper.met.exporter.Exporter(adapter, src_path, *, domain, out_dir=None, calendar='noleap', dtime_resolution_hrs=1, dtime_units='days', dformat='BYPASS', append_attrs=None, chunks=None, include_vars=None, exclude_vars=None)[source]¶
Bases: object
Source-agnostic meteorological exporter.
This class orchestrates a two-pass pipeline that ingests time-sharded CSVs for many sites/cells, preprocesses them via a pluggable adapter, and writes ELM-ready NetCDF outputs in two layouts:
"cellset" – one NetCDF per variable with dims ('DTIME', 'lat', 'lon') (global packing; sparse lat/lon axes are OK).
"sites" – one directory per site; each directory contains one NetCDF per variable with dims ('n', 'DTIME') where n=1 (per-site packing).
Exporter is source-agnostic: all dataset-specific logic (file discovery, unit conversions, renaming to ELM short names, etc.) lives in an adapter that implements the BaseAdapter interface (e.g., an ERA5Adapter). The exporter handles staging (CSV → per-site parquet), global DTIME axis creation, packing scans, chunking, and NetCDF I/O.
- Parameters:
adapter (BaseAdapter) – Implements: discover_files, normalize_locations, preprocess_shard, required_vars, and pack_params.
csv_directory (str or pathlib.Path) – Directory containing time-sharded CSV files for all sites/cells.
out_dir (str or pathlib.Path) – Destination directory for NetCDF outputs and temporary parquet shards.
df_loc (pandas.DataFrame) – Locations table with at least columns ["gid", "lat", "lon"]; optional "zone". The adapter's normalize_locations validates columns, adds "lon_0-360", fills/validates "zone", and sorts for stable site order.
id_col (str, optional) – Kept for backward compatibility (unused when "gid" is assumed).
calendar ({"noleap", "standard"}, default "noleap") – Calendar for the numeric DTIME coordinate; Feb 29 is filtered for "noleap".
dtime_resolution_hrs (int, default 1) – Target time resolution in hours for the DTIME axis.
dtime_units ({"days", "hours"}, default "days") – Units of the numeric DTIME coordinate (e.g., "days since YYYY-MM-DD HH:MM:SS").
domain (Domain)
dformat (str)
append_attrs (dict | None)
- dformat : {"BYPASS", "DATM_MODE"}, default "BYPASS"
Target ELM format selector passed through to the adapter.
- append_attrs : dict, optional
Extra global NetCDF attributes to include in every file. The exporter also adds export_mode ("cellset" or "sites") and pack_scope ("global" or "per-site").
- chunks : tuple[int, ...], optional
Explicit NetCDF chunk sizes.
- include_vars / exclude_vars : Iterable[str], optional
Allow-/block-lists of ELM short names applied after preprocess. Meta columns {"gid", "time", "LATIXY", "LONGXY", "zone"} are always kept.
Side Effects¶
Creates a temporary directory of per-site parquet shards under out_dir.
Writes NetCDF files to out_dir in the chosen layout.
Writes a zone_mappings.txt file either at the root (cellset) or inside each site directory (sites).
Notes
Packing: global packing for cellset; per-site packing for sites.
Required columns: CSV shards and df_loc both use "gid"; CSVs include the adapter's date/time column (renamed to "time" during preprocess).
Combined (lat/lon) layout: does not enforce regular grids; axes are the unique sorted lat/lon from df_loc (sparse OK).
- run(*, pack_scope=None, filename=None, overwrite=False)[source]¶
Run the MET export for this exporter’s Domain.
- The output layout is derived from Domain.mode:
sites: writes <run_dir>/<gid>/MET/{prefix_}{var}.nc and a per-site zone_mappings.txt (always zone=01, id=1).
cellset: writes <run_dir>/MET/{prefix_}{var}.nc and a single zone_mappings.txt covering all locations (zones taken from df_loc, default 1).
- Parameters:
pack_scope – Optional packing strategy override. Defaults to per-site for sites and global for cellset outputs.
filename (str | None) – Optional filename prefix for output NetCDF files. If provided, each variable is written to {filename}_{var}.nc.
overwrite (bool) – If True, clears existing MET outputs before writing.
- Return type:
None
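A hedged end-to-end sketch, assuming an adapter instance (e.g. an ERA5Adapter) and a Domain object have already been constructed elsewhere in dapper:

    exporter = Exporter(
        adapter,                      # a BaseAdapter implementation, e.g. ERA5Adapter
        "met_csv_shards/",            # src_path: directory of time-sharded CSVs
        domain=domain,                # Domain whose mode selects "sites" or "cellset"
        out_dir="elm_met_out/",
        calendar="noleap",
        dtime_resolution_hrs=1,
        dformat="BYPASS",
    )
    exporter.run(overwrite=True)      # writes MET NetCDFs and zone_mappings.txt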
dapper.met.temporal module¶
Temporal helpers used by Exporter and adapters. NetCDF I/O is handled in dapper.met.writers. This module is intentionally small.
- dapper.met.temporal.create_dtime(df, calendar='standard', dtime_units='days', dtime_resolution_hrs=1.0)[source]¶
Construct a numeric DTIME axis and align data onto it at an arbitrary cadence. Accepts fractional hours, e.g., 0.5 (30 min), 0.3 (18 min), 1.5 (90 min).
- Parameters:
calendar (str)
dtime_units (str)
dtime_resolution_hrs (float)
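Illustrative call for an hourly noleap axis (the input dataframe and the exact return value are assumptions; see the source for details):

    dtime_result = create_dtime(
        df_site,                   # hypothetical dataframe with a datetime column
        calendar="noleap",
        dtime_units="days",
        dtime_resolution_hrs=1.0,  # fractional values such as 0.5 are accepted
    )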
dapper.met.validation module¶
dapper module: met.validation.
- dapper.met.validation.make_quicklooks(exporter=None, *, write_directory=None, mode=None, vars=None, gids=None, out_dir=None, max_vars=9)[source]¶
Create per-site PNG quicklooks after an export has finished.
- Supports all modes:
NetCDF: “cellset”, “sites”
Raw: “raw-site-parquet”, “raw-site-csv”
- Parameters:
exporter (Exporter or None) – Optionally pass the Exporter instance you used for run(…). REQUIRED for ‘cellset’ (to map gids to lat/lon via the normalized domain geometry, i.e. exporter.domain_norm or exporter.df_loc_norm).
write_directory (path-like or None) – Where the export outputs live. If omitted and exporter is given, uses exporter.write_directory.
mode ({"cellset","sites","raw-site-parquet","raw-site-csv"} or None) – Export mode. If None, auto-detected by looking under write_directory.
vars (list[str] or None) – Variables to plot. For NetCDF modes use ELM short names; for raw modes use raw column names. If None, sensible defaults are used; if those aren’t present, first few numeric columns are chosen.
gids (list[str] or None) – Subset of GIDs to plot. If None, plot all available.
out_dir (path-like or None) – Destination for PNGs. Defaults to <write_directory>/quicklooks.
max_vars (int) – When vars is None and no defaults match, cap the number of auto-picked numeric columns to avoid huge figures.
- Return type:
None
Notes
NetCDF modes require netCDF4 installed.
For ‘cellset’, pass the same exporter you ran with so we can use its df_loc_norm to locate each gid on the lat/lon axes.
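Typical post-export call for a ‘sites’ run (variable names and GIDs are placeholders):

    make_quicklooks(
        exporter=exporter,             # required for 'cellset', optional otherwise
        mode="sites",
        vars=["TBOT", "PRECTmms"],     # ELM short names for NetCDF modes
        gids=["site_001", "site_002"],
    )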
dapper.met.writers module¶
dapper module: met.writers.
- dapper.met.writers.append_met_netcdf(*, path_nc, var_name, data, indexers)[source]¶
Append data to variable var_name using indexers to select the region.
Notes
data should be float; netCDF4 will pack using var attrs (scale_factor/add_offset).
You can pass fewer indexers than dims; unspecified dims default to slice(None).
- Parameters:
var_name (str)
indexers (dict[str, int | slice])
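Illustrative append into a per-site file (path, variable name, and data array are placeholders); unspecified dims default to slice(None):

    import numpy as np

    append_met_netcdf(
        path_nc="elm_met_out/site_001/MET/TBOT.nc",
        var_name="TBOT",
        data=np.asarray(tbot_series, dtype="float64"),  # hypothetical 1-D series
        indexers={"n": 0},             # DTIME left unspecified -> slice(None)
    )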
- dapper.met.writers.initialize_met_netcdf(*, path_nc, var_name, dims, dim_lengths, dtime_name, dtime_vals, dtime_units, calendar, coord_specs, add_offset, scale_factor, dtype='i2', fill_value=32767, chunks, write_pattern='by_site', append_attrs=None, var_attrs=None, nc_format='NETCDF4_CLASSIC', zlib=True, shuffle=True, complevel=1)[source]¶
- Create a NetCDF file with:
provided dims
numeric DTIME coord
site/grid coords from coord_specs
packed int var with add_offset/scale_factor
If chunks is None, uses _compute_auto_chunks(…) tuned to write_pattern.
- Parameters:
var_name (str)
dims (tuple[str, ...])
dim_lengths (dict[str, int])
dtime_name (str)
dtime_units (str)
calendar (str)
coord_specs (list[dict])
add_offset (float)
scale_factor (float)
chunks (tuple[int, ...] | None)
write_pattern (str)
append_attrs (dict | None)
var_attrs (dict | None)
nc_format (str)
zlib (bool)
shuffle (bool)
complevel (int)
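Illustrative ‘cellset’ initialization; dtime_vals and coord_specs are assumed to have been prepared earlier (coord_specs is a list of dicts whose schema is defined by this module and is not reproduced here):

    import numpy as np

    initialize_met_netcdf(
        path_nc="elm_met_out/MET/TBOT.nc",
        var_name="TBOT",
        dims=("DTIME", "lat", "lon"),
        dim_lengths={"DTIME": len(dtime_vals), "lat": 25, "lon": 40},
        dtime_name="DTIME",
        dtime_vals=np.asarray(dtime_vals),
        dtime_units="days since 2015-01-01 00:00:00",
        calendar="noleap",
        coord_specs=coord_specs,
        add_offset=250.0,              # unpacked = packed * scale_factor + add_offset
        scale_factor=0.01,
        chunks=None,                   # None -> auto chunks tuned to write_pattern
        write_pattern="by_site",
    )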
Module contents¶
dapper module: met.__init__.