dapper package

Subpackages

Module contents

dapper public package interface.

The supported public import surface is the set of names documented on this page:

Domain
ERA5Adapter
Exporter
sample_e5lh

Other submodules may be importable, but are not considered part of the stable “from dapper import …” API.

class dapper.Domain(name, mode, provided, support, cells, topounits=None, topounits_dim_name='topounit', topounits_id_col='topounit_id', topounits_gid_col='gid', met_support=None, topo_support=None, domain_nc=None, path_out=None, run_group=None)[source]¶

Bases: object

Canonical spatial object passed through the dapper pipeline.

Geometry views:

provided: exactly what the user supplied (provenance/plotting)
support: what sampling SHOULD use (may be simplified/processed later)
cells: what ELM RUNS on (site points now; per-cell geometries for cellset)

Prepared sampling views (set during the relevant pipeline step; not on init):

met_support
topo_support

Mode:

sites: one set of outputs per row (exporters loop internally)
cellset: one set of outputs total, including all rows

Parameters:

name (str)
mode (DomainMode)
provided (gpd.GeoDataFrame)
support (gpd.GeoDataFrame)
cells (gpd.GeoDataFrame)
topounits (gpd.GeoDataFrame | None)
topounits_dim_name (str)
topounits_id_col (str)
topounits_gid_col (str)
met_support (Optional[gpd.GeoDataFrame])
topo_support (Optional[gpd.GeoDataFrame])
domain_nc (Optional[Path])
path_out (Optional[Path])
run_group (Optional[str])

cells: gpd.GeoDataFrame¶

copy(**updates)[source]¶

Return a shallow copy of this Domain with updated fields.

Return type: Domain

domain_nc: Optional[Path] = None¶

elm_latlon_layout(decimals=6, use_lon_0360=True)[source]¶

Compute lat/lon axes and a gid -> (iy, ix) index map for ELM-style lat/lon layouts. Useful mainly when your cells actually lie on a lat/lon lattice (dense or sparse).

Return type:

tuple[ndarray, ndarray, dict[str, tuple[int, int]]]

Parameters:

decimals (int)
use_lon_0360 (bool)
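
As a rough illustration of what such a layout involves, the sketch below builds sorted lat/lon axes and a gid -> (iy, ix) map from plain tuples. It is not dapper's implementation: the function name latlon_layout and the (gid, lon, lat) tuple input are assumptions made for this example, and it returns lists where the real method returns ndarrays.

```python
def latlon_layout(cells, decimals=6, use_lon_0360=True):
    """Hypothetical helper: cells is an iterable of (gid, lon, lat) tuples."""
    rounded = []
    for gid, lon, lat in cells:
        if use_lon_0360:
            lon = lon % 360.0  # map negative longitudes onto [0, 360)
        # Rounding makes nearly identical coordinates share one axis entry.
        rounded.append((str(gid), round(lon, decimals), round(lat, decimals)))

    # Axes are the unique sorted coordinates, so sparse lattices are fine.
    lats = sorted({lat for _, _, lat in rounded})
    lons = sorted({lon for _, lon, _ in rounded})
    lat_pos = {value: i for i, value in enumerate(lats)}
    lon_pos = {value: i for i, value in enumerate(lons)}

    index_map = {gid: (lat_pos[lat], lon_pos[lon]) for gid, lon, lat in rounded}
    return lats, lons, index_map


cells = [("a", -150.0, 70.0), ("b", -149.5, 70.0), ("c", -150.0, 70.5)]
lats, lons, index_map = latlon_layout(cells)
```

Here cell "c" shares its longitude with "a" and its latitude with neither "a" nor "b" on the second row, so the three cells occupy three of the four positions of a 2 x 2 lattice.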

ensure_cells_lon_lat()[source]¶

Ensure cells contains lon and lat columns, derived from geometry if needed.

Return type: Domain

ensure_output_dirs(*, met=True)[source]¶

Create output directories implied by this Domain (and its runs, if mode=’sites’). Does not write any files.

Return type: None
Parameters: met (bool)

export_domain(*, filename='domain.nc', out_dir=None, overwrite=False, append_attrs=None, **kwargs)[source]¶

Export ELM domain NetCDF(s) for this Domain.

Output layout:

mode=’cellset’: <run_dir>/domain.nc
mode=’sites’: <run_dir>/<gid>/domain.nc

Return type:

dict[str, Path]

Returns:

dict[run_id, output_path]

Parameters:

filename (str)
out_dir (str | Path | None)
overwrite (bool)
append_attrs (dict | None)
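
The output layout described above can be pictured with a small path-building sketch. This is an illustration only: domain_nc_paths is a hypothetical helper, and keying the single cellset run by the Domain name is an assumption, since the run_id used for cellset mode is not stated here.

```python
from pathlib import PurePosixPath


def domain_nc_paths(run_dir, mode, gids, name="domain", filename="domain.nc"):
    """Hypothetical helper returning {run_id: path} for the documented layout."""
    root = PurePosixPath(run_dir)
    if mode == "cellset":
        # Assumption: the single run is keyed by the Domain name.
        return {name: root / filename}
    if mode == "sites":
        # One subdirectory (and one domain file) per gid.
        return {gid: root / gid / filename for gid in gids}
    raise ValueError(f"unknown mode: {mode!r}")


site_paths = domain_nc_paths("out/grp", "sites", ["s1", "s2"])
cellset_paths = domain_nc_paths("out/grp", "cellset", ["s1", "s2"], name="arctic")
```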

export_landuse(src_path, *, filename='landuse_timeseries.nc', out_dir=None, overwrite=False, append_attrs=None, **kwargs)[source]¶

Export landuse timeseries NetCDF(s) for this Domain.

Return type:

dict[str, Path]

Parameters:

src_path (str | Path)
filename (str)
out_dir (str | Path | None)
overwrite (bool)
append_attrs (dict | None)

export_met(src_path, *, adapter, out_dir=None, filename=None, overwrite=False, append_attrs=None, pack_scope=None, **kwargs)[source]¶

Export meteorological forcing NetCDF(s) for this Domain.

Parameters:

src_path (Path-like) – Directory containing input CSV(s) for the adapter.
adapter (object) – Adapter instance implementing the met adapter protocol.
out_dir (Path-like, optional) – Override output root. Defaults to Domain.run_dir.
filename (str, optional) – Optional filename prefix for output NetCDFs. If provided, each var is written to ‘{filename}_{var}.nc’.
overwrite (bool) – If False, raises if MET output(s) already exist.
append_attrs (dict | None)

Return type:

dict[run_id, met_dir]

export_surface(src_path, *, filename='surfdata.nc', out_dir=None, overwrite=False, append_attrs=None, **kwargs)[source]¶

Export surface NetCDF(s) for this Domain.

Return type:

dict[str, Path]

Parameters:

src_path (str | Path)
filename (str)
out_dir (str | Path | None)
overwrite (bool)
append_attrs (dict | None)

classmethod from_elm_domain(path_nc, *, name=None, mask_name='mask', frac_name='frac', frac_threshold=0.0, path_out=None, run_group=None)[source]¶

Build a Domain from an ELM domain NetCDF. This naturally produces a ‘cellset’.

Return type:

Domain

Parameters:

path_nc (str | Path)
name (str | None)
mask_name (str)
frac_name (str)
frac_threshold (float)
path_out (str | Path | None)
run_group (str | None)

classmethod from_file(path, *, name=None, layer=None, id_col='gid', mode=None, cell_kind=None, path_out=None, run_group=None)[source]¶

Load a geospatial file (e.g., GeoPackage, Shapefile) and construct a Domain.

Return type:

Domain

Parameters:

path (str | Path)
name (str | None)
layer (str | None)
id_col (str)
mode (Literal['sites', 'cellset'] | None)
cell_kind (Literal['site_points', 'as_provided'] | None)
path_out (str | Path | None)
run_group (str | None)

classmethod from_gdf(gdf, **kwargs)[source]¶

Alias for Domain.from_provided().

Return type: Domain
Parameters: gdf (geopandas.GeoDataFrame | DataFrame)

classmethod from_geometry(geometry, *, gid='site', name='domain', mode='cellset', cell_kind='site_points', path_out=None, run_group=None)[source]¶

Construct a single-feature Domain from a shapely geometry.

Return type:

Domain

Parameters:

geometry (shapely.geometry.base.BaseGeometry)
gid (str)
name (str)
mode (Literal['sites', 'cellset'])
cell_kind (Literal['site_points', 'as_provided'])
path_out (str | Path | None)
run_group (str | None)

classmethod from_provided(provided, *, name='domain', mode=None, id_col='gid', support=None, cells=None, cell_kind=None, domain_nc=None, path_out=None, run_group=None)[source]¶

Construct a Domain from a provided geometry table.

The input can be a GeoDataFrame (preferred) or a DataFrame with a geometry column. Use mode to choose between a single-run cellset or a multi-run sites container. If cells is not provided, cells are derived from the provided/support geometry depending on cell_kind.

Return type:

Domain

Parameters:

provided (geopandas.GeoDataFrame | DataFrame)
name (str)
mode (Literal['sites', 'cellset'] | None)
id_col (str)
support (geopandas.GeoDataFrame | DataFrame | None)
cells (geopandas.GeoDataFrame | DataFrame | None)
cell_kind (Literal['site_points', 'as_provided'] | None)
domain_nc (str | Path | None)
path_out (str | Path | None)
run_group (str | None)

property gdf: geopandas.GeoDataFrame¶

Backwards-compat alias for older code.

Historically dapper exposed a single GeoDataFrame on the domain object (often a df_loc-style table with columns like gid/lon/lat/geometry). The closest equivalent is cells (the run-level geometry table).

property gids: list[str]¶

List of gid values (as strings) for the current cell table.

property group_name: str¶

Name of the output group directory for this Domain.

has_topounits()[source]¶

Return True if this Domain has a non-empty topounits table attached.

Return type: bool

iter_runs()[source]¶

Yield (run_id, run_domain) where run_domain is always a single-run ‘cellset’ Domain.

mode=’cellset’ -> yields exactly one run (self)
mode=’sites’ -> yields one run per gid (single-row Domains)
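
The contract can be sketched with a stand-in class. MiniDomain below is hypothetical and carries only a name, a mode, and a tuple of gids; using the Domain name as the run_id in cellset mode is an assumption made for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MiniDomain:
    """Stand-in for Domain with just enough state to show the iteration."""
    name: str
    mode: str    # "sites" or "cellset"
    gids: tuple

    def iter_runs(self):
        if self.mode == "cellset":
            # A cellset is already a single run: yield self unchanged.
            yield self.name, self
            return
        for gid in self.gids:
            # Each site becomes its own single-row cellset run.
            yield gid, replace(self, name=gid, mode="cellset", gids=(gid,))


sites = MiniDomain(name="arctic", mode="sites", gids=("s1", "s2"))
site_runs = list(sites.iter_runs())
cellset_runs = list(MiniDomain("grid", "cellset", ("c1", "c2")).iter_runs())
```

Because every yielded run is a cellset, code that consumes runs needs only one code path regardless of the parent mode.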

make_topounits(*, binning, sources=None, combine='cartesian', combine_order=None, max_topounits=256, dem_source='arcticdem', export_scale='native', min_patch_pixels=None, target_pixels_per_topounit=500, target_scale=None, verbose=False, allow_slow_ncells=25)[source]¶

Convenience wrapper that computes topounits for this Domain and returns a new Domain with domain.topounits attached.

If this Domain has multiple rows (cellset/sites), it computes topounits per-row (per gid) using dapper.topounit.topomake.make_topounits_for_domain.
If this Domain has one row, it still goes through the same path (safe + consistent).

Users should not need to deal with ee.Geometry vs ee.Feature vs FeatureCollection here.

Return type:

Domain

Parameters:

binning (dict)
sources (list[str] | None)
combine (str)
max_topounits (int)
dem_source (str)
export_scale (str)
target_pixels_per_topounit (int)
target_scale (float | None)
verbose (bool)
allow_slow_ncells (int)

property met_dir: Path¶

Convenience property for Domain.path_met_dir().

met_support: Optional[gpd.GeoDataFrame] = None¶

mode: DomainMode¶

name: str¶

path_domain_nc(filename='domain.nc', run_id=None)[source]¶

Default output path for the domain NetCDF for this Domain (or a specific run_id).

Return type:

Path

Parameters:

filename (str)
run_id (str | None)

path_landuse_nc(filename='landuse_timeseries.nc', run_id=None)[source]¶

Default output path for the landuse NetCDF for this Domain (or a specific run_id).

Return type:

Path

Parameters:

filename (str)
run_id (str | None)

path_met_dir(run_id=None)[source]¶

Path to the MET output directory for this Domain (or a specific run_id).

Return type: Path
Parameters: run_id (str | None)

path_out: Optional[Path] = None¶

path_surface_nc(filename='surfdata.nc', run_id=None)[source]¶

Default output path for the surface NetCDF for this Domain (or a specific run_id).

Return type:

Path

Parameters:

filename (str)
run_id (str | None)

path_zone_mappings(filename='zone_mappings.txt', run_id=None)[source]¶

Default output path for zone_mappings.txt (under the MET directory).

Return type:

Path

Parameters:

filename (str)
run_id (str | None)

provided: gpd.GeoDataFrame¶

rep_points(*, source='support', step=None)[source]¶

Representative points for a given geometry view. If source=’support’ and step is provided, uses the prepared support for that step.

Return type:

GeoDataFrame

Parameters:

source (Literal['provided', 'support', 'cells'])
step (Literal['met', 'topounits'] | None)

property run_dir: Path¶

Directory holding the main run outputs for this Domain instance.

For top-level cellset or top-level sites container: path_out/<group_name>
For per-site/per-cellset run domains (created by iter_runs in sites mode): path_out/<group_name>/<domain.name>

run_group: Optional[str] = None¶

simplify_support(tolerance_m, *, step, preserve_topology=True, equal_area_epsg=6933)[source]¶

Simplify the support geometry for a step and store it as met_support/topo_support. Does NOT modify provided/support/cells.

Return type:

Domain

Parameters:

tolerance_m (float)
step (Literal['met', 'topounits'])
preserve_topology (bool)
equal_area_epsg (int)

support: gpd.GeoDataFrame¶

support_for(*, step=None)[source]¶

Return the geometry set that should be used for the given step.

step=None -> support
step="met" -> met_support if set else support
step="topounits" -> topo_support if set else support

Return type: GeoDataFrame
Parameters: step (Literal['met', 'topounits'] | None)
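
The fallback rule is simple enough to sketch as a free function. This is an illustration under stated assumptions, not the method's source: the standalone support_for below takes the three views as arguments, and strings stand in for GeoDataFrames.

```python
def support_for(support, met_support=None, topo_support=None, step=None):
    """Pick the step-specific view, falling back to the general support."""
    if step is None:
        return support
    if step == "met":
        return met_support if met_support is not None else support
    if step == "topounits":
        return topo_support if topo_support is not None else support
    raise ValueError(f"unknown step: {step!r}")


# Strings stand in for GeoDataFrames in this sketch.
chosen_met = support_for("full", met_support="simplified", step="met")
chosen_topo = support_for("full", met_support="simplified", step="topounits")
```

The explicit `is not None` test matters with real data: a pandas or GeoPandas frame cannot be used directly in a boolean context, so `met_support or support` would raise.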

to_df_loc(*, lon_col='lon', lat_col='lat', weight_col='weight', frac_col='frac', default_weight=1.0)[source]¶

Derived location/weight table from cells (internal glue; users shouldn’t need to touch).

Return type:

DataFrame

Parameters:

lon_col (str)
lat_col (str)
weight_col (str)
frac_col (str)
default_weight (float)

topo_support: Optional[gpd.GeoDataFrame] = None¶

topounits: gpd.GeoDataFrame | None = None¶

topounits_dim_name: str = 'topounit'¶

topounits_for_gid(gid)[source]¶

Return the topounits subset for a single gid (or None if no topounits).

Parameters: gid (str)

topounits_gid_col: str = 'gid'¶

topounits_id_col: str = 'topounit_id'¶

with_step_support(step, gdf)[source]¶

Attach a step-specific support GeoDataFrame (for “met” or “topounits”).

Return type:

Domain

Parameters:

step (Literal['met', 'topounits'])
gdf (geopandas.GeoDataFrame)

with_topounits(topounits, *, id_col='band_name', gid_col='gid', dim_name='topounit')[source]¶

Attach topounits GeoDataFrame to this Domain.

Ensures a stable id column name (self.topounits_id_col == ‘topounit_id’)
Ensures gid linkage column exists (self.topounits_gid_col)

Return type:

Domain

Parameters:

topounits (geopandas.GeoDataFrame)
id_col (str)
gid_col (str)
dim_name (str)

class dapper.ERA5Adapter[source]¶

Bases: BaseAdapter

ERA5-Land → ELM adapter.

This adapter implements the BaseAdapter interface for ERA5-Land hourly data. It handles source-specific details (file discovery, unit conversions, humidity diagnostics, renaming to ELM short names, and nonnegativity enforcement) so that the upstream Exporter can remain source-agnostic.

Responsibilities

discover_files: Find CSV shards in a directory and infer the overall (start_year, end_year) using their date coverage.
normalize_locations: Validate and normalize the locations table (adds lon_0-360, ensures/creates zone, stable sorting).
id_column_for_csv: Declare the identifier column name in the input CSVs. For ERA5 we require gid.
preprocess_shard: Convert one merged shard (CSV rows joined to locations) into canonical ELM columns. Steps include:
1. time filtering and optional “noleap” removal of Feb 29
2. ERA5→ELM unit conversions (e.g., J/hr/m² → W/m², m/hr → mm/s)
3. optional humidity computation (RH/Q) if temperature, dewpoint, and surface pressure are available
4. renaming raw ERA5 fields to ELM short names via a mapping
5. clipping canonical nonnegative variables
6. returning only required columns in a deterministic order
required_vars: Report the canonical ELM variable names required for the requested output format.
pack_params: Provide robust (add_offset, scale_factor) for a canonical ELM variable, given optional data to tune ranges.

Notes

Humidity computation is performed only when temperature_2m, dewpoint_temperature_2m, and surface_pressure are present.
Precipitation conversion uses m/hr → mm/s via division by 3.6.
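
The two conversions named above are plain rescalings, sketched here for scalar values. The function names are made up for this example; only the factors come from the text: hourly-accumulated energy divides by 3600 s, and 1 m/hr equals 1000 mm per 3600 s, which is a division by 3.6.

```python
SECONDS_PER_HOUR = 3600.0


def j_per_hr_m2_to_w_m2(energy_j_per_hr_m2):
    """Hourly accumulated energy flux (J/hr/m²) to mean power flux (W/m²)."""
    return energy_j_per_hr_m2 / SECONDS_PER_HOUR


def m_per_hr_to_mm_s(depth_m_per_hr):
    """Hourly precipitation depth (m/hr) to rate (mm/s): x * 1000 / 3600."""
    return depth_m_per_hr / 3.6


radiation_w_m2 = j_per_hr_m2_to_w_m2(3_600_000.0)
precip_mm_s = m_per_hr_to_mm_s(0.0036)
```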

DRIVER_TAG = 'ERA5'¶

SOURCE_NAME = 'ERA5-Land hourly reanalysis'¶

discover_files(csv_directory, calendar)[source]¶

Discover ERA5 CSV shards in a directory and infer the inclusive year range.

id_column_for_csv(df_csv, id_col)[source]¶

Return the required identifier column name expected in ERA5 CSV shards (“gid”).

pack_params(elm_var, data=None)[source]¶

Return (add_offset, scale_factor) used to pack a variable for NetCDF output.
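
For readers unfamiliar with NetCDF packing, the sketch below shows one common way to derive such a pair for 16-bit integers, where a stored integer unpacks as packed * scale_factor + add_offset. It is not the adapter's algorithm: taking the range from a plain (vmin, vmax) pair and reserving one integer code for a fill value are assumptions, and the helper signatures are hypothetical.

```python
def pack_params(vmin, vmax, nbits=16):
    """Return (add_offset, scale_factor) covering [vmin, vmax]."""
    if vmax < vmin:
        raise ValueError("vmax must be >= vmin")
    span = vmax - vmin
    # 2**nbits - 2 steps leaves one integer code free for a fill value.
    scale_factor = span / (2**nbits - 2) if span > 0 else 1.0
    add_offset = (vmax + vmin) / 2.0
    return add_offset, scale_factor


def pack(value, add_offset, scale_factor):
    """Float -> integer code."""
    return round((value - add_offset) / scale_factor)


def unpack(packed, add_offset, scale_factor):
    """Integer code -> float."""
    return packed * scale_factor + add_offset


# Illustrative temperature range in kelvin.
add_offset, scale_factor = pack_params(200.0, 330.0)
restored = unpack(pack(273.15, add_offset, scale_factor), add_offset, scale_factor)
```

The round-trip error is bounded by half a step (scale_factor / 2), which is why the choice of range directly sets the precision of the packed variable.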

preprocess_shard(df_merged, start_year, end_year, calendar, dformat)[source]¶

1. Filter time & handle no-leap
2. Apply ERA5 → ELM unit conversions
3. Compute humidities (if columns available)
4. Rename columns to canonical ELM names using RAW_TO_ELM
5. Clip canonical nonnegative variables
6. Return only the canonical vars required by elm_required_vars(dformat), plus LONGXY/LATIXY/time/gid/zone (coords/meta).
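
Two of these steps, the no-leap filter and the nonnegative clip, can be sketched on plain dictionaries. The helper names and the row format are assumptions for this example, and PRECTmms is used only as an illustrative column name; the real method operates on a merged DataFrame.

```python
from datetime import datetime


def drop_feb29(rows):
    """Remove Feb 29 records so the series fits a 365-day 'noleap' calendar."""
    return [r for r in rows if (r["time"].month, r["time"].day) != (2, 29)]


def clip_nonnegative(rows, columns):
    """Return copies of rows with the named columns floored at zero."""
    clipped = []
    for r in rows:
        new = dict(r)
        for col in columns:
            if col in new:
                new[col] = max(new[col], 0.0)
        clipped.append(new)
    return clipped


rows = [
    {"time": datetime(2000, 2, 28), "PRECTmms": 1.0},
    {"time": datetime(2000, 2, 29), "PRECTmms": -0.5},
    {"time": datetime(2000, 3, 1), "PRECTmms": -1e-9},
]
cleaned = clip_nonnegative(drop_feb29(rows), ["PRECTmms"])
```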

required_vars(dformat)[source]¶

Return the canonical ELM variables required for the requested output format.

class dapper.Exporter(adapter, src_path, *, domain, out_dir=None, calendar='noleap', dtime_resolution_hrs=1, dtime_units='days', dformat='BYPASS', append_attrs=None, chunks=None, include_vars=None, exclude_vars=None)[source]¶

Bases: object

Source-agnostic meteorological exporter.

This class orchestrates a two-pass pipeline that ingests time-sharded CSVs for many sites/cells, preprocesses them via a pluggable adapter, and writes ELM-ready NetCDF outputs in two layouts:

"cellset" – one NetCDF per variable with dims ('DTIME','lat','lon') (global packing; sparse lat/lon axes are OK).

"sites" – one directory per site; each directory contains one NetCDF per variable with dims ('n','DTIME') where n=1 (per-site packing).

Exporter is source-agnostic: all dataset-specific logic (file discovery, unit conversions, renaming to ELM short names, etc.) lives in an adapter that implements the BaseAdapter interface (e.g., an ERA5Adapter). The exporter handles staging (CSV → per-site parquet), global DTIME axis creation, packing scans, chunking, and NetCDF I/O.

Parameters:

adapter (BaseAdapter) – Implements: discover_files, normalize_locations, preprocess_shard, required_vars, and pack_params.
src_path (str or pathlib.Path) – Directory containing time-sharded CSV files for all sites/cells.
out_dir (str or pathlib.Path) – Destination directory for NetCDF outputs and temporary parquet shards.
df_loc (pandas.DataFrame) – Locations table with at least columns ["gid","lat","lon"]; optional "zone". The adapter’s normalize_locations: - validates columns, - adds "lon_0-360", - fills/validates "zone", - sorts for stable site order.
id_col (str, optional) – Kept for backward compatibility (unused when "gid" is assumed).
calendar ({"noleap","standard"}, default "noleap") – Calendar for numeric DTIME coordinate; Feb 29 filtered for “noleap”.
dtime_resolution_hrs (int, default 1) – Target time resolution in hours for the DTIME axis.
dtime_units ({"days","hours"}, default "days") – Units of the numeric DTIME coordinate (e.g., "days since YYYY-MM-DD HH:MM:SS").
domain (Domain)
dformat ({"BYPASS","DATM_MODE"}, default "BYPASS") – Target ELM format selector passed through to the adapter.
append_attrs (dict, optional) – Extra global NetCDF attributes to include in every file. The exporter also adds: export_mode ("cellset" or "sites") and pack_scope ("global" or "per-site").
chunks (tuple[int,…], optional) – Explicit NetCDF chunk sizes.
include_vars / exclude_vars (Iterable[str], optional) – Allow-/block-lists of ELM short names applied after preprocess. Meta columns {"gid","time","LATIXY","LONGXY","zone"} are always kept.
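
To make the calendar and dtime_units options concrete, the sketch below computes a numeric time coordinate on a 365-day calendar. It is a simplified illustration, not the exporter's code: the function name, the ref_year argument, and the choice of January 1 of that year as the reference instant are assumptions.

```python
from datetime import datetime

# Days elapsed before the first of each month in a 365-day year.
CUM_DAYS = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]


def noleap_dtime(t, ref_year, units="days"):
    """Time since ref_year-01-01 00:00:00 on a noleap calendar."""
    if (t.month, t.day) == (2, 29):
        raise ValueError("Feb 29 does not exist in a noleap calendar")
    whole_days = (t.year - ref_year) * 365 + CUM_DAYS[t.month - 1] + (t.day - 1)
    days = whole_days + (t.hour + t.minute / 60 + t.second / 3600) / 24
    if units == "days":
        return days
    if units == "hours":
        return days * 24
    raise ValueError(f"unsupported units: {units!r}")


march_first = noleap_dtime(datetime(2000, 3, 1), ref_year=2000)
noon = noleap_dtime(datetime(2000, 1, 1, 12), ref_year=2000, units="hours")
```

March 1 of a leap year lands on day 59 here rather than day 60, which is the reason Feb 29 records have to be filtered out before the axis is built.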

Side Effects

Creates a temporary directory of per-site parquet shards under out_dir.
Writes NetCDF files to out_dir in the chosen layout.
Writes a zone_mappings.txt file either at the root (cellset) or inside each site directory (sites).

Notes

Packing: global packing for cellset; per-site packing for sites.
Required columns: CSV shards and df_loc both use "gid"; CSVs include the adapter’s date/time column (renamed to "time" during preprocess).
Combined (lat/lon) layout: does not enforce regular grids; axes are the unique sorted lat/lon from df_loc (sparse OK).

run(*, pack_scope=None, filename=None, overwrite=False)[source]¶

Run the MET export for this exporter’s Domain.

The output layout is derived from Domain.mode:

sites: writes <run_dir>/<gid>/MET/{prefix_}{var}.nc and a per-site zone_mappings.txt (always zone=01, id=1).
cellset: writes <run_dir>/MET/{prefix_}{var}.nc and a single zone_mappings.txt covering all locations (zones taken from df_loc, default 1).

Parameters:

pack_scope – Optional packing strategy override. Defaults to per-site for sites and global for cellset outputs.
filename (str | None) – Optional filename prefix for output NetCDF files. If provided, each variable is written to {filename}_{var}.nc.
overwrite (bool) – If True, clears existing MET outputs before writing.
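
The file naming described above can be sketched as a pure path computation. met_output_paths is a hypothetical helper written for this example, and TBOT and FSDS are used only as illustrative variable names.

```python
from pathlib import PurePosixPath


def met_output_paths(run_dir, mode, gids, variables, filename=None):
    """List the NetCDF paths implied by the documented MET layout."""
    prefix = f"{filename}_" if filename else ""
    root = PurePosixPath(run_dir)
    if mode == "cellset":
        met_dirs = [root / "MET"]
    elif mode == "sites":
        met_dirs = [root / gid / "MET" for gid in gids]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # One file per variable in each MET directory.
    return [str(d / f"{prefix}{var}.nc") for d in met_dirs for var in variables]


cellset_files = met_output_paths("out/grp", "cellset", [], ["TBOT", "FSDS"], filename="era5")
site_files = met_output_paths("out/grp", "sites", ["s1", "s2"], ["TBOT"])
```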

Return type:

None

dapper.sample_e5lh(params, domain_name=None, skip_tasks=False)[source]¶

Submit Google Earth Engine (GEE) export tasks for ERA5-Land Hourly time series.

This prepares the ERA5-Land Hourly ImageCollection ("ECMWF/ERA5_LAND/HOURLY"), validates bands, ensures each geometry samples at least one pixel center (falling back to points when needed), batches the requested date range into N-year chunks, and (unless skip_tasks=True) starts one Drive export task per batch.

Parameters:

params (dict) –
Configuration dictionary. Expected keys (case-sensitive):
- start_date (str): Start date in "YYYY-MM-DD".
- end_date (str): End date in "YYYY-MM-DD".
- geometries: One of the following:
  - str: GEE asset ID for a FeatureCollection (e.g., "users/me/my_fc").
  - ee.FeatureCollection: a pre-constructed collection.
  - GeoDataFrame: must contain geometry and an ID column (see geometry_id_field).
  - AOI: dapper.domains.aoi.AOI instance; uses its internal GeoDataFrame.
  - Domain: dapper.domains.domain.Domain instance; uses Domain.to_geometries().
- geometry_id_field (str, optional): ID column in provided geometries. Defaults to "gid". Values are copied into the "gid" property on each feature.
- gee_bands (str or list[str]): Which ERA5-Land bands to export. One of:
  - "all": all available bands (from era5.ALL_BANDS)
  - "elm": bands required to derive ELM variables (from era5.REQUIRED_RAW_BANDS)
  - a list of band names validated against the collection
- gdrive_folder (str): Google Drive folder name where CSV chunks are written.
- job_name (str): Base name used to build per-batch export descriptions/filenames.
- gee_scale (str or int or float): Sampling scale in meters. If "native" (or a value < 11132), the native ERA5-Land scale of 11132 m is used.
- gee_years_per_task (int, optional): Years per export batch (default: 5).
The function sets params["gee_ic"] = "ECMWF/ERA5_LAND/HOURLY" internally.
domain_name (str, optional) – Optional name for the returned Domain.
skip_tasks (bool, default False) – If True, do everything except starting the GEE export tasks.
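
The effect of gee_years_per_task can be pictured with a small batching sketch. The helper below is hypothetical, and aligning batches to calendar years counted from the start year is an assumption; the actual chunk boundaries may be chosen differently.

```python
from datetime import date


def year_batches(start_date, end_date, years_per_task=5):
    """Split [start_date, end_date] into consecutive N-year chunks."""
    start = date.fromisoformat(start_date)  # raises ValueError if malformed
    end = date.fromisoformat(end_date)
    if end < start:
        raise ValueError("end_date precedes start_date")

    batches = []
    year = start.year
    while year <= end.year:
        last_year = min(year + years_per_task - 1, end.year)
        batch_start = max(start, date(year, 1, 1))
        batch_end = min(end, date(last_year, 12, 31))
        batches.append((batch_start.isoformat(), batch_end.isoformat()))
        year = last_year + 1
    return batches


short_range = year_batches("1950-01-01", "1951-12-31", years_per_task=5)
long_range = year_batches("1950-06-01", "1961-03-15", years_per_task=5)
```

Under these assumptions a two-year request with five years per task yields a single batch, while the longer range splits into three.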

Returns:

Domain describing the sampling locations. The underlying GeoDataFrame contains at least "gid", "lon", and "lat".

Return type:

Domain

Notes

Call ee.Initialize() before using this function.
CSV selectors include ["gid", "date"] + params["gee_bands"].
Dates are derived from system:time_start and formatted in UTC.

Raises:

KeyError – If required keys are missing from params.
ValueError – If dates are malformed or geometries is an unsupported type.
TypeError – If gee_scale is not "native" and not numeric.
ee.EEException – Propagated Earth Engine errors (e.g., authentication, export quota).
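
The gee_scale rule and its TypeError can be sketched as follows. resolve_scale is a hypothetical name, and the sketch covers only the behavior stated in the parameter description: "native" or any value below 11132 resolves to 11132 m, and non-numeric values are rejected.

```python
NATIVE_SCALE_M = 11132  # native ERA5-Land scale in meters, per the text above


def resolve_scale(gee_scale):
    """Return the sampling scale in meters that would actually be used."""
    if gee_scale == "native":
        return NATIVE_SCALE_M
    if not isinstance(gee_scale, (int, float)):
        raise TypeError('gee_scale must be "native" or a number')
    # Requests finer than the native grid fall back to the native scale.
    return max(gee_scale, NATIVE_SCALE_M)


native = resolve_scale("native")
too_fine = resolve_scale(5000)
coarser = resolve_scale(20000)
```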

Examples

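import ee
from dapper import sample_e5lh

ee.Initialize()  # Earth Engine must be initialized before sampling

params = {
    "start_date": "1950-01-01",
    "end_date": "1951-12-31",
    "geometries": "users/me/my_sites_fc",
    "geometry_id_field": "gid",
    "gee_bands": "elm",
    "gee_scale": "native",
    "gee_years_per_task": 5,
    "gdrive_folder": "era5_exports",
    "job_name": "era5l_sites",
}
domain = sample_e5lh(params)  # starts one Drive export task per batch
domain.gdf.head()  # sampling locations: at least gid, lon, lat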