dapper package

Module contents

dapper public package interface.

The supported public import surface is the set of names documented on this page: Domain, ERA5Adapter, Exporter, and sample_e5lh.

Other submodules may be importable, but are not considered part of the stable “from dapper import …” API.

class dapper.Domain(name, mode, provided, support, cells, topounits=None, topounits_dim_name='topounit', topounits_id_col='topounit_id', topounits_gid_col='gid', met_support=None, topo_support=None, domain_nc=None, path_out=None, run_group=None)[source]

Bases: object

Canonical spatial object passed through the dapper pipeline.

Geometry views:
  • provided: exactly what the user supplied (provenance/plotting)

  • support : what sampling SHOULD use (may be simplified/processed later)

  • cells : what ELM RUNS on (site points now; per-cell geometries for cellset)

Prepared sampling views (set during the relevant pipeline step; not on init):
  • met_support

  • topo_support

Mode:
  • sites : one set of outputs per row (exporters loop internally)

  • cellset : one set of outputs total, including all rows

Parameters:
  • name (str)

  • mode (DomainMode)

  • provided (gpd.GeoDataFrame)

  • support (gpd.GeoDataFrame)

  • cells (gpd.GeoDataFrame)

  • topounits (gpd.GeoDataFrame | None)

  • topounits_dim_name (str)

  • topounits_id_col (str)

  • topounits_gid_col (str)

  • met_support (Optional[gpd.GeoDataFrame])

  • topo_support (Optional[gpd.GeoDataFrame])

  • domain_nc (Optional[Path])

  • path_out (Optional[Path])

  • run_group (Optional[str])

cells: gpd.GeoDataFrame
copy(**updates)[source]

Return a shallow copy of this Domain with updated fields.

Return type:

Domain
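The copy-with-updates pattern can be sketched as follows. This assumes Domain behaves like a dataclass (the listing shows typed fields with defaults, but the implementation is not confirmed here); MiniDomain is a hypothetical stand-in, not the dapper class.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class MiniDomain:
    """Hypothetical stand-in for Domain with a handful of fields."""
    name: str
    mode: str
    cells: list                      # stand-in for the cells GeoDataFrame
    run_group: Optional[str] = None

    def copy(self, **updates) -> "MiniDomain":
        # Shallow copy: fields that are not updated still reference
        # the same underlying objects as the original.
        return replace(self, **updates)


base = MiniDomain(name="demo", mode="sites", cells=[{"gid": "a"}])
renamed = base.copy(name="demo_v2")
```

Because the copy is shallow, renamed.cells and base.cells are the same object in this sketch; mutate tables in place only if that sharing is intended.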

domain_nc: Optional[Path] = None
elm_latlon_layout(decimals=6, use_lon_0360=True)[source]

Compute lat/lon axes and a gid -> (iy, ix) index map for ELM-style lat/lon layouts. Useful mainly when your cells actually lie on a lat/lon lattice (dense or sparse).

Return type:

tuple[ndarray, ndarray, dict[str, tuple[int, int]]]

Parameters:
  • decimals (int)

  • use_lon_0360 (bool)
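As an illustration of what such a layout contains, here is a minimal pure-Python sketch. The helper name latlon_layout and the (gid, lon, lat) tuple input are assumptions made for this example; the real method works on the cells table and returns NumPy arrays rather than lists.

```python
def latlon_layout(cells, decimals=6, use_lon_0360=True):
    """Sketch: return (lats, lons, {gid: (iy, ix)}) for rows of (gid, lon, lat)."""
    rounded = []
    for gid, lon, lat in cells:
        if use_lon_0360:
            lon = lon % 360.0  # map negative longitudes onto [0, 360)
        rounded.append((str(gid), round(lon, decimals), round(lat, decimals)))

    # Axes are the unique sorted coordinates; sparse lattices simply
    # leave some (iy, ix) positions without a gid.
    lats = sorted({lat for _, _, lat in rounded})
    lons = sorted({lon for _, lon, _ in rounded})
    lat_index = {value: i for i, value in enumerate(lats)}
    lon_index = {value: i for i, value in enumerate(lons)}

    index = {gid: (lat_index[lat], lon_index[lon]) for gid, lon, lat in rounded}
    return lats, lons, index


cells = [("a", -150.0, 70.0), ("b", -149.5, 70.0), ("c", -150.0, 70.5)]
lats, lons, index = latlon_layout(cells)
```

Rounding before taking unique values is what keeps nearly identical coordinates from producing duplicate axis entries.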

ensure_cells_lon_lat()[source]

Ensure cells contains lon and lat columns, derived from geometry if needed.

Return type:

Domain

ensure_output_dirs(*, met=True)[source]

Create output directories implied by this Domain (and its runs, if mode='sites'). Does not write any files.

Return type:

None

Parameters:

met (bool)

export_domain(*, filename='domain.nc', out_dir=None, overwrite=False, append_attrs=None, **kwargs)[source]

Export ELM domain NetCDF(s) for this Domain.

Output layout:
  • mode='cellset': <run_dir>/domain.nc

  • mode='sites': <run_dir>/<gid>/domain.nc

Return type:

dict[str, Path]

Returns:

dict[run_id, output_path]

Parameters:
  • filename (str)

  • out_dir (str | Path | None)

  • overwrite (bool)

  • append_attrs (dict | None)

export_landuse(src_path, *, filename='landuse_timeseries.nc', out_dir=None, overwrite=False, append_attrs=None, **kwargs)[source]

Export landuse timeseries NetCDF(s) for this Domain.

Return type:

dict[str, Path]

Parameters:
  • src_path (str | Path)

  • filename (str)

  • out_dir (str | Path | None)

  • overwrite (bool)

  • append_attrs (dict | None)

export_met(src_path, *, adapter, out_dir=None, filename=None, overwrite=False, append_attrs=None, pack_scope=None, **kwargs)[source]

Export meteorological forcing NetCDF(s) for this Domain.

Parameters:
  • src_path (Path-like) – Directory containing input CSV(s) for the adapter.

  • adapter (object) – Adapter instance implementing the met adapter protocol.

  • out_dir (Path-like, optional) – Override output root. Defaults to Domain.run_dir.

  • filename (str, optional) – Optional filename prefix for output NetCDFs. If provided, each variable is written to '{filename}_{var}.nc'.

  • overwrite (bool) – If False, raises if MET output(s) already exist.

  • append_attrs (dict | None)

Return type:

dict[run_id, met_dir]

export_surface(src_path, *, filename='surfdata.nc', out_dir=None, overwrite=False, append_attrs=None, **kwargs)[source]

Export surface NetCDF(s) for this Domain.

Return type:

dict[str, Path]

Parameters:
  • src_path (str | Path)

  • filename (str)

  • out_dir (str | Path | None)

  • overwrite (bool)

  • append_attrs (dict | None)

classmethod from_elm_domain(path_nc, *, name=None, mask_name='mask', frac_name='frac', frac_threshold=0.0, path_out=None, run_group=None)[source]

Build a Domain from an ELM domain NetCDF. This naturally produces a ‘cellset’.

Return type:

Domain

Parameters:
  • path_nc (str | Path)

  • name (str | None)

  • mask_name (str)

  • frac_name (str)

  • frac_threshold (float)

  • path_out (str | Path | None)

  • run_group (str | None)

classmethod from_file(path, *, name=None, layer=None, id_col='gid', mode=None, cell_kind=None, path_out=None, run_group=None)[source]

Load a geospatial file (e.g., GeoPackage, Shapefile) and construct a Domain.

Return type:

Domain

Parameters:
  • path (str | Path)

  • name (str | None)

  • layer (str | None)

  • id_col (str)

  • mode (Literal['sites', 'cellset'] | None)

  • cell_kind (Literal['site_points', 'as_provided'] | None)

  • path_out (str | Path | None)

  • run_group (str | None)

classmethod from_gdf(gdf, **kwargs)[source]

Alias for Domain.from_provided().

Return type:

Domain

Parameters:

gdf (geopandas.GeoDataFrame | DataFrame)

classmethod from_geometry(geometry, *, gid='site', name='domain', mode='cellset', cell_kind='site_points', path_out=None, run_group=None)[source]

Construct a single-feature Domain from a shapely geometry.

Return type:

Domain

Parameters:
  • geometry (shapely.geometry.base.BaseGeometry)

  • gid (str)

  • name (str)

  • mode (Literal['sites', 'cellset'])

  • cell_kind (Literal['site_points', 'as_provided'])

  • path_out (str | Path | None)

  • run_group (str | None)

classmethod from_provided(provided, *, name='domain', mode=None, id_col='gid', support=None, cells=None, cell_kind=None, domain_nc=None, path_out=None, run_group=None)[source]

Construct a Domain from a provided geometry table.

The input can be a GeoDataFrame (preferred) or a DataFrame with a geometry column. Use mode to choose between a single-run cellset or a multi-run sites container. If cells is not provided, cells are derived from the provided/support geometry depending on cell_kind.

Return type:

Domain

Parameters:
  • provided (geopandas.GeoDataFrame | DataFrame)

  • name (str)

  • mode (Literal['sites', 'cellset'] | None)

  • id_col (str)

  • support (geopandas.GeoDataFrame | DataFrame | None)

  • cells (geopandas.GeoDataFrame | DataFrame | None)

  • cell_kind (Literal['site_points', 'as_provided'] | None)

  • domain_nc (str | Path | None)

  • path_out (str | Path | None)

  • run_group (str | None)

property gdf: geopandas.GeoDataFrame

Backwards-compat alias for older code.

Historically dapper exposed a single GeoDataFrame on the domain object (often a df_loc-style table with columns like gid/lon/lat/geometry). The closest equivalent is cells (the run-level geometry table).

property gids: list[str]

List of gid values (as strings) for the current cell table.

property group_name: str

Name of the output group directory for this Domain.

has_topounits()[source]

Return True if this Domain has a non-empty topounits table attached.

Return type:

bool

iter_runs()[source]

Yield (run_id, run_domain) where run_domain is always a single-run 'cellset' Domain.
  • mode='cellset' -> yields exactly one run (self)

  • mode='sites' -> yields one run per gid (single-row Domains)
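The iteration contract can be sketched with a small stand-in class. SketchDomain is hypothetical, and using the domain name as the run_id in cellset mode is an assumption; only the one-run versus one-run-per-gid behavior comes from the description above.

```python
from dataclasses import dataclass, replace
from typing import Iterator, List, Tuple


@dataclass
class SketchDomain:
    """Hypothetical stand-in for Domain with just enough state to iterate."""
    name: str
    mode: str          # "sites" or "cellset"
    gids: List[str]

    def iter_runs(self) -> Iterator[Tuple[str, "SketchDomain"]]:
        if self.mode == "cellset":
            # One run total: the domain itself.
            yield self.name, self
            return
        for gid in self.gids:
            # One single-row cellset per gid.
            yield gid, replace(self, name=gid, mode="cellset", gids=[gid])


sites = SketchDomain(name="arctic", mode="sites", gids=["s1", "s2"])
runs = list(sites.iter_runs())
```

Callers can therefore treat every yielded run_domain uniformly as a cellset, regardless of the mode of the parent.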

make_topounits(*, binning, sources=None, combine='cartesian', combine_order=None, max_topounits=256, dem_source='arcticdem', export_scale='native', min_patch_pixels=None, target_pixels_per_topounit=500, target_scale=None, verbose=False, allow_slow_ncells=25)[source]

Convenience wrapper that computes topounits for this Domain and returns a new Domain with domain.topounits attached.

  • If this Domain has multiple rows (cellset/sites), it computes topounits per-row (per gid) using dapper.topounit.topomake.make_topounits_for_domain.

  • If this Domain has one row, it still goes through the same path (safe + consistent).

Users should not need to deal with ee.Geometry vs ee.Feature vs FeatureCollection here.

Return type:

Domain

Parameters:
  • binning (dict)

  • sources (list[str] | None)

  • combine (str)

  • max_topounits (int)

  • dem_source (str)

  • export_scale (str)

  • target_pixels_per_topounit (int)

  • target_scale (float | None)

  • verbose (bool)

  • allow_slow_ncells (int)

property met_dir: Path

Convenience property for Domain.path_met_dir().

met_support: Optional[gpd.GeoDataFrame] = None
mode: DomainMode
name: str
path_domain_nc(filename='domain.nc', run_id=None)[source]

Default output path for the domain NetCDF for this Domain (or a specific run_id).

Return type:

Path

Parameters:
  • filename (str)

  • run_id (str | None)

path_landuse_nc(filename='landuse_timeseries.nc', run_id=None)[source]

Default output path for the landuse NetCDF for this Domain (or a specific run_id).

Return type:

Path

Parameters:
  • filename (str)

  • run_id (str | None)

path_met_dir(run_id=None)[source]

Path to the MET output directory for this Domain (or a specific run_id).

Return type:

Path

Parameters:

run_id (str | None)

path_out: Optional[Path] = None
path_surface_nc(filename='surfdata.nc', run_id=None)[source]

Default output path for the surface NetCDF for this Domain (or a specific run_id).

Return type:

Path

Parameters:
  • filename (str)

  • run_id (str | None)

path_zone_mappings(filename='zone_mappings.txt', run_id=None)[source]

Default output path for zone_mappings.txt (under the MET directory).

Return type:

Path

Parameters:
  • filename (str)

  • run_id (str | None)

provided: gpd.GeoDataFrame
rep_points(*, source='support', step=None)[source]

Representative points for a given geometry view. If source='support' and step is provided, uses the prepared support for that step.

Return type:

GeoDataFrame

Parameters:
  • source (Literal['provided', 'support', 'cells'])

  • step (Literal['met', 'topounits'] | None)

property run_dir: Path

Directory holding the main run outputs for this Domain instance.

For top-level cellset or top-level sites container:

path_out/<group_name>

For per-site/per-cellset run domains (created by iter_runs in sites mode):

path_out/<group_name>/<domain.name>
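The two layouts can be sketched with pathlib. The helper sketch_run_dir and its arguments are hypothetical names for illustration; only the directory patterns come from the description above.

```python
from pathlib import Path
from typing import Optional


def sketch_run_dir(path_out, group_name: str, run_name: Optional[str] = None) -> Path:
    """Top-level domains map to path_out/<group_name>; per-run domains add <name>."""
    root = Path(path_out) / group_name
    return root if run_name is None else root / run_name


top_level = sketch_run_dir("out", "arctic")
per_site = sketch_run_dir("out", "arctic", "s1")
site_domain_nc = per_site / "domain.nc"   # matches <run_dir>/<gid>/domain.nc in sites mode
```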

run_group: Optional[str] = None
simplify_support(tolerance_m, *, step, preserve_topology=True, equal_area_epsg=6933)[source]

Simplify the support geometry for a step and store it as met_support/topo_support. Does NOT modify provided/support/cells.

Return type:

Domain

Parameters:
  • tolerance_m (float)

  • step (Literal['met', 'topounits'])

  • preserve_topology (bool)

  • equal_area_epsg (int)

support: gpd.GeoDataFrame
support_for(*, step=None)[source]

Return the geometry set that should be used for the given step.
  • step=None -> support

  • step="met" -> met_support if set else support

  • step="topounits" -> topo_support if set else support

Return type:

GeoDataFrame

Parameters:

step (Literal['met', 'topounits'] | None)
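The fallback rule for step-specific support can be sketched as a plain function. pick_support is a hypothetical helper, and raising ValueError for an unknown step is an assumption, not documented behavior.

```python
from typing import Optional


def pick_support(support, met_support=None, topo_support=None, step: Optional[str] = None):
    """Return the step-specific support when present, else the base support."""
    if step is None:
        return support
    prepared = {"met": met_support, "topounits": topo_support}
    if step not in prepared:
        raise ValueError(f"unknown step: {step!r}")  # assumed error handling
    chosen = prepared[step]
    return support if chosen is None else chosen


base, simplified = "base-geometry", "simplified-geometry"
for_met = pick_support(base, met_support=simplified, step="met")
for_topo = pick_support(base, met_support=simplified, step="topounits")
```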

to_df_loc(*, lon_col='lon', lat_col='lat', weight_col='weight', frac_col='frac', default_weight=1.0)[source]

Derived location/weight table from cells (internal glue; users shouldn’t need to touch).

Return type:

DataFrame

Parameters:
  • lon_col (str)

  • lat_col (str)

  • weight_col (str)

  • frac_col (str)

  • default_weight (float)

topo_support: Optional[gpd.GeoDataFrame] = None
topounits: gpd.GeoDataFrame | None = None
topounits_dim_name: str = 'topounit'
topounits_for_gid(gid)[source]

Return the topounits subset for a single gid (or None if no topounits).

Parameters:

gid (str)

topounits_gid_col: str = 'gid'
topounits_id_col: str = 'topounit_id'
with_step_support(step, gdf)[source]

Attach a step-specific support GeoDataFrame (for “met” or “topounits”).

Return type:

Domain

Parameters:
  • step (Literal['met', 'topounits'])

  • gdf (geopandas.GeoDataFrame)

with_topounits(topounits, *, id_col='band_name', gid_col='gid', dim_name='topounit')[source]

Attach topounits GeoDataFrame to this Domain.
  • Ensures a stable id column name (self.topounits_id_col == 'topounit_id')

  • Ensures gid linkage column exists (self.topounits_gid_col)

Return type:

Domain

Parameters:
  • topounits (geopandas.GeoDataFrame)

  • id_col (str)

  • gid_col (str)

  • dim_name (str)

class dapper.ERA5Adapter[source]

Bases: BaseAdapter

ERA5-Land → ELM adapter.

This adapter implements the BaseAdapter interface for ERA5-Land hourly data. It handles source-specific details (file discovery, unit conversions, humidity diagnostics, renaming to ELM short names, and nonnegativity enforcement) so that the upstream Exporter can remain source-agnostic.

Responsibilities

  • discover_files: Find CSV shards in a directory and infer the overall (start_year, end_year) using their date coverage.

  • normalize_locations: Validate and normalize the locations table (adds lon_0-360, ensures/creates zone, stable sorting).

  • id_column_for_csv: Declare the identifier column name in the input CSVs. For ERA5 we require gid.

  • preprocess_shard: Convert one merged shard (CSV rows joined to locations) into canonical ELM columns. Steps include:

    1. time filtering and optional “noleap” removal of Feb 29

    2. ERA5→ELM unit conversions (e.g., J/hr/m² → W/m², m/hr → mm/s)

    3. optional humidity computation (RH/Q) if temperature, dewpoint, and surface pressure are available

    4. renaming raw ERA5 fields to ELM short names via a mapping

    5. clipping canonical nonnegative variables

    6. returning only required columns in a deterministic order

  • required_vars: Report the canonical ELM variable names required for the requested output format.

  • pack_params: Provide robust (add_offset, scale_factor) for a canonical ELM variable, given optional data to tune ranges.

Notes

  • Humidity computation is performed only when temperature_2m, dewpoint_temperature_2m, and surface_pressure are present.

  • Precipitation is converted from m/hr to mm/s by dividing by 3.6 (1 m/hr = 1000 mm / 3600 s).
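The two unit conversions named above reduce to simple arithmetic. The function names below are hypothetical; the factors follow directly from the unit definitions for hourly accumulations.

```python
SECONDS_PER_HOUR = 3600.0


def j_per_hr_m2_to_w_m2(energy_j_per_m2: float) -> float:
    """Hourly accumulated energy (J/m² over one hour) to mean flux (W/m²)."""
    return energy_j_per_m2 / SECONDS_PER_HOUR


def m_per_hr_to_mm_per_s(depth_m: float) -> float:
    """Hourly accumulated depth (m over one hour) to mean rate (mm/s)."""
    # 1 m/hr = 1000 mm / 3600 s, i.e. divide by 3.6.
    return depth_m / 3.6


flux = j_per_hr_m2_to_w_m2(720_000.0)   # 200.0 W/m²
rate = m_per_hr_to_mm_per_s(0.0036)     # approximately 0.001 mm/s
```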

DRIVER_TAG = 'ERA5'
SOURCE_NAME = 'ERA5-Land hourly reanalysis'
discover_files(csv_directory, calendar)[source]

Discover ERA5 CSV shards in a directory and infer the inclusive year range.

id_column_for_csv(df_csv, id_col)[source]

Return the required identifier column name expected in ERA5 CSV shards (“gid”).

pack_params(elm_var, data=None)[source]

Return (add_offset, scale_factor) used to pack a variable for NetCDF output.
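For readers unfamiliar with NetCDF packing, the sketch below shows one common way to derive such a pair for 16-bit integers, using the CF convention unpacked = packed * scale_factor + add_offset. It is not the adapter's actual algorithm; the helper names and the choice to centre the range on zero are assumptions.

```python
def sketch_pack_params(vmin: float, vmax: float, nbits: int = 16):
    """Return (add_offset, scale_factor) mapping [vmin, vmax] onto signed ints."""
    max_int = 2 ** (nbits - 1) - 1          # 32767 for int16; lowest code left free for a fill value
    if vmax <= vmin:
        return float(vmin), 1.0             # constant field: any positive scale works
    scale_factor = (vmax - vmin) / (2 * max_int)
    add_offset = (vmax + vmin) / 2.0
    return add_offset, scale_factor


def pack(value: float, add_offset: float, scale_factor: float) -> int:
    return round((value - add_offset) / scale_factor)


def unpack(code: int, add_offset: float, scale_factor: float) -> float:
    return code * scale_factor + add_offset


offset, scale = sketch_pack_params(200.0, 330.0)   # e.g. a temperature range in K
code = pack(273.15, offset, scale)
restored = unpack(code, offset, scale)
```

The round trip loses at most half a scale step, which is why a well-chosen value range matters: a wider range means coarser precision.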

preprocess_shard(df_merged, start_year, end_year, calendar, dformat)[source]

Convert one merged shard (CSV rows joined to locations) into canonical ELM columns:
  1. Filter time & handle no-leap

  2. Apply ERA5 → ELM unit conversions

  3. Compute humidities (if columns available)

  4. Rename columns to canonical ELM names using RAW_TO_ELM

  5. Clip canonical nonnegative variables

  6. Return only the canonical vars required by elm_required_vars(dformat), plus LONGXY/LATIXY/time/gid/zone (coords/meta).
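Steps 1 and 5 above can be sketched on plain dictionaries. The real method operates on a merged DataFrame; the function name and the NONNEGATIVE set below are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical subset of variables that must not go negative.
NONNEGATIVE = {"PRECTmms", "FSDS"}


def filter_and_clip(rows, start_year, end_year, calendar="noleap"):
    """Keep rows in [start_year, end_year], drop Feb 29 for noleap, clip negatives."""
    out = []
    for row in rows:
        t = row["time"]
        if not (start_year <= t.year <= end_year):
            continue
        if calendar == "noleap" and (t.month, t.day) == (2, 29):
            continue
        cleaned = dict(row)  # copy so the input rows stay untouched
        for name in NONNEGATIVE:
            if name in cleaned:
                cleaned[name] = max(cleaned[name], 0.0)
        out.append(cleaned)
    return out


rows = [
    {"time": datetime(2020, 2, 28), "PRECTmms": -1e-9},
    {"time": datetime(2020, 2, 29), "PRECTmms": 0.002},
    {"time": datetime(2021, 1, 1), "PRECTmms": 0.001},
]
kept = filter_and_clip(rows, 2020, 2020)
```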

required_vars(dformat)[source]

Return the canonical ELM variables required for the requested output format.

class dapper.Exporter(adapter, src_path, *, domain, out_dir=None, calendar='noleap', dtime_resolution_hrs=1, dtime_units='days', dformat='BYPASS', append_attrs=None, chunks=None, include_vars=None, exclude_vars=None)[source]

Bases: object

Source-agnostic meteorological exporter.

This class orchestrates a two-pass pipeline that ingests time-sharded CSVs for many sites/cells, preprocesses them via a pluggable adapter, and writes ELM-ready NetCDF outputs in two layouts:

  1. "cellset" – one NetCDF per variable with dims ('DTIME','lat','lon') (global packing; sparse lat/lon axes are OK).

  2. "sites" – one directory per site; each directory contains one NetCDF per variable with dims ('n','DTIME') where n=1 (per-site packing).

Exporter is source-agnostic: all dataset-specific logic (file discovery, unit conversions, renaming to ELM short names, etc.) lives in an adapter that implements the BaseAdapter interface (e.g., an ERA5Adapter). The exporter handles staging (CSV → per-site parquet), global DTIME axis creation, packing scans, chunking, and NetCDF I/O.

Parameters:
  • adapter (BaseAdapter) – Implements: discover_files, normalize_locations, preprocess_shard, required_vars, and pack_params.

  • csv_directory (str or pathlib.Path) – Directory containing time-sharded CSV files for all sites/cells.

  • out_dir (str or pathlib.Path) – Destination directory for NetCDF outputs and temporary parquet shards.

  • df_loc (pandas.DataFrame) – Locations table with at least columns ["gid","lat","lon"]; optional "zone". The adapter's normalize_locations validates columns, adds "lon_0-360", fills/validates "zone", and sorts for stable site order.

  • id_col (str, optional) – Kept for backward compatibility (unused when "gid" is assumed).

  • calendar ({"noleap","standard"}, default "noleap") – Calendar for numeric DTIME coordinate; Feb 29 filtered for “noleap”.

  • dtime_resolution_hrs (int, default 1) – Target time resolution in hours for the DTIME axis.

  • dtime_units ({"days","hours"}, default "days") – Units of the numeric DTIME coordinate (e.g., "days since YYYY-MM-DD HH:MM:SS").

  • domain (Domain)

  • dformat ({"BYPASS","DATM_MODE"}, default "BYPASS") – Target ELM format selector passed through to the adapter.

  • append_attrs (dict, optional) – Extra global NetCDF attributes to include in every file. The exporter also adds export_mode ("cellset" or "sites") and pack_scope ("global" or "per-site").

  • chunks (tuple[int, …], optional) – Explicit NetCDF chunk sizes.

  • include_vars / exclude_vars (Iterable[str], optional) – Allow-/block-lists of ELM short names applied after preprocess. Meta columns {"gid","time","LATIXY","LONGXY","zone"} are always kept.

Side Effects

  • Creates a temporary directory of per-site parquet shards under out_dir.

  • Writes NetCDF files to out_dir in the chosen layout.

  • Writes a zone_mappings.txt file either at the root (cellset) or inside each site directory (sites).

Notes

  • Packing: global packing for cellset; per-site packing for sites.

  • Required columns: CSV shards and df_loc both use "gid"; CSVs include the adapter’s date/time column (renamed to "time" during preprocess).

  • Combined (lat/lon) layout: does not enforce regular grids; axes are the unique sorted lat/lon from df_loc (sparse OK).
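To make the calendar, dtime_resolution_hrs, and dtime_units parameters concrete, here is a minimal sketch of a numeric time axis on a 365-day calendar. The function name is hypothetical, and starting the axis at zero at the beginning of the first year is an assumption; the exporter's actual reference date is not documented here.

```python
def noleap_dtime_axis(start_year, end_year, resolution_hrs=1, units="days"):
    """Numeric offsets from the start of start_year on a 365-day calendar."""
    n_years = end_year - start_year + 1          # inclusive year range
    n_steps = n_years * 365 * 24 // resolution_hrs
    per_step = resolution_hrs / 24.0 if units == "days" else float(resolution_hrs)
    return [i * per_step for i in range(n_steps)]


axis = noleap_dtime_axis(2000, 2000, resolution_hrs=3)
```

With a noleap calendar every year contributes exactly 365 days, so 2000 (a leap year in the standard calendar) still yields 2920 three-hourly steps.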

run(*, pack_scope=None, filename=None, overwrite=False)[source]

Run the MET export for this exporter’s Domain.

The output layout is derived from Domain.mode:
  • sites: writes <run_dir>/<gid>/MET/{prefix_}{var}.nc and a per-site zone_mappings.txt (always zone=01, id=1).

  • cellset: writes <run_dir>/MET/{prefix_}{var}.nc and a single zone_mappings.txt covering all locations (zones taken from df_loc, default 1).
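The naming scheme for the two layouts can be sketched as a path builder. met_output_path is a hypothetical helper and "TBOT" is used only as an example variable name; the directory and filename patterns are the ones listed above.

```python
from pathlib import Path
from typing import Optional


def met_output_path(run_dir, var: str, *, mode: str,
                    gid: Optional[str] = None,
                    filename: Optional[str] = None) -> Path:
    """Build <run_dir>[/<gid>]/MET/[<filename>_]<var>.nc for the given mode."""
    base = Path(run_dir)
    if mode == "sites":
        if gid is None:
            raise ValueError("sites mode needs a gid")
        base = base / gid
    stem = f"{filename}_{var}" if filename else var
    return base / "MET" / f"{stem}.nc"


site_file = met_output_path("out/arctic", "TBOT", mode="sites", gid="s1")
grid_file = met_output_path("out/arctic", "TBOT", mode="cellset", filename="era5")
```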

Parameters:
  • pack_scope – Optional packing strategy override. Defaults to per-site for sites and global for cellset outputs.

  • filename (str | None) – Optional filename prefix for output NetCDF files. If provided, each variable is written to {filename}_{var}.nc.

  • overwrite (bool) – If True, clears existing MET outputs before writing.

Return type:

None

dapper.sample_e5lh(params, domain_name=None, skip_tasks=False)[source]

Submit Google Earth Engine (GEE) export tasks for ERA5-Land Hourly time series.

This prepares the ERA5-Land Hourly ImageCollection ("ECMWF/ERA5_LAND/HOURLY"), validates bands, ensures each geometry samples at least one pixel center (falling back to points when needed), batches the requested date range into N-year chunks, and (unless skip_tasks=True) starts one Drive export task per batch.

Parameters:
  • params (dict) –

    Configuration dictionary. Expected keys (case-sensitive):

    • start_date (str): Start date in "YYYY-MM-DD".

    • end_date (str): End date in "YYYY-MM-DD".

    • geometries: One of the following:

      • str: GEE asset ID for a FeatureCollection (e.g., "users/me/my_fc").

      • ee.FeatureCollection: a pre-constructed collection.

      • GeoDataFrame: must contain geometry and an ID column (see geometry_id_field).

      • AOI: dapper.domains.aoi.AOI instance; uses its internal GeoDataFrame.

      • Domain: dapper.domains.domain.Domain instance; uses Domain.to_geometries().

    • geometry_id_field (str, optional): ID column in provided geometries. Defaults to "gid". Values are copied into the "gid" property on each feature.

    • gee_bands (str or list[str]): Which ERA5-Land bands to export. One of:

      • "all": all available bands (from era5.ALL_BANDS)

      • "elm": bands required to derive ELM variables (from era5.REQUIRED_RAW_BANDS)

      • a list of band names validated against the collection

    • gdrive_folder (str): Google Drive folder name where CSV chunks are written.

    • job_name (str): Base name used to build per-batch export descriptions/filenames.

    • gee_scale (str or int or float): Sampling scale in meters. If "native" (or a value < 11132), the native ERA5-Land scale of 11132 m is used.

    • gee_years_per_task (int, optional): Years per export batch (default: 5).

    The function sets params["gee_ic"] = "ECMWF/ERA5_LAND/HOURLY" internally.

  • domain_name (str, optional) – Optional name for the returned Domain.

  • skip_tasks (bool, default False) – If True, do everything except starting the GEE export tasks.
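The effect of gee_years_per_task can be illustrated with a small date-splitting sketch. How dapper actually places the chunk boundaries is not documented here; aligning chunks to calendar years, and the helper name year_batches, are assumptions.

```python
from datetime import date


def year_batches(start_date: str, end_date: str, years_per_task: int = 5):
    """Split [start_date, end_date] into consecutive chunks of at most N calendar years."""
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if end < start:
        raise ValueError("end_date precedes start_date")
    batches = []
    cursor = start
    while cursor <= end:
        # Each chunk ends on Dec 31 of its Nth calendar year, capped at end.
        chunk_end = min(date(cursor.year + years_per_task - 1, 12, 31), end)
        batches.append((cursor, chunk_end))
        cursor = date(chunk_end.year + 1, 1, 1)
    return batches


batches = year_batches("1950-01-01", "1961-06-30", years_per_task=5)
```

Under these assumptions a two-year request with the default of 5 years per task produces a single batch, and therefore a single export task.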

Returns:

Domain describing the sampling locations. The underlying GeoDataFrame contains at least "gid", "lon", and "lat".

Return type:

Domain

Notes

  • Call ee.Initialize() before using this function.

  • CSV selectors include ["gid", "date"] + params["gee_bands"].

  • Dates are derived from system:time_start and formatted in UTC.

Raises:
  • KeyError – If required keys are missing from params.

  • ValueError – If dates are malformed or geometries is an unsupported type.

  • TypeError – If gee_scale is not "native" and not numeric.

  • ee.EEException – Propagated Earth Engine errors (e.g., authentication, export quota).
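The gee_scale rule and its TypeError can be sketched as follows. resolve_gee_scale is a hypothetical helper written from the parameter description above, not the function dapper uses internally.

```python
NATIVE_SCALE_M = 11132  # native ERA5-Land scale quoted in the parameter description


def resolve_gee_scale(gee_scale):
    """Return the sampling scale in meters, never finer than the native scale."""
    if isinstance(gee_scale, str):
        if gee_scale == "native":
            return NATIVE_SCALE_M
        raise TypeError(f"gee_scale must be 'native' or numeric, got {gee_scale!r}")
    if not isinstance(gee_scale, (int, float)):
        raise TypeError(f"gee_scale must be 'native' or numeric, got {type(gee_scale).__name__}")
    # Values below the native scale fall back to the native scale.
    return max(gee_scale, NATIVE_SCALE_M)


native = resolve_gee_scale("native")
too_fine = resolve_gee_scale(500)
coarse = resolve_gee_scale(25000)
```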

Examples

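import ee
from dapper import sample_e5lh

ee.Initialize()  # must be called before sample_e5lh (see Notes)

params = {
    "start_date": "1950-01-01",
    "end_date": "1951-12-31",
    "geometries": "users/me/my_sites_fc",  # GEE asset ID of a FeatureCollection
    "geometry_id_field": "gid",
    "gee_bands": "elm",
    "gee_scale": "native",
    "gee_years_per_task": 5,
    "gdrive_folder": "era5_exports",
    "job_name": "era5l_sites",
}
domain = sample_e5lh(params)  # starts one Drive export task per batch
domain.gdf.head()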