# You don't need to run this--it's just a workaround so images show up in these GitHub-rendered notebooks.
# (It just returns markdown that points at an existing local image file.)
from dapper.dev.notebooks import display_image_gh_notebook
from IPython.display import Markdown
Demonstrating scalability of met sampling
Note that this notebook is slightly outdated, but the point it tries to demonstrate is still intact.
Refactor note: In the current version of dapper, this workflow uses a Domain in mode='cellset'.
That tells the exporter to write one combined set of MET NetCDFs for the full grid (instead of one per cell/site).
This notebook demonstrates how we can create met outputs for a fairly large domain (Kuparuk watershed) using grid cells that correspond exactly to ERA5-Land hourly’s native grid spacing. The idea is to give you a sense of what’s possible in terms of scaling with dapper. I suggest not actually running the code here, but instead just reading through the notebook.
In this notebook, we will consider the watershed of the streamflow gage near the outlet of the Kuparuk River. Our goal will be to sample ERA5-Land hourly data at the native ERA5 resolution for all grid cells that intersect this watershed. It is OK not to have a rectangular domain when providing ELM met data, since ELM searches for the nearest point in the netCDF to extract its met data. Therefore, you can use any set of latitudes and longitudes. That’s what we’ll do here to avoid fetching and reformatting ERA5-Land hourly data that we don’t need. It mostly happens under the hood, anyway. :)
Here is our target watershed, the Kuparuk.
display(Markdown(display_image_gh_notebook('3-kuparak.jpg')))
First, we want to find all of the ERA5-Land hourly pixels or cells that intersect our watershed. I’ve already done some preprocessing of the ERA5-Land hourly grid that we can use. I have made these GEE Assets public, so you can reference them in any of your GEE code. Here are their addresses; all of this is already in the code so this is just for reference:
Center point of each ERA5-Land hourly cell:
https://code.earthengine.google.com/?asset=projects/ee-jonschwenk/assets/E3SM/e5lh_centerpoints
Polygons of each ERA5-Land hourly cell:
https://code.earthengine.google.com/?asset=projects/ee-jonschwenk/assets/E3SM/e5lh_grid
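If you want to poke at these assets in your own scripts, here is a minimal sketch of loading them as FeatureCollections (assuming Earth Engine has already been initialized with your own project):
import ee
# Public assets referenced above: ERA5-Land hourly cell center points and cell polygons.
e5lh_centerpoints = ee.FeatureCollection("projects/ee-jonschwenk/assets/E3SM/e5lh_centerpoints")
e5lh_grid = ee.FeatureCollection("projects/ee-jonschwenk/assets/E3SM/e5lh_grid")
# Quick sanity check that the assets are reachable (triggers a small server-side query).
print(e5lh_centerpoints.first().getInfo())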
Here’s a plot of the ERA5-Land hourly grids and points for our watershed.
display(Markdown(display_image_gh_notebook('3-kuparak-era5.jpg')))
In the following code block, we will intersect the Kuparuk gage’s watershed with the ERA5-Land hourly polygons and sample met data from each intersecting polygon. I am going to jump the gun and show you here which polygons will be sampled.
display(Markdown(display_image_gh_notebook('3-kuparak-era5-intersecting.jpg')))
Ok, now we can get to the actual code.
Set up our request via a dictionary of parameters
Some of the code here may be outdated.
We will build a dictionary of params that we’ll pass to the sample_e5lh function. There are a few new ones to take note of here.
geometries: this is the same as before; you can provide a GeoDataFrame containing the geometries you want to sample, a string representing a path to a GEE Asset, or an ee.FeatureCollection() object. We’ll do the last one in this example.
geometry_id_field: a string that points to the unique ID field in the GeoDataFrame, Asset, or FeatureCollection. Providing it is optional, but if you don’t, ngeegee will try to figure out which column should be used. ngeegee is pretty dumb, so it’s recommended you just specify this.
All other parameters were described in ERA5-Land hourly at site(s) to ELM-ready.
I’ve uploaded a shapefile of the Kuparuk to ngeegee’s data folder, so you should have it and the following code should just work.
import ee
from pathlib import Path
import geopandas as gpd
from dapper import sample_e5lh
# -----------------------------------------------------------------------------
# 0) GEE init
# -----------------------------------------------------------------------------
# IMPORTANT: set this to *your* GEE project (do not use mine).
ee.Initialize(project="<YOUR_GEE_PROJECT>")
# -----------------------------------------------------------------------------
# 1) Load a watershed polygon (or any AOI polygon)
# -----------------------------------------------------------------------------
# Point this at your own polygon file (shp / gpkg / geojson).
watershed_path = Path(r"<PATH_TO_WATERSHED_POLYGON_FILE>")
watershed = gpd.read_file(watershed_path).to_crs("EPSG:4326")
watershed["geometry"] = watershed.geometry.buffer(0)  # optional: can help repair minor geometry issues
# Convert first feature to an ee.Geometry
pgon = watershed.geometry.iloc[0]
watershed_ee = ee.Geometry.Polygon(list(pgon.exterior.coords))
# -----------------------------------------------------------------------------
# 2) Find the ERA5-Land-Hourly grid cells that intersect the watershed
# -----------------------------------------------------------------------------
# This example assumes you have (or can access) a FeatureCollection of *grid-cell polygons*
# with an ID field (here: "pids"). If you don't have one, you'll need to build/provide it.
e5lh_grid_asset = "projects/ee-jonschwenk/assets/E3SM/e5lh_grid" # example (public in the original notebook)
e5lh_grid = ee.FeatureCollection(e5lh_grid_asset)
intersecting = e5lh_grid.filter(ee.Filter.intersects(".geo", watershed_ee))
pids = intersecting.aggregate_array("pids").getInfo()
sample_these_cells = e5lh_grid.filter(ee.Filter.inList("pids", pids))
print(f"Intersecting grid cells: {len(pids)}")
# -----------------------------------------------------------------------------
# 3) Submit the GEE export tasks (raw CSV shards) + get a Domain back
# -----------------------------------------------------------------------------
# `sample_e5lh(...)` starts the Drive export tasks (unless skip_tasks=True) and returns
# a Domain describing the geometries you sampled (gid/lon/lat/geometry).
params = {
    "start_date": "1950-01-01",  # YYYY-MM-DD
    "end_date": "1957-01-01",  # YYYY-MM-DD
    "geometries": sample_these_cells,
    "geometry_id_field": "pids",  # the grid asset uses "pids" as its unique ID column
    "gee_bands": "elm",  # ELM-required bands
    "gee_years_per_task": 1,  # smaller tasks (more of them)
    "gee_scale": "native",
    "gdrive_folder": "ngee_test_cells",
    "job_name": "kuparuk_cellset",
}
# Set skip_tasks=False to actually submit tasks to your Google Drive.
domain_sites = sample_e5lh(params, domain_name="kuparuk_gridded", skip_tasks=True)
# Quick sanity check: this is the per-cell geometry table (one row per intersecting grid cell)
domain_sites.cells.head()
You will be sampling over 245 geometries.
Your request will be executed as 7 Tasks in Google Earth Engine.
Export task submitted: cell_test_1950-01-01_1951-01-01
Export task submitted: cell_test_1951-01-01_1952-01-01
Export task submitted: cell_test_1952-01-01_1953-01-01
Export task submitted: cell_test_1953-01-01_1954-01-01
Export task submitted: cell_test_1954-01-01_1955-01-01
Export task submitted: cell_test_1955-01-01_1956-01-01
Export task submitted: cell_test_1956-01-01_1957-01-01
'All export tasks started. Check Google Drive or Task Status in the Javascript Editor for completion.'
Now we wait.
We’ve fired off some jobs to GEE, and now we wait. For me, this took about 2-3 hours to fully export. You can wait if you want, but I’ve also zipped all the CSVs that these jobs will export and uploaded them to my GDrive. You can just download that, unzip it, and move on (but don’t forget to kill your GEE Tasks). You can run kill_all_tasks() from dapper.integrations.earthengine.gee_utils to do this, or cancel each one in the Tasks panel of the GEE Code Editor.
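If you do cancel from Python, a minimal sketch using the kill_all_tasks() helper mentioned above (its exact signature is assumed here) would be:
from dapper.integrations.earthengine.gee_utils import kill_all_tasks
# Assumes ee.Initialize(...) was already called earlier in this notebook.
# Cancels your pending/running GEE export tasks -- use with care if you have
# other exports running in the same project.
kill_all_tasks()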
Now we continue.
Once you have the data (7 csv files) in a local directory, you can move on.
Similar to Notebook 2, we will need to create a df_loc GeoDataFrame that is necessary for us to locate each geometry and ultimately create the netCDFs.
We then just specify some read/write paths and a few keyword arguments before calling e5lh_to_elm_gridded. Here I’ll discuss some of the keyword arguments.
df_loc: this needs to have lon, lat, and an ID column (pids for us). If your geometries are irregular polygons, you probably want to use centroids. The lat/lon values you put here are what the met data will be referenced to in the exported netCDF. (See the sketch just below this list.)
remove_leap: if True, this will remove leap days before exporting time series.
id_col: similar to above, if you don’t provide this, dapper will make a guess. Best practice is to just provide it.
nzones: I am not 100% sure about this one, but according to Fengming, you might want to divide your ELM run into zones. He suggested that you’d use as many zones as you have processors to run on. The number of netCDFs that are exported will multiply by the number of zones you specify (e.g. 2 zones will export 14 files: 7 variables x 2 zones). You can always re-run this with a different number of zones if you get something you’re not expecting. More zones means smaller (but more) files.
compress: if True, this will compress the netCDFs. Generally you want to compress these files to save space, especially if computing on many geometries over long times.
compress_level: can be set to 0-9, with higher numbers meaning more compression. More compression means it might take longer to export the files, but in my testing the difference between 4 and 9 was negligible.
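To make the df_loc requirement concrete, here is a rough sketch of how it might be built from the intersecting cell polygons. The cells_gdf name and file path are hypothetical, and the current dapper workflow has since replaced df_loc (see the refactor note below); this is just to illustrate the lon/lat/pids structure described above.
import geopandas as gpd
# Hypothetical: the intersecting ERA5-Land hourly cell polygons with their "pids" ID column.
cells_gdf = gpd.read_file("<PATH_TO_INTERSECTING_CELLS_FILE>").to_crs("EPSG:4326")
# Use centroids for irregular polygons; these lon/lat values are what the met data
# will be referenced to in the exported netCDFs.
centroids = cells_gdf.geometry.centroid
df_loc = gpd.GeoDataFrame(
    {"pids": cells_gdf["pids"], "lon": centroids.x, "lat": centroids.y},
    geometry=centroids,
    crs="EPSG:4326",
)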
That’s it. The other thing you should be aware of is that the netCDFs are packed under the hood. This essentially rescales the data so they can be stored as int16 instead of float, saving space. The parameters (add_offset and scale_factor) of this packing are stored in each netCDF itself, and when the netCDF is loaded, it automatically unpacks the data. At least it does when I open datasets using xarray. You do not need to worry about this packing process, but you should be aware that it’s happening.
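For example, here is a minimal sketch of what that looks like when reading one of the exported files with xarray (the file path is a placeholder):
import xarray as xr
# xarray applies scale_factor/add_offset automatically on read (mask_and_scale=True by
# default), so values come back as floats in physical units.
ds = xr.open_dataset("<PATH_TO_A_MET_NETCDF>")
# To inspect the raw packed int16 values and the packing attributes instead:
raw = xr.open_dataset("<PATH_TO_A_MET_NETCDF>", mask_and_scale=False)
var = list(raw.data_vars)[0]
print(raw[var].dtype, raw[var].attrs.get("scale_factor"), raw[var].attrs.get("add_offset"))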
Refactor note: In current dapper, the GEE step returns a Domain (not a df_loc table).
For a gridded export, we simply switch that Domain to mode="cellset" and then call Domain.export_met(...).
from pathlib import Path
from dapper import ERA5Adapter
# -----------------------------------------------------------------------------
# Convert the Domain returned by `sample_e5lh(...)` into a *cellset* Domain
# -----------------------------------------------------------------------------
# `sample_e5lh` returns a Domain in mode='sites' (one row per feature). For a gridded export,
# we want one combined output (mode='cellset').
domain = domain_sites.copy(mode="cellset")
# -----------------------------------------------------------------------------
# Local paths (after you download the GEE CSVs from Drive)
# -----------------------------------------------------------------------------
csv_directory = Path(r"<PATH_TO_DOWNLOADED_GEE_CSVS>") # raw GEE CSV shards
write_directory = Path(r"<PATH_TO_OUTPUT_DIRECTORY>") # MET outputs will land in: write_directory / "MET"
write_directory.mkdir(parents=True, exist_ok=True)
# -----------------------------------------------------------------------------
# Export ELM-ready gridded MET NetCDFs
# -----------------------------------------------------------------------------
adapter = ERA5Adapter()
met_outputs = domain.export_met(
    src_path=csv_directory,
    adapter=adapter,
    out_dir=write_directory,
    overwrite=False,
    calendar="noleap",
    dtime_resolution_hrs=1,
    dformat="BYPASS",
    append_attrs={"dapper_example": "watershed_gridded_cellset"},
)
met_dir = list(met_outputs.values())[0]
print("MET outputs written to:", met_dir)
Processing file 1 of 7: cell_test_1950-01-01_1951-01-01.csv
Processing file 2 of 7: cell_test_1951-01-01_1952-01-01.csv
Processing file 3 of 7: cell_test_1952-01-01_1953-01-01.csv
Processing file 4 of 7: cell_test_1953-01-01_1954-01-01.csv
Processing file 5 of 7: cell_test_1954-01-01_1955-01-01.csv
Processing file 6 of 7: cell_test_1955-01-01_1956-01-01.csv
Processing file 7 of 7: cell_test_1956-01-01_1957-01-01.csv
And that’s it!
Some analysis of our exports
Here is a screenshot of my write_directory after running Domain.export_met:
display(Markdown(display_image_gh_notebook('3-export-proof.jpg')))
You can see that we have 7 met var files (good). Each has a different file size on disk; this is expected, as packing and compression are more or less effective depending on the data themselves. You also see a new file called zone_mappings.txt. This is needed for OLMT runs and is where zones are specified. If you specified more than 1 zone, you will see files that end with zN, where N is the number of each zone.
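If you'd rather check the outputs programmatically than eyeball a screenshot, a quick listing of the export directory (reusing met_dir from the export cell above) does the trick:
from pathlib import Path
# met_dir was returned by domain.export_met(...) above.
for f in sorted(Path(met_dir).glob("*")):
    print(f"{f.name}  ({f.stat().st_size / 1e6:.2f} MB)")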