{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "5712e557",
"metadata": {},
"outputs": [],
"source": [
"# You don't need to run this--it's just a workaround so images show up in these GitHub-rendered notebooks.\n",
"# (It just returns markdown that points at an existing local image file.)\n",
"from dapper.dev.notebooks import display_image_gh_notebook\n",
"from IPython.display import Markdown"
]
},
{
"cell_type": "markdown",
"id": "8be352fc-1d78-4e9a-a476-c4ce9957ef7a",
"metadata": {},
"source": [
"# Demonstrating scalability of met sampling\n",
"\n",
"Note that this notebook is slightly outdated, but the point it tries to demonstrate is still intact.\n",
"\n",
"**Refactor note:** In the current version of `dapper`, this workflow uses a `Domain` in `mode='cellset'`.\n",
"That tells the exporter to write **one combined set** of MET NetCDFs for the full grid (instead of one per cell/site).\n",
"\n",
"This notebook demonstrates how we can create met outputs for a fairly large domain (Kuparuk watershed) using grid cells that correspond exactly to ERA5-Land hourly's native grid spacing. The idea is to give you a sense of what's possible in terms of scaling with `dapper`. I suggest not actually running the code here, but instead just reading through the notebook.\n",
" \n",
"In this notebook, we will consider the watershed of the streamflow gage near the outlet of the Kuparuk River. Our goal will be to sample ERA5-Land hourly data at the native ERA5 resolution for all grid cells that intersect this watershed. It is OK not to have rectangular domains when providing ELM met data, as ELM searches for the nearest point in the netCDF to extract its met data. Therefore, you can have any sets of latitutdes and longitudes. That's what we'll do here to avoid fetching and reformatting ERA5-Land hourly data that we don't need. It mostly happens under the hood, anyway. :) \n",
"\n",
"Here is our target watershed, the Kuparuk."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b835fa6a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(Markdown(display_image_gh_notebook('3-kuparak.jpg')))"
]
},
{
"cell_type": "markdown",
"id": "d321ff29",
"metadata": {},
"source": [
"First, we want to find all of the ERA5-Land hourly pixels or cells that intersect our watershed. I've already done some preprocessing of the ERA5-Land hourly grid that we can use. I have made these GEE Assets public, so you can reference them in any of your GEE code. Here are their addresses; all of this is already in the code so this is just for reference:\n",
"\n",
"\n",
"- Center point of each ERA5-Land hourly cell: `https://code.earthengine.google.com/?asset=projects/ee-jonschwenk/assets/E3SM/e5lh_centerpoints`\n",
"\n",
"- Polygons of each ERA5-Land hourly cell: `https://code.earthengine.google.com/?asset=projects/ee-jonschwenk/assets/E3SM/e5lh_grid`\n",
"\n",
"Here's a plot of the ERA5-Land hourly grids and points for our watershed."
]
},
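{
"cell_type": "markdown",
"id": "a1b2c3d4",
"metadata": {},
"source": [
"As a minimal sketch (separate from the workflow further down), here is how you could reference these public assets in the GEE Python API and peek at their properties. It assumes you've already run `ee.Initialize(...)` with your own project; the `pids` ID field shown here is the one we'll pass as `geometry_id_field` later.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a1b2c3d5",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: reference the public ERA5-Land hourly grid assets.\n",
"# Assumes ee.Initialize(project=...) has already been called with your own project.\n",
"import ee\n",
"\n",
"e5lh_centerpoints = ee.FeatureCollection(\"projects/ee-jonschwenk/assets/E3SM/e5lh_centerpoints\")\n",
"e5lh_grid = ee.FeatureCollection(\"projects/ee-jonschwenk/assets/E3SM/e5lh_grid\")\n",
"\n",
"# Peek at one grid cell's property names -- this is how you'd confirm the unique ID\n",
"# field ('pids' for these assets) before setting geometry_id_field later on.\n",
"print(e5lh_grid.first().propertyNames().getInfo())"
]
},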
{
"cell_type": "code",
"execution_count": 3,
"id": "f12806a5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(Markdown(display_image_gh_notebook('3-kuparak-era5.jpg')))"
]
},
{
"cell_type": "markdown",
"id": "db5c2a05",
"metadata": {},
"source": [
"In the following code block, we will intersect the Kuparuk gage's watershed with the ERA5-Land hourly polygons and sample met data from each intersecting polygon. I am going to jump the gun and show you here which polygons will be sampled."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "4f6d0f76",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(Markdown(display_image_gh_notebook('3-kuparak-era5-intersecting.jpg')))"
]
},
{
"cell_type": "markdown",
"id": "49e6598e",
"metadata": {},
"source": [
"Ok, now we can get to the actual code. \n",
"\n",
"## Set up our request via a dictionary of parameters\n",
"\n",
"**Some of the code here may be outdated.**\n",
"\n",
"We will build a dictionary of `params` that we'll pass to the `sample_e5lh` function. There are a few new ones to take note of here.\n",
"\n",
"- `geometries` : this is the same as before; you can provide a GeoDataFrame containing geometries you want to sample, or a string representing a path to a GEE Asset, or a ee.FeatureCollection() object. We'll do the last one in this example.\n",
"- `geometry_id_field` : this is a string that points to the unique ID field in the GeoDataFrame, Asset, or FeatureCollection. This is optional to provide, but if you don't provide it, `ngeegee` will try to figure out which column should be used. `ngeegee` is pretty dumb, so it's recommended you just specify this.\n",
"\n",
"All other parameters were described in `ERA5-Land hourly at site(s) to ELM-ready`.\n",
"\n",
"I've uploaded a shapefile of the Kuparuk to `ngeegee`'s data folder, so you should have it and the following code should just work."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fca8555-997f-4c28-a070-e4b55a8a167e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"You will be sampling over 245 geometries.\n",
"Your request will be executed as 7 Tasks in Google Earth Engine.\n",
"Export task submitted: cell_test_1950-01-01_1951-01-01\n",
"Export task submitted: cell_test_1951-01-01_1952-01-01\n",
"Export task submitted: cell_test_1952-01-01_1953-01-01\n",
"Export task submitted: cell_test_1953-01-01_1954-01-01\n",
"Export task submitted: cell_test_1954-01-01_1955-01-01\n",
"Export task submitted: cell_test_1955-01-01_1956-01-01\n",
"Export task submitted: cell_test_1956-01-01_1957-01-01\n"
]
},
{
"data": {
"text/plain": [
"'All export tasks started. Check Google Drive or Task Status in the Javascript Editor for completion.'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import ee\n",
"from pathlib import Path\n",
"\n",
"import geopandas as gpd\n",
"\n",
"from dapper import sample_e5lh\n",
"\n",
"# -----------------------------------------------------------------------------\n",
"# 0) GEE init\n",
"# -----------------------------------------------------------------------------\n",
"# IMPORTANT: set this to *your* GEE project (do not use mine).\n",
"ee.Initialize(project=\"\")\n",
"\n",
"# -----------------------------------------------------------------------------\n",
"# 1) Load a watershed polygon (or any AOI polygon)\n",
"# -----------------------------------------------------------------------------\n",
"# Point this at your own polygon file (shp / gpkg / geojson).\n",
"watershed_path = Path(r\"\")\n",
"watershed = gpd.read_file(watershed_path).to_crs(\"EPSG:4326\")\n",
"watershed = watershed.buffer(0) # optional: can help repair minor geometry issues\n",
"\n",
"# Convert first feature to an ee.Geometry\n",
"pgon = watershed.geometry.iloc[0]\n",
"watershed_ee = ee.Geometry.Polygon(list(pgon.exterior.coords))\n",
"\n",
"# -----------------------------------------------------------------------------\n",
"# 2) Find the ERA5-Land-Hourly grid cells that intersect the watershed\n",
"# -----------------------------------------------------------------------------\n",
"# This example assumes you have (or can access) a FeatureCollection of *grid-cell polygons*\n",
"# with an ID field (here: \"pids\"). If you don't have one, you'll need to build/provide it.\n",
"e5lh_grid_asset = \"projects/ee-jonschwenk/assets/E3SM/e5lh_grid\" # example (public in the original notebook)\n",
"e5lh_grid = ee.FeatureCollection(e5lh_grid_asset)\n",
"\n",
"intersecting = e5lh_grid.filter(ee.Filter.intersects(\".geo\", watershed_ee))\n",
"pids = intersecting.aggregate_array(\"pids\").getInfo()\n",
"\n",
"sample_these_cells = e5lh_grid.filter(ee.Filter.inList(\"pids\", pids))\n",
"print(f\"Intersecting grid cells: {len(pids)}\")\n",
"\n",
"# -----------------------------------------------------------------------------\n",
"# 3) Submit the GEE export tasks (raw CSV shards) + get a Domain back\n",
"# -----------------------------------------------------------------------------\n",
"# `sample_e5lh(...)` starts the Drive export tasks (unless skip_tasks=True) and returns\n",
"# a Domain describing the geometries you sampled (gid/lon/lat/geometry).\n",
"params = {\n",
" \"start_date\": \"1950-01-01\", # YYYY-MM-DD\n",
" \"end_date\": \"1957-01-01\", # YYYY-MM-DD\n",
" \"geometries\": sample_these_cells,\n",
" \"geometry_id_field\": \"pids\", # the grid asset uses \"pids\" as its unique ID column\n",
" \"gee_bands\": \"elm\", # ELM-required bands\n",
" \"gee_years_per_task\": 1, # smaller tasks (more of them)\n",
" \"gee_scale\": \"native\",\n",
" \"gdrive_folder\": \"ngee_test_cells\",\n",
" \"job_name\": \"kuparuk_cellset\",\n",
"}\n",
"\n",
"# Set skip_tasks=False to actually submit tasks to your Google Drive.\n",
"domain_sites = sample_e5lh(params, domain_name=\"kuparuk_gridded\", skip_tasks=True)\n",
"\n",
"# Quick sanity check: this is the per-cell geometry table (one row per intersecting grid cell)\n",
"domain_sites.cells.head()\n"
]
},
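{
"cell_type": "markdown",
"id": "c0ffee01",
"metadata": {},
"source": [
"The cell above submits the export tasks. If you'd like to check on them (or cancel them) from Python instead of the web Tasks panel, here's a minimal sketch using the standard `ee.batch.Task` interface; it isn't specific to `dapper`, and the `kill_all_tasks()` helper mentioned below is essentially a convenience version of the cancel loop.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0ffee02",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: check on (or cancel) GEE export tasks from Python.\n",
"# This uses the standard Earth Engine batch-task interface, nothing dapper-specific.\n",
"import ee\n",
"\n",
"tasks = ee.batch.Task.list()  # recent tasks for your account\n",
"for t in tasks[:10]:\n",
"    status = t.status()\n",
"    print(status.get(\"description\"), \"-\", status.get(\"state\"))\n",
"\n",
"# To cancel anything still pending/running (e.g. if you downloaded the pre-exported CSVs\n",
"# instead of waiting), you could do something like:\n",
"# for t in tasks:\n",
"#     if t.status().get(\"state\") in (\"READY\", \"RUNNING\"):\n",
"#         t.cancel()"
]
},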
{
"cell_type": "markdown",
"id": "7e963f41",
"metadata": {},
"source": [
"## Now we wait.\n",
"\n",
"We've fired off some jobs to GEE, and now we wait. For me, this took about 2-3 hours to fully export. You can wait if you want, but I've also just zipped all the `csv`s that will export from these jobs and uploaded them to [my GDrive](https://drive.google.com/drive/folders/1tKCroPzeaTT0K6SEbl9YtQtkuZYs90Gq?usp=sharing). You can just download this and unzip it and move on (but don't forget to kill your GEE Tasks). You can run `kill_all_tasks()` from `dapper.integrations.earthengine.gee_utils` to do this, or click on each one in the GEE Web Browser Task panel.\n",
"\n",
"## Now we continue.\n",
"\n",
"Once you have the data (7 `csv` files) in a local directory, you can move on.\n",
"\n",
"Similar to Notebook `2`, we will need to create a `df_loc` GeoDataFrame that is necessary for us to locate each geometry and ultimately create the netCDFs.\n",
"\n",
"We then just specify some read/write paths and a few keyword arguments before calling `e5hl_to_elm_gridded`. Here I'll discuss some of the keyword arguments.\n",
"\n",
"- `df_loc` : this needs to have `lon`, `lat`, and an id column (`pids` for us). If your geometries are irregular polygons, you probably want to use centroids. The lat/lon values you put here are what the met data will be referenced to in the exported netCDF.\n",
"- `remove_leap` : if `True`, this will remove leap days before exporting time series.\n",
"- `id_col` : similar to above, if you don't provide this, `dapper` will make a guess. Best practice is to just provide it.\n",
"- `nzones` : I am not 100% sure about this one, but according to Fengming, you might want to divide your ELM run into zones. He suggested that you'd use as many zones as you have processors to run on. The number of netCDFs that are exported will multiply by the number of zones you specify (e.g. 2 zones will export 14 files--7 variables x 2 zones). You can always re-run this with a different number of zones if you get something you're not expecting. More zones means smaller (but more) files.\n",
"- `compress` : If `True`, this will compress the netCDFs. I think generally you want to compress these files to save space, especiallly if computing on many geometries over long times.\n",
"- `compress_level` : can be set to 0-9, with higher numbers being more compression. More compression means it might take longer to export the files, but in my testing the difference between 4 and 9 was negligible.\n",
"\n",
"That's it. The other thing you should be aware of is that the netCDFs are *packed* under the hood. This is essentialy rescaling the data so they can be stored as `int16` instead of `float`, saving space. The parameters (`add_offset` and `scale_factor`) of this packing are stored in each netCDF itself, and when the netCDF is loaded, it automatically *unpacks* the data. At least it does when I open datasets using `xarray`. You do not need to worry about this *packing* process, but you should be aware that it's happening.\n",
"\n",
"**Refactor note:** In current `dapper`, the GEE step returns a `Domain` (not a `df_loc` table).\n",
"For a gridded export, we simply switch that `Domain` to `mode=\"cellset\"` and then call `Domain.export_met(...)`.\n"
]
},
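{
"cell_type": "markdown",
"id": "b2c3d4e5",
"metadata": {},
"source": [
"Before the actual export, here is a tiny standalone illustration of the packing/unpacking just described. This is toy data and *not* `dapper`'s export code; it just shows standard NetCDF `scale_factor`/`add_offset` encoding with `xarray` (assuming a NetCDF backend such as `netCDF4` is installed).\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e6",
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration of NetCDF packing/unpacking -- NOT dapper's actual export code.\n",
"import numpy as np\n",
"import xarray as xr\n",
"\n",
"# Some fake \"met\" data (e.g. air temperature in K).\n",
"data = xr.DataArray(np.linspace(250.0, 310.0, 100), dims=\"time\", name=\"TBOT\")\n",
"\n",
"# Packing: stored_int16 = round((value - add_offset) / scale_factor)\n",
"vmin, vmax = float(data.min()), float(data.max())\n",
"scale_factor = (vmax - vmin) / (2**16 - 2)\n",
"add_offset = (vmax + vmin) / 2\n",
"\n",
"data.to_netcdf(\n",
"    \"packed_demo.nc\",\n",
"    encoding={\"TBOT\": {\"dtype\": \"int16\", \"scale_factor\": scale_factor, \"add_offset\": add_offset, \"_FillValue\": -32767}},\n",
")\n",
"\n",
"# xarray applies scale_factor/add_offset automatically on read (mask_and_scale=True by default),\n",
"# so the values come back as floats with only a small quantization error.\n",
"roundtrip = xr.open_dataset(\"packed_demo.nc\")[\"TBOT\"]\n",
"print(float(abs(roundtrip - data).max()))"
]
},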
{
"cell_type": "code",
"execution_count": null,
"id": "62daec34",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Processing file 1 of 7: cell_test_1950-01-01_1951-01-01.csv\n",
"Processing file 2 of 7: cell_test_1951-01-01_1952-01-01.csv\n",
"Processing file 3 of 7: cell_test_1952-01-01_1953-01-01.csv\n",
"Processing file 4 of 7: cell_test_1953-01-01_1954-01-01.csv\n",
"Processing file 5 of 7: cell_test_1954-01-01_1955-01-01.csv\n",
"Processing file 6 of 7: cell_test_1955-01-01_1956-01-01.csv\n",
"Processing file 7 of 7: cell_test_1956-01-01_1957-01-01.csv\n"
]
}
],
"source": [
"from pathlib import Path\n",
"\n",
"from dapper import ERA5Adapter\n",
"\n",
"# -----------------------------------------------------------------------------\n",
"# Convert the Domain returned by `sample_e5lh(...)` into a *cellset* Domain\n",
"# -----------------------------------------------------------------------------\n",
"# `sample_e5lh` returns a Domain in mode='sites' (one row per feature). For a gridded export,\n",
"# we want one combined output (mode='cellset').\n",
"domain = domain_sites.copy(mode=\"cellset\")\n",
"\n",
"# -----------------------------------------------------------------------------\n",
"# Local paths (after you download the GEE CSVs from Drive)\n",
"# -----------------------------------------------------------------------------\n",
"csv_directory = Path(r\"\") # raw GEE CSV shards\n",
"write_directory = Path(r\"\") # MET outputs will land in: write_directory / \"MET\"\n",
"write_directory.mkdir(parents=True, exist_ok=True)\n",
"\n",
"# -----------------------------------------------------------------------------\n",
"# Export ELM-ready gridded MET NetCDFs\n",
"# -----------------------------------------------------------------------------\n",
"adapter = ERA5Adapter()\n",
"\n",
"met_outputs = domain.export_met(\n",
" src_path=csv_directory,\n",
" adapter=adapter,\n",
" out_dir=write_directory,\n",
" overwrite=False,\n",
" calendar=\"noleap\",\n",
" dtime_resolution_hrs=1,\n",
" dformat=\"BYPASS\",\n",
" append_attrs={\"dapper_example\": \"watershed_gridded_cellset\"},\n",
")\n",
"\n",
"met_dir = list(met_outputs.values())[0]\n",
"print(\"MET outputs written to:\", met_dir)\n"
]
},
{
"cell_type": "markdown",
"id": "05577061",
"metadata": {},
"source": [
"And that's it!\n",
"\n",
"## Some analysis of our exports\n",
"\n",
"Here is a screenshot of my `write_directory` after running `Domain.export_met`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "b9b4f97e",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display(Markdown(display_image_gh_notebook('3-export-proof.jpg')))"
]
},
{
"cell_type": "markdown",
"id": "27251e0e",
"metadata": {},
"source": [
"You can see that we have 7 met var files (good). Each has a different filesize on disk. This is expected as packing and compression are more/less effective depending on the data themselves. You also see a new file called `zone_mappings.txt`. This is needed for `OLMT` runs and is where zones are specified. If you specified more than 1 zone, you will see files that end with `zN` where `N` is the number of each zone."
]
}
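,
{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {},
"source": [
"Here's that quick sanity-check sketch. The exact filenames depend on your run, so this just grabs whatever NetCDFs are sitting in the `MET` directory; point `met_dir` at your own `write_directory`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e5f6a8",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: open one exported MET NetCDF and confirm it unpacks to floats.\n",
"# Point met_dir at your own write_directory / \"MET\".\n",
"from pathlib import Path\n",
"\n",
"import xarray as xr\n",
"\n",
"met_dir = Path(r\"\") / \"MET\"\n",
"nc_files = sorted(met_dir.glob(\"*.nc\"))\n",
"print(f\"Found {len(nc_files)} NetCDF files\")\n",
"\n",
"ds = xr.open_dataset(nc_files[0])\n",
"print(ds)  # variables come back as floats because xarray unpacks scale_factor/add_offset on read\n",
"\n",
"# The packing parameters are kept in each variable's encoding if you want to inspect them:\n",
"for name, var in ds.data_vars.items():\n",
"    print(name, {k: var.encoding.get(k) for k in (\"dtype\", \"scale_factor\", \"add_offset\")})"
]
}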
],
"metadata": {
"kernelspec": {
"display_name": "dapper",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}