Skip to content

Adding custom datasets

This guide explains how to add a new dataset source to your Climate API instance — for example a national meteorological service, a regional satellite product, or a custom model output.

The built-in dataset templates (CHIRPS3, ERA5-Land, WorldPop) ship as package data. Custom datasets are layered on top by pointing plugins_dir in your climate-api.yaml at a plugins directory. That directory serves two purposes: YAML dataset templates go in its datasets/ subfolder, and Python modules placed directly under it are importable by their dotted path (e.g. mypackage.sources.download) without installing them as a package.

Overview

Adding a custom dataset involves two things:

  1. A download function — a Python function that downloads data and writes it as one or more NetCDF files to a given directory.
  2. A dataset template YAML — a file that describes the dataset and tells the API which download function to call.

Step 1: Write the download function

The download function must be importable as a dotted Python path. The API calls it with keyword arguments and ignores the return value — the function is expected to write NetCDF files to dirname using prefix as the filename prefix.

# mypackage/sources/enacts.py
from pathlib import Path

def download(
    *,
    start: str,         # ISO 8601 date or datetime
    end: str,
    dirname: Path,      # directory to write output files into
    prefix: str,        # filename prefix (use e.g. f"{prefix}_{year}.nc")
    overwrite: bool,
    bbox: list[float],  # [xmin, ymin, xmax, ymax] — include only if your source needs it
    **kwargs: object,   # absorbs default_params from the YAML template
) -> None:
    """Download ENACTS rainfall and write NetCDF files to dirname."""
    ...

Required parameters — always passed by the API:

Parameter Type Description
start str Start of the requested time range (ISO 8601)
end str End of the requested time range (ISO 8601)
dirname Path Directory to write output NetCDF files into
prefix str Filename prefix for output files
overwrite bool Whether to overwrite existing cached files

Optional parameters — passed only when present in the function signature:

Parameter Type Description
bbox list[float] Bounding box as [xmin, ymin, xmax, ymax] — include this if your source requires a spatial filter
country_code str ISO 3166-1 alpha-3 code — include this if your source (e.g. WorldPop) requires a country code

Any extra keyword arguments from ingestion.default_params in the YAML template are forwarded as additional kwargs.

The API normalises coordinate names at write time: valid_timetime, lat/latitudey, lon/longitudex. Using the canonical names in your output avoids any ambiguity, but upstream names are handled automatically.

Install your package in the same environment as the Climate API:

pip install ./mypackage

Step 2: Create a dataset template YAML

Create a directory for your custom templates and add a YAML file. Each file contains a list of templates (even if there is only one):

# datasets/enacts_rainfall.yaml
- id: enacts_rainfall_daily
  name: ENACTS Rainfall (daily)
  short_name: Rainfall
  variable: rainfall
  period_type: daily
  sync:
    kind: temporal
    execution: append
  ingestion:
    function: mypackage.sources.enacts.download
  units: mm
  resolution: 4 km x 4 km
  source: ENACTS
  source_url: https://enacts.example.org

Template field reference

Identity

Field Required Description
id Yes Unique template identifier. This becomes the dataset ID in the API, e.g. enacts_rainfall_daily
name Yes Full human-readable name shown in API responses and STAC metadata
short_name No Short label used in compact displays
variable Yes Name of the data variable in the Zarr store (e.g. precip, t2m, rainfall)
source No Name of the upstream data source
source_url No URL to the upstream dataset documentation or landing page

Period and sync

Field Required Description
period_type Yes Temporal resolution: hourly, daily, monthly, yearly
sync.kind Yes temporal — data grows over time; release — versioned releases; static — never synced
sync.execution No append — new time steps appended to existing store; rematerialize — full rebuild on each sync
sync.availability No Provider availability policy — see below

Sync availability — how the API determines the latest available data:

sync:
  kind: temporal
  execution: append
  availability:
    latest_available_function: climate_api.providers.availability.lagged_latest_available
    lag_hours: 48
Field Description
latest_available_function Dotted path to a built-in availability function in climate_api.providers.availability
lag_hours / lag_days Data is delayed by this many hours or days
allow_future Allow requesting future dates (e.g. forecasts or projections). Default: false

Omit sync.availability entirely for static datasets or when you always want to sync up to the requested end date.

Ingestion

Field Required Description
ingestion.function Yes Dotted path to the download function
ingestion.default_params No Extra keyword arguments forwarded to the download function

Transforms — applied after download, before writing to Zarr:

transforms:
  - climate_api.transforms.kelvin_to_celsius
  - mypackage.transforms.my_custom_transform

See Transforms for the full pipeline description, built-in options, and how to write a custom transform.

Spatial and temporal extents — declares what the source dataset covers. Used to validate ingest requests before hitting the provider:

extents:
  spatial:
    bbox: [-180, -50, 180, 50]   # [xmin, ymin, xmax, ymax] in WGS84
    crs: http://www.opengis.net/def/crs/OGC/1.3/CRS84
  temporal:
    begin: "1981-01-01"
    end: "2030-12-31"            # omit if ongoing
    trs: http://www.opengis.net/def/uom/ISO-8601/0/Gregorian
    resolution: P1D              # ISO 8601 duration: PT1H, P1D, P1M, P1Y

If an ingest request's bounding box has no overlap with extents.spatial.bbox, the API returns HTTP 400 immediately. Partial overlap is allowed — the provider will return data for the intersecting area.

Units and display

Field Required Description
units No Physical units of the stored data (e.g. mm, degC, m)
resolution No Human-readable spatial resolution (e.g. 5 km x 5 km)
display.colormap No Colormap name for map rendering (e.g. blues, rdbu_r)
display.range No [min, max] display range for the colormap
display.nodata No No-data / fill value

Multiscale pyramid — pyramid Zarr stores are built automatically when the ingested dataset's spatial dimensions exceed 2048×2048 pixels. No YAML configuration is required; the pyramid level count is derived from the data shape and coarsening always uses mean aggregation.

Step 3: Point the instance at your plugins directory

Add plugins_dir to your climate-api.yaml and place your YAML file in the datasets/ subfolder:

plugins/
└── datasets/
    └── enacts_rainfall.yaml
extent:
  name: Rwanda
  bbox: [28.8, -2.9, 30.9, -1.0]

data_dir: ./data
plugins_dir: ./plugins/

All *.yaml files in plugins_dir/datasets/ are loaded and merged with the built-in templates (CHIRPS3, ERA5-Land, WorldPop). Custom templates are additive — the built-ins remain available unless you deliberately override one by using the same id.

Step 4: Ingest and publish

Once the API is running with CLIMATE_API_CONFIG pointing to your updated config, ingest as usual:

curl -s -X POST http://127.0.0.1:8000/ingestions \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": "enacts_rainfall_daily",
    "start": "2024-01-01",
    "end": "2024-01-31",
    "prefer_zarr": true,
    "publish": true
  }' | jq

Verify it appears in the STAC catalog:

curl -s http://127.0.0.1:8000/stac/catalog.json | jq '.links[] | select(.rel == "child")'

Minimal example

The smallest valid template for a static dataset with no sync:

- id: my_static_dataset
  name: My static dataset
  variable: value
  period_type: daily
  sync:
    kind: static
  ingestion:
    function: mypackage.sources.my_source.download