Architecture

This document explains how the Climate API is structured, why it is structured that way, and what the consequences of each design decision are. It is written for developers who will maintain or extend the platform over time.


Core concepts

The platform has four first-class concepts. Understanding the distinction between them is the foundation for understanding everything else.

Dataset template

A template is a YAML blueprint that describes a data source. Built-in templates live in climate_api/data/datasets/ inside the package (loaded via importlib.resources). Custom templates live in {plugins_dir}/datasets/, where plugins_dir is set in climate-api.yaml. A template has no state: it describes what could be ingested, not what has been ingested.

A template defines:

  • the dataset identifier and display metadata
  • the variable name, units, and period type
  • how to download the data (ingestion.function)
  • what transforms to apply (transforms)
  • what sync strategy to use (sync.kind, sync.execution)

Templates are config, not code. If a template needs custom logic, the logic goes into a Python function referenced by dotted path from the YAML.
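As an illustration of the dotted-path mechanism, the following is a minimal sketch of how a string from a template could be resolved to a callable. The helper name resolve_dotted_path is hypothetical and is not taken from the Climate API codebase; math.sqrt stands in for a real ingestion function so that the example runs with the standard library only.

```python
from importlib import import_module
from typing import Any, Callable


def resolve_dotted_path(path: str) -> Callable[..., Any]:
    """Import "package.module.func" and return the callable it names."""
    module_name, _, attr = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {path!r}")
    obj = getattr(import_module(module_name), attr)
    if not callable(obj):
        raise TypeError(f"{path!r} does not name a callable")
    return obj


# Hypothetical template fragment, already parsed from YAML into a dict.
# "math.sqrt" is only a stand-in for a real ingestion function path.
template = {"ingestion": {"function": "math.sqrt"}}
func = resolve_dotted_path(template["ingestion"]["function"])
result = func(9.0)
```

Resolving the path at call time keeps the template itself free of code: a bad path fails loudly with an import or attribute error instead of silently skipping the step.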

Artifact

An artifact is the internal record of a completed data ingestion. It is the persistence layer — not a public API concept. Each ingestion produces exactly one artifact, which records:

  • what dataset template it came from
  • the exact spatial extent and time range that was materialized
  • where the data lives on disk (path to the zarr store or netCDF files)
  • when it was created
  • whether it has been published

Multiple artifacts can exist for the same dataset template if data was ingested at different times (they form the version history). The most recent artifact for a given dataset_id is what the public API serves.

Artifacts are stored in {data_dir}/artifacts/records.json, where data_dir is the path configured in climate-api.yaml. This is an internal implementation detail — consumers should never depend on artifact IDs or artifact paths directly.

Managed dataset

A managed dataset is the public-facing view of the most recent artifact for a given template. It is what /datasets, /zarr, /stac, and /ogcapi expose. When an operator ingests or syncs a dataset, the managed dataset view updates to reflect the new artifact — the public ID stays stable.

The relationship is: one template → many artifacts over time → one managed dataset (the latest).
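The relationship can be sketched as a reduction over the artifact records. This is an illustrative model only: the ArtifactRecord field names are hypothetical, and the sketch assumes that unpublished artifacts are never served.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class ArtifactRecord:
    # Hypothetical field names; the real record schema may differ.
    artifact_id: str
    dataset_id: str
    created_at: datetime
    published: bool


def managed_view(records: List[ArtifactRecord]) -> Dict[str, ArtifactRecord]:
    """Map each dataset_id to its most recently created published artifact."""
    latest: Dict[str, ArtifactRecord] = {}
    for rec in records:
        if not rec.published:
            continue  # assumption: unpublished artifacts stay internal
        current = latest.get(rec.dataset_id)
        if current is None or rec.created_at > current.created_at:
            latest[rec.dataset_id] = rec
    return latest


records = [
    ArtifactRecord("a1", "chirps_daily", datetime(2024, 1, 1), True),
    ArtifactRecord("a2", "chirps_daily", datetime(2024, 2, 1), True),
    ArtifactRecord("a3", "chirps_daily", datetime(2024, 3, 1), False),
    ArtifactRecord("b1", "worldpop", datetime(2023, 6, 1), True),
]
view = managed_view(records)
```

The public ID is the dictionary key (dataset_id), which is why it stays stable while the artifact behind it changes.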

Extent

The extent is the spatial bounding box configured for this Climate API instance. It is set once in climate-api.yaml and does not change at runtime. Every ingestion is automatically scoped to this extent — operators do not specify it per-request.

This is a deliberate design constraint: each instance serves one place. A Sierra Leone instance serves Sierra Leone. Multi-country coverage requires multiple instances.


Data lifecycle

Template (YAML)
    │  POST /ingestions  (or  POST /sync)
Ingestion
    │  call ingestion function → NetCDF files on disk
    │  apply transforms
    │  reproject to instance CRS
    │  write GeoZarr store
    │  compute coverage (spatial + temporal extent of actual data)
Artifact (internal record)
    │  publish=true
Managed dataset (public API)
    ├── /datasets/{id}         — native metadata
    ├── /zarr/{id}             — raw zarr store access
    ├── /stac/collections/{id} — STAC discovery
    └── /ogcapi/collections/{id} — OGC API access

The ingestion function is called identically by both POST /ingestions and POST /sync — the framework invokes it the same way regardless of the trigger. A correctly written ingestion function works for both without any changes.

The framework is responsible for everything that happens after the ingestion function returns. An ingestion function only needs to write NetCDF files to a given directory. The framework then:

  1. reads and normalises the coordinate names
  2. applies transforms (unit conversion, etc.)
  3. reprojects to the instance CRS
  4. builds the zarr store with auto-computed chunking
  5. writes GeoZarr root attributes (spatial:bbox, proj:code) so map clients can position tiles
  6. computes artifact coverage (spatial bounds + time range) from the written data
  7. stores the artifact record
  8. publishes the managed dataset through pygeoapi if publish=true

This division means that ingestion functions do not need to know about zarr conventions, STAC, OGC, or pygeoapi. They write data files; the framework handles everything else.


Sync kinds

The sync.kind field in a template determines how a managed dataset is kept current.

sync.kind | On each sync                            | Use when
temporal  | Append new time steps, or rematerialize | Historical record that grows over time (CHIRPS, ERA5-Land)
release   | Rematerialize if a newer release exists | Versioned releases where each year/version is discrete (WorldPop)
static    | Never synced                            | One-time fixed dataset with no updates

The sync execution modes

Within sync.kind: temporal, two execution modes control what happens when new data is available:

  • append — downloads only the missing time range and appends it to the existing artifact
  • rematerialize — discards the existing artifact and rebuilds it from scratch

append is efficient for large historical datasets because it avoids re-downloading years of data on each sync. rematerialize is appropriate when old data may change retroactively (e.g. reanalysis products that are corrected after the fact).
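The difference between the two modes comes down to which time range is downloaded. The sketch below is a simplified model, not the sync engine itself: the function name is hypothetical, daily time steps are assumed, and it assumes that a sync does nothing when no new data is available.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def plan_download_range(
    execution: str,
    requested_start: date,
    existing_end: Optional[date],
    target_end: date,
) -> Optional[Tuple[date, date]]:
    """Return the (start, end) range to download, or None if up to date."""
    if execution not in ("append", "rematerialize"):
        raise ValueError(f"unknown execution mode: {execution!r}")
    if existing_end is not None and existing_end >= target_end:
        return None  # nothing new to fetch
    if execution == "rematerialize" or existing_end is None:
        return (requested_start, target_end)  # rebuild the whole series
    # append: fetch only the days after the existing artifact ends
    return (existing_end + timedelta(days=1), target_end)


start, have, want = date(2020, 1, 1), date(2024, 1, 10), date(2024, 1, 15)
append_range = plan_download_range("append", start, have, want)
rebuild_range = plan_download_range("rematerialize", start, have, want)
up_to_date = plan_download_range("append", start, want, want)
```

In this example append fetches five days, while rematerialize fetches four years of data to produce the same end state.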

Availability clamping

Providers publish data on a delay. The sync.availability block in a template tells the sync engine how far back from today data is reliably available:

sync:
  kind: temporal
  execution: append
  availability:
    latest_available_function: climate_api.providers.availability.lagged_latest_available
    lag_hours: 120

Before executing a sync, the engine calls the availability function to clamp the target end date. This prevents requesting data that has not yet been published, which would leave temporal gaps.
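A lag-based clamp can be sketched in a few lines. The function names and signatures here are hypothetical; the actual signature of climate_api.providers.availability.lagged_latest_available is not shown in this document and may differ.

```python
from datetime import datetime, timedelta, timezone


def lagged_latest(now: datetime, lag_hours: int) -> datetime:
    """Latest timestamp assumed to be published, given a fixed provider lag."""
    return now - timedelta(hours=lag_hours)


def clamp_end(requested_end: datetime, now: datetime, lag_hours: int) -> datetime:
    """Never ask for data newer than what the provider has published."""
    return min(requested_end, lagged_latest(now, lag_hours))


now = datetime(2024, 6, 10, 12, 0, tzinfo=timezone.utc)
# A request up to "today" is pulled back by 120 hours (5 days).
clamped = clamp_end(datetime(2024, 6, 10, tzinfo=timezone.utc), now, lag_hours=120)
# A request that already ends before the availability horizon is unchanged.
unchanged = clamp_end(datetime(2024, 6, 1, tzinfo=timezone.utc), now, lag_hours=120)
```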


The plugin contract

The platform has four extension points. Each one has a narrow contract — the framework handles everything else automatically.

Ingestion function

from pathlib import Path

def download(
    *,
    start: str,       # ISO 8601 date or datetime
    end: str,
    dirname: Path,    # write output files here
    prefix: str,      # use as filename prefix, e.g. f"{prefix}_{year}.nc"
    overwrite: bool,
    bbox: list[float],  # optional: declare only if the source needs a spatial filter
    **kwargs,           # default_params from the YAML template
) -> None:
    """Write one or more NetCDF files to dirname."""

The function writes NetCDF files. The framework reads them, normalises coordinate names, applies transforms, reprojects to the instance CRS, builds the zarr, writes GeoZarr attributes, computes coverage, and registers the artifact.

The ingestion function is called identically by POST /ingestions and POST /sync. The caller makes no difference to the function — it always receives the same parameters.
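One way an ingestion function might honour dirname, prefix, and overwrite is to plan one file per calendar year and skip files that already exist. This is a sketch under stated assumptions: the yearly split follows the f"{prefix}_{year}.nc" naming hint in the signature above, the helper yearly_targets is hypothetical, and the actual download and NetCDF writing are left out.

```python
import tempfile
from datetime import date
from pathlib import Path
from typing import List


def yearly_targets(
    start: str, end: str, dirname: Path, prefix: str, overwrite: bool
) -> List[Path]:
    """List the yearly files that still need to be written."""
    # start/end may be ISO dates or datetimes; the first 10 characters are the date.
    first = date.fromisoformat(start[:10]).year
    last = date.fromisoformat(end[:10]).year
    targets = []
    for year in range(first, last + 1):
        path = Path(dirname) / f"{prefix}_{year}.nc"
        if overwrite or not path.exists():
            targets.append(path)
    return targets


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "chirps_2021.nc").touch()  # pretend 2021 was downloaded earlier
    missing = [
        p.name
        for p in yearly_targets("2020-03-01", "2022-06-30T00:00:00", out, "chirps", False)
    ]
    everything = [
        p.name for p in yearly_targets("2020-03-01", "2022-06-30", out, "chirps", True)
    ]
```

Because the planning depends only on the arguments and the contents of dirname, the same function behaves the same way whether it is triggered by an ingestion or by a sync.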

Reusing ingestion logic across templates: multiple YAML templates can reference the same Python function and differentiate via default_params. This is the intended pattern for sources that have the same fetching logic but expose different variables:

# era5land_temperature_hourly.yaml
ingestion:
  function: dhis2eo.data.era5_land.download
  default_params:
    variable: 2m_temperature

# era5land_precipitation_hourly.yaml
ingestion:
  function: dhis2eo.data.era5_land.download
  default_params:
    variable: total_precipitation

No framework changes are needed to support a new variable from the same source.
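The effect of default_params can be sketched as a keyword merge at call time. The call_ingestion helper and fake_download function below are hypothetical stand-ins used for illustration; they are not part of the framework or of dhis2eo.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def call_ingestion(
    func: Callable[..., None],
    template: Dict[str, Any],
    *,
    start: str,
    end: str,
    dirname: Path,
    prefix: str,
    overwrite: bool = False,
) -> None:
    """Call an ingestion function, passing the template's default_params as kwargs."""
    params = dict(template.get("ingestion", {}).get("default_params", {}))
    func(start=start, end=end, dirname=dirname, prefix=prefix,
         overwrite=overwrite, **params)


seen: List[str] = []


def fake_download(*, start, end, dirname, prefix, overwrite, variable, **kwargs):
    # Stand-in for a shared download function; records which variable was requested.
    seen.append(variable)


temperature = {"ingestion": {"default_params": {"variable": "2m_temperature"}}}
precipitation = {"ingestion": {"default_params": {"variable": "total_precipitation"}}}
for tpl in (temperature, precipitation):
    call_ingestion(fake_download, tpl, start="2024-01-01", end="2024-01-31",
                   dirname=Path("."), prefix="era5land")
```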

Transform function

import xarray as xr

def my_transform(ds: xr.Dataset, dataset: dict) -> xr.Dataset:
    # Receive the dataset after download, return a modified dataset.
    # Modify ds[dataset["variable"]] values and variable attributes.
    # Do not modify dataset-level ds.attrs; the framework manages those.
    return ds  # placeholder: returns the dataset unchanged

Transforms are applied in order after the ingestion function returns, before the zarr is written. They receive the full xarray Dataset and the template dict. They return a modified Dataset. They do not write to disk.

Process execution function

def execute(*, source_dataset_id: str, **kwargs) -> dict:
    # Run a named operation (e.g. temporal resampling).
    # Return a JSON-serialisable result dict.
    return {}  # placeholder: empty result

Processes are named operations triggered via POST /processes/{id}/execution. They are broader than single-dataset transforms — they can read one managed dataset and produce another (e.g. daily → monthly aggregation).


The transform pipeline

Transforms are applied at a consistent point in the ingestion lifecycle:

  1. ingestion function writes raw NetCDF files to disk
  2. framework reads and normalises the data into an xarray Dataset
  3. _run_transforms(ds, dataset) applies each declared transform in order
  4. result is reprojected to instance CRS
  5. zarr store is written with auto-computed chunking
  6. framework writes GeoZarr root attributes
  7. framework computes coverage from the zarr

Transforms see post-download, pre-reproject data. They should only modify data values and variable-level attributes. The framework writes dataset-level attributes (GeoZarr) after the transform pipeline completes.
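The "applied in order" behaviour of step 3 amounts to a simple fold over the declared transforms. The sketch below is not the framework's _run_transforms: it uses plain dicts in place of xarray Datasets so that it runs without third-party packages, and the two example transforms are hypothetical.

```python
from typing import Any, Callable, Dict, List

Transform = Callable[[Any, Dict[str, Any]], Any]


def run_transforms(ds: Any, dataset: Dict[str, Any], transforms: List[Transform]) -> Any:
    """Apply each transform in declaration order, feeding each result to the next."""
    for transform in transforms:
        ds = transform(ds, dataset)
        if ds is None:
            raise TypeError(f"{transform.__name__} must return the modified dataset")
    return ds


def metres_to_millimetres(ds, dataset):
    var = dataset["variable"]
    return {**ds, var: [v * 1000 for v in ds[var]]}


def clip_negative(ds, dataset):
    var = dataset["variable"]
    return {**ds, var: [max(v, 0) for v in ds[var]]}


dataset = {"variable": "precip"}
raw = {"precip": [0, 2, -1]}  # stand-in for an xarray Dataset
result = run_transforms(raw, dataset, [metres_to_millimetres, clip_negative])
```

Each transform returns a new object instead of mutating its input, which keeps the raw data available for debugging and makes the order of declaration the only thing that determines the result.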


GeoZarr root attributes

Every zarr artifact must have GeoZarr root attributes for map rendering to work correctly. These are written into zarr.json at the store root:

  • spatial:bbox, the bounds [xmin, ymin, xmax, ymax] in the native CRS
  • proj:code, the CRS EPSG code (e.g. EPSG:32633 for UTM zone 33N, EPSG:4326 for WGS84)
  • zarr_conventions, the GeoZarr convention declaration

The map viewer reads spatial:bbox and proj:code to determine where to position tiles on the map.

The framework writes these attributes — plugins do not. They are written in build_dataset_zarr after transforms and reprojection, using the actual coordinate bounds of the final written data and the instance CRS.
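Deriving the two positional attributes from the final coordinates can be sketched as follows. This is an illustration rather than the code in build_dataset_zarr: the function name is hypothetical, the bounds are taken directly from the coordinate values (whether the framework pads them by half a cell is not stated here), and zarr_conventions is omitted because its content is not described in this document.

```python
from typing import Any, Dict, Sequence


def geozarr_root_attrs(x: Sequence[float], y: Sequence[float], crs: str) -> Dict[str, Any]:
    """Build spatial:bbox and proj:code from the written coordinate values."""
    return {
        "spatial:bbox": [min(x), min(y), max(x), max(y)],  # xmin, ymin, xmax, ymax
        "proj:code": crs,
    }


# y is often stored north-to-south, so min/max is used instead of first/last.
attrs = geozarr_root_attrs(x=[3.0, 3.5, 4.0], y=[60.0, 59.5, 59.0], crs="EPSG:4326")
```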


CRS handling

The instance CRS is configured in climate-api.yaml:

extent:
  name: Norway
  bbox: [3.0, 57.0, 32.0, 72.5]
  crs: EPSG:32633 # optional; defaults to EPSG:4326

Downloaded data is reprojected from the source CRS (source_crs in the template, default EPSG:4326) to the instance CRS during ingestion. The stored zarr is always in the instance CRS.

If no crs is set in the config, data is stored in EPSG:4326 (WGS84). This is the correct default for instances that do not need a metric CRS.
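The defaulting rules can be sketched as two small lookups over the parsed config and template. The helper names are hypothetical, and skipping reprojection when the two CRSs match is an assumption, not documented behaviour.

```python
from typing import Any, Dict

DEFAULT_CRS = "EPSG:4326"


def instance_crs(config: Dict[str, Any]) -> str:
    """CRS of the stored zarr: extent.crs if set, else WGS84."""
    return config.get("extent", {}).get("crs") or DEFAULT_CRS


def needs_reprojection(template: Dict[str, Any], config: Dict[str, Any]) -> bool:
    """True when the source CRS differs from the instance CRS."""
    source = template.get("source_crs") or DEFAULT_CRS
    return source != instance_crs(config)


norway = {"extent": {"name": "Norway", "bbox": [3.0, 57.0, 32.0, 72.5], "crs": "EPSG:32633"}}
no_crs = {"extent": {"name": "Sierra Leone"}}
template = {}  # no source_crs, so the source is assumed to be EPSG:4326
```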


Artifact deduplication and version history

When a new ingestion request arrives, the framework checks whether an existing artifact already covers the requested scope:

  • same dataset_id
  • same bbox (from the configured extent)
  • overlapping time range

If a match exists and overwrite=false, the existing artifact is returned without re-downloading. If overwrite=true, the existing artifact is replaced.
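The matching rule can be sketched as a scope comparison followed by a decision on overwrite. The Scope type, function names, and return values are hypothetical and only model the three criteria listed above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Scope:
    dataset_id: str
    bbox: Tuple[float, float, float, float]
    start: date
    end: date


def find_existing(records: List[Scope], request: Scope) -> Optional[Scope]:
    """Return the first record with the same dataset, same bbox, and overlapping time."""
    for rec in records:
        overlaps = rec.start <= request.end and request.start <= rec.end
        if rec.dataset_id == request.dataset_id and rec.bbox == request.bbox and overlaps:
            return rec
    return None


def decide(records: List[Scope], request: Scope, overwrite: bool) -> str:
    match = find_existing(records, request)
    if match is None:
        return "ingest"
    return "replace" if overwrite else "reuse"


bbox = (-13.5, 6.9, -10.2, 10.0)
records = [Scope("chirps_daily", bbox, date(2024, 1, 1), date(2024, 1, 31))]
request = Scope("chirps_daily", bbox, date(2024, 1, 15), date(2024, 2, 15))
other = Scope("era5land_temperature_hourly", bbox, date(2024, 1, 15), date(2024, 2, 15))
```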

The artifact store keeps the full history of records for sync deduplication and provenance. Old artifacts are not deleted automatically. For long-running instances, records.json grows over time. The long-term direction is a proper transactional store, but for the current scale (tens of artifacts per instance) a JSON file is adequate.


What the framework guarantees

Plugin code (ingestion functions, transforms, processes) can rely on the following being handled automatically by the framework:

Concern                                                     | Where handled
Coordinate name normalisation (lat → y, etc.)               | build_dataset_zarr
Reprojection to instance CRS                                | reproject_to_instance_crs
Zarr chunking (auto-sized from extents.temporal.resolution) | _compute_time_space_chunks
Multiscale pyramid generation (when dims > 2048×2048)       | build_dataset_zarr
GeoZarr root attributes (spatial:bbox, proj:code)           | build_dataset_zarr
Artifact coverage computation                               | _coverage_from_dataset
Artifact record persistence                                 | _store_artifact
pygeoapi publication                                        | publish_artifact_record (if publish=true)
STAC collection generation                                  | Generated dynamically from the artifact record

Plugin code only needs to produce data files. Everything else is the framework's responsibility.


Consequences of design choices

Single extent per instance

Each instance is configured for one place. This keeps the data model simple (no per-artifact extent tags) and the zarr stores small (country-scale downloads rather than global). The trade-off is that a national ministry whose data needs are sub-national must either run multiple instances or configure a single instance at national extent.

Temporal gaps are not allowed

The sync engine validates that new data connects to the end of the existing artifact before appending. If a gap exists, the sync fails rather than silently producing a dataset with a hole. This is a deliberate constraint: downstream consumers (DHIS2, CHAP) depend on continuous time series and should not receive data with silent gaps.
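The contiguity check can be sketched as a comparison between the end of the existing artifact and the start of the new data. The exception and function names are hypothetical, and a fixed daily step is assumed; the real engine presumably derives the step from the template's period type.

```python
from datetime import date, timedelta


class TemporalGapError(ValueError):
    """Raised when new data does not connect to the existing time series."""


def check_contiguous(
    existing_end: date, new_start: date, step: timedelta = timedelta(days=1)
) -> None:
    """Fail loudly if appending would leave a hole in the time series."""
    if new_start > existing_end + step:
        raise TemporalGapError(
            f"gap between {existing_end.isoformat()} and {new_start.isoformat()}"
        )


# Connects directly: the existing data ends on the 10th, the new data starts on the 11th.
check_contiguous(date(2024, 1, 10), date(2024, 1, 11))

try:
    check_contiguous(date(2024, 1, 10), date(2024, 1, 14))
    gap_detected = False
except TemporalGapError:
    gap_detected = True
```

Raising an error instead of appending makes the gap visible to the operator at sync time, before any downstream consumer reads the dataset.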

The append execution mode avoids re-downloading history

append downloads only the missing range and rebuilds the full zarr from all cached files. This means the local cache (NetCDF files in data/downloads/) is the source of truth for the full time series; the zarr is a derived view. If the cache is deleted, a rematerialize is required to recover.

Transforms run after download, before reproject

Transforms see raw downloaded values in the source CRS and source units. The order is: download → transform → reproject → write zarr.