Implementation Status¶

Purpose¶

This note captures the current implementation state of the branch after the API consolidation around ingestions, datasets, extents, raw Zarr access, STAC discovery, and pygeoapi publication.

It is intended to answer:

what the main branch now exposes
what is intentionally internal
how the current pieces fit together
what remains to be refined

Current API Surface¶

The main branch now centers on one narrow vertical slice:

define dataset templates in the Climate API registry
define configured extents for the Climate API instance
ingest data into a managed dataset for one dataset template plus one extent
publish that managed dataset through pygeoapi under /ogcapi
expose native metadata under /datasets, STAC discovery under /stac, and raw Zarr access under /zarr
sync existing managed datasets forward through /sync

The public surface is intentionally small:

/ingestions
/extent
/datasets
/stac/...
/zarr/{dataset_id}
/sync/{dataset_id}
/ogcapi/...

Main Code References¶

src/climate_api/main.py
app assembly and router mounting
src/climate_api/ingestions/routes.py
ingestion, dataset, Zarr, and sync routes
src/climate_api/ingestions/services.py
internal artifact persistence, dataset grouping, sync service wiring, Zarr browsing
src/climate_api/ingestions/sync_engine.py
sync planning and execution engine
src/climate_api/ingestions/schemas.py
public ingestion, dataset, and sync contracts
src/climate_api/providers/availability.py
provider-specific sync availability policies
src/climate_api/extents/routes.py
extent discovery endpoint
src/climate_api/extents/services.py
extent registry backed by CLIMATE_API_CONFIG
src/climate_api/publications/services.py
pygeoapi publication and stable managed dataset id logic
extent: block in climate-api.yaml (CLIMATE_API_CONFIG)
configured spatial extent for this Climate API instance

What Was Achieved¶

1. Public ingestion contract now uses `extent_id`¶

POST /ingestions now takes:

dataset_id
start
end
extent_id
overwrite
prefer_zarr
publish

Raw bbox and country_code are no longer part of the public ingestion payload.

The route resolves extent_id inside Climate API and then calls the downloader with concrete spatial inputs.
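
As a rough illustration of the payload shape, the sketch below mirrors the listed fields in a small dataclass. The class name, the helper names, the default values, and the example ids are assumptions made for illustration, not the actual schema in schemas.py.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Spatial inputs that are no longer accepted in the public payload.
RETIRED_FIELDS = {"bbox", "country_code"}


@dataclass
class IngestionRequest:
    """Illustrative mirror of the POST /ingestions fields (not the real schema)."""

    dataset_id: str
    start: str
    end: Optional[str] = None
    extent_id: Optional[str] = None
    overwrite: bool = False   # default is an assumption
    prefer_zarr: bool = True  # default is an assumption
    publish: bool = False     # default is an assumption


def validate_payload(payload: dict) -> dict:
    """Reject payloads that still carry the retired spatial inputs."""
    retired = payload.keys() & RETIRED_FIELDS
    if retired:
        raise ValueError(f"unsupported fields: {sorted(retired)}; use extent_id")
    return payload


def build_payload(request: IngestionRequest) -> dict:
    """Serialize the request, omitting optional fields that were left unset."""
    payload = {key: value for key, value in asdict(request).items() if value is not None}
    return validate_payload(payload)


if __name__ == "__main__":
    # "example_template" and "example_extent" are hypothetical ids.
    print(build_payload(IngestionRequest("example_template", "2024-01", extent_id="example_extent")))
```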

2. Public ingestion responses now return datasets, not artifacts¶

POST /ingestions, GET /ingestions, and GET /ingestions/{ingestion_id} now define the operational ingestion surface.

POST /ingestions and GET /ingestions/{ingestion_id} return:

ingestion_id
status
dataset

The dataset field uses the public dataset summary model from /datasets, not the full dataset detail view with version history.

Internal artifact records still exist, but they no longer shape the public responses.

GET /ingestions lists ingestion run records for admin and operational use. /datasets remains the canonical managed-data surface for consumers.

3. Extents are now a first-class read-only part of the native API¶

The branch exposes:

GET /extent

Extents are configured in YAML and currently include:

extent_id
name
description
bbox

This keeps spatial configuration explicit without turning it into a runtime write API.
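
A minimal sketch of how a configured extent could be parsed and resolved to a bbox, assuming the YAML has already been loaded into a dict. The layout of the extent block, the bbox axis order, and the function names are assumptions; the real logic lives in extents/services.py.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Extent:
    extent_id: str
    name: str
    description: Optional[str]
    bbox: Tuple[float, float, float, float]  # assumed order: west, south, east, north


def parse_extent(config: dict) -> Extent:
    """Build an Extent from the (assumed) shape of the extent block."""
    block = config["extent"]
    bbox = tuple(float(value) for value in block["bbox"])
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values")
    west, south, east, north = bbox
    if not (west < east and south < north):
        raise ValueError("bbox must satisfy west < east and south < north")
    return Extent(block["extent_id"], block["name"], block.get("description"), bbox)


def resolve_bbox(extent_id: str, configured: Extent) -> Tuple[float, float, float, float]:
    """Turn a public extent_id into the concrete bbox handed to the downloader."""
    if extent_id != configured.extent_id:
        raise KeyError(f"unknown extent_id: {extent_id}")
    return configured.bbox


if __name__ == "__main__":
    # Hypothetical configuration values.
    config = {"extent": {"extent_id": "example_extent", "name": "Example", "bbox": [-13.5, 6.9, -10.2, 10.0]}}
    print(resolve_bbox("example_extent", parse_extent(config)))
```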

4. `/datasets` is now the native managed-data catalog¶

GET /datasets returns a public dataset catalog envelope:

kind
items

Each dataset item includes:

public dataset id
source dataset template id
dataset metadata from the registry
current extent
last updated timestamp
public links
publication status

The public dataset response no longer exposes internal artifact ids, artifact counts, filesystem paths, or downloader implementation details.
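
One way to keep internal details out of the catalog is to project each record through an allowlist rather than deleting known internal keys. The sketch below assumes hypothetical key names and a hypothetical kind value; only the envelope fields kind and items come from the description above.

```python
from typing import Any, Dict, List

# Hypothetical public key names; anything not listed here is dropped.
PUBLIC_FIELDS = (
    "dataset_id",
    "source_dataset_id",
    "metadata",
    "extent",
    "last_updated",
    "links",
    "publication",
)


def to_public_item(record: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted keys, so new internal fields never leak by default."""
    return {key: record[key] for key in PUBLIC_FIELDS if key in record}


def build_catalog(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap the public items in the kind/items envelope."""
    return {"kind": "DatasetCatalog", "items": [to_public_item(r) for r in records]}  # kind value is assumed
```

An allowlist fails closed: a field added to the internal record later stays private until someone deliberately exposes it.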

5. Raw Zarr access is now canonical under `/zarr/{dataset_id}`¶

The raw data surface is:

GET /zarr/{dataset_id}
GET /zarr/{dataset_id}/{relative_path}

The public Zarr listing response now avoids leaking internal artifact ids and raw filesystem roots. It returns:

kind
dataset_id
path
entries

Entry links point back into the canonical /zarr/{dataset_id}/... namespace.
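
The listing shape can be sketched as follows. The kind value and the per-entry name and href keys are assumptions; the top-level fields and the /zarr/{dataset_id}/... link namespace follow the description above.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def build_listing(dataset_id: str, relative_path: str, names: Iterable[str]) -> Dict[str, object]:
    """Describe one directory level of a Zarr store without exposing filesystem roots."""
    if relative_path.startswith("/") or ".." in PurePosixPath(relative_path).parts:
        raise ValueError("relative_path must stay inside the dataset store")
    base = PurePosixPath("/zarr") / dataset_id
    current = base / relative_path if relative_path else base
    entries: List[Dict[str, str]] = [
        {"name": name, "href": str(current / name)}  # links stay in the /zarr namespace
        for name in sorted(names)
    ]
    return {
        "kind": "ZarrListing",  # assumed value
        "dataset_id": dataset_id,
        "path": relative_path,
        "entries": entries,
    }
```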

6. STAC is now the public discovery surface for published Zarr datasets¶

The branch exposes a dedicated STAC surface under:

/stac
/stac/catalog.json
/stac/collections/{dataset_id}

Published Zarr-backed managed datasets appear there as one STAC Collection per dataset. The zarr asset points to the canonical native /zarr/{dataset_id} route.

xstac derives Datacube metadata from the opened Zarr-backed dataset, while the Climate API service layer remains responsible for publication filtering, link construction, and Zarr asset metadata.

Current STAC details:

pyramid Zarr stores (detected by the presence of a 0/ level on disk) expose /zarr/{dataset_id}/0 as the canonical asset href
temporal extents are normalized to RFC 3339 in both STAC and Datacube temporal extent fields
STAC collection license currently defaults to the value `various`
spatial step values are rounded for readability while preserving axis direction
an opt-in live interoperability smoke test exists at tests/integration/test_stac_interop.py
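
Three of the details above can be sketched in a few lines. The function names, the assumption that naive datetimes are UTC, and the rounding precision are illustrative choices, not the service's actual code.

```python
from datetime import datetime, timezone
from pathlib import Path


def zarr_asset_href(dataset_id: str, store_root: Path) -> str:
    """Point at level 0 when the store on disk looks like a pyramid."""
    if (store_root / "0").is_dir():
        return f"/zarr/{dataset_id}/0"
    return f"/zarr/{dataset_id}"


def to_rfc3339(value: datetime) -> str:
    """Format a datetime as an RFC 3339 UTC timestamp."""
    if value.tzinfo is None:
        value = value.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def round_step(step: float, digits: int = 6) -> float:
    """Round a spatial step for readability; the sign (axis direction) is kept."""
    return round(step, digits)
```

A negative latitude step, typical for north-to-south grids, stays negative after rounding, so clients can still infer the axis direction.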

7. pygeoapi remains the OGC query and coverage surface¶

Published datasets are exposed through:

/ogcapi/collections
/ogcapi/collections/{dataset_id}
/ogcapi/collections/{dataset_id}/coverage

From the native FastAPI side, dataset responses include publication state and links to the OGC collection, but the collection resource itself is only public under /ogcapi.

8. Internal artifacts still exist as a storage/provenance model¶

The branch still persists internal artifact records in data/artifacts/records.json.

Those internal records retain:

exact request scope
stored format
creation time
publication mapping
deduplication and sync history inputs

This internal model remains necessary for provenance and sync behavior, but it is no longer a public API concept.

The current JSON-backed store is still an interim persistence layer. Record mutations now use file locking to avoid lost updates during concurrent writes, but the long-term direction should be a proper transactional store.
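
For illustration, a locked read-modify-write cycle over a JSON file might look like the sketch below. It assumes a POSIX platform (fcntl is not available on Windows), a sidecar lock file, and a list-shaped record file; none of these are confirmed details of the current store.

```python
import fcntl
import json
import os
from pathlib import Path
from typing import Callable


def mutate_records(path: Path, mutate: Callable[[list], list]) -> list:
    """Apply mutate to the stored records while holding an exclusive lock."""
    path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = path.with_suffix(path.suffix + ".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            records = json.loads(path.read_text()) if path.exists() else []
            updated = mutate(records)
            tmp_path = path.with_suffix(path.suffix + ".tmp")
            tmp_path.write_text(json.dumps(updated, indent=2))
            os.replace(tmp_path, path)  # atomic swap, readers never see a partial file
            return updated
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

The lock serializes writers on one host, but it does not provide transactions or cross-host safety, which is one reason a proper transactional store remains the long-term direction.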

9. `/sync` is now a testable managed dataset update path¶

The sync API now exposes:

GET /sync/{dataset_id}/plan?end={period}
POST /sync/{dataset_id}

The plan endpoint returns a dry-run SyncDetail without downloading or writing data. The POST endpoint executes the same plan through the existing artifact creation path when work is required.

Implemented sync behavior:

temporal datasets can append missing periods
release datasets rematerialize when a newer requested release exists
static datasets return not_syncable
provider availability policies clamp unsafe future targets before execution
append V1 downloads only the missing range, then rebuilds the canonical artifact from local cache
Zarr materialization clips cached upstream data to the requested artifact scope
artifact reuse ignores records whose stored coverage does not match the requested scope
newly materialized artifacts are rejected when realized temporal coverage does not match the requested scope
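
The planning rules above can be sketched as a small pure function. This is not the signature of sync_engine.plan_sync; the function name, the Plan fields, the append and rematerialize action names, and the use of zero-padded YYYY-MM strings for periods and releases are all assumptions. Only up_to_date and not_syncable are taken from this note.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    action: str
    target: Optional[str] = None
    delta_start: Optional[str] = None


def next_month(period: str) -> str:
    """Return the month after a zero-padded YYYY-MM period."""
    year, month = map(int, period.split("-"))
    return f"{year + month // 12:04d}-{month % 12 + 1:02d}"


def sketch_plan(kind: str, covered_end: str, requested_end: str, latest_available: str) -> Plan:
    """Decide what a sync would do, without downloading or writing anything."""
    if kind == "static":
        return Plan("not_syncable")
    # Clamp targets the provider cannot serve yet (zero-padded strings sort chronologically).
    target = min(requested_end, latest_available)
    if target <= covered_end:
        return Plan("up_to_date", target=covered_end)
    if kind == "release":
        return Plan("rematerialize", target=target)
    # Temporal datasets fetch only the missing range after the covered end.
    return Plan("append", target=target, delta_start=next_month(covered_end))
```

Keeping the planner free of I/O is what makes the dry-run endpoint cheap: the same function can back both the plan and the execution path.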

How The Current Flow Works¶

Ingestion¶

client submits dataset_id, start, optional end, and optional extent_id, plus the overwrite, prefer_zarr, and publish flags
Climate API resolves the dataset template from the registry
Climate API resolves extent_id to a concrete bbox or other configured spatial input
Climate API checks for an existing matching internal artifact
if needed, Climate API downloads the source data
Climate API prefers Zarr materialization and falls back to NetCDF when needed
Climate API computes realized coverage metadata
Climate API stores an internal artifact record
if publish=true, Climate API publishes the dataset through pygeoapi
the route returns the public managed dataset view
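
The reuse check in the steps above could be sketched as an exact scope match over stored records. The record keys and the rule that overwrite bypasses reuse are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def find_reusable(
    records: List[Dict[str, str]],
    dataset_id: str,
    extent_id: str,
    start: str,
    end: str,
    overwrite: bool = False,
) -> Optional[Dict[str, str]]:
    """Return the newest record whose stored coverage equals the requested scope."""
    if overwrite:
        return None  # assumption: overwrite always forces a fresh materialization
    for record in reversed(records):  # assumes records are stored oldest first
        same_dataset = (record["dataset_id"], record["extent_id"]) == (dataset_id, extent_id)
        same_coverage = (record["coverage_start"], record["coverage_end"]) == (start, end)
        if same_dataset and same_coverage:
            return record
    return None
```

Requiring an exact coverage match means a record that covers only part of the requested range is ignored rather than silently reused.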

Dataset publication¶

publication derives a stable managed dataset id
pygeoapi resources are regenerated from published internal artifacts
STAC collection documents are derived dynamically from the same published artifact state
the mounted pygeoapi sub-application is refreshed in process
the dataset becomes available immediately under /stac/collections/{dataset_id} and /ogcapi/collections/{dataset_id}
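
This note does not spell out how the stable id is derived. Purely as an illustration of the idea, the sketch below builds a deterministic, URL-safe id from a dataset template id and an extent id; the scheme and the function name are hypothetical and should not be read as the logic in publications/services.py.

```python
import re


def managed_dataset_id(template_id: str, extent_id: str) -> str:
    """Derive the same id for the same template and extent on every call."""
    raw = f"{template_id}_{extent_id}".lower()
    # Collapse anything that is not a lowercase letter or digit into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", raw).strip("_")
```

Because the id depends only on the template and the extent, a new version produced by a later sync can be published under the same /stac and /ogcapi URLs.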

Raw data access¶

/datasets/{dataset_id} exposes native metadata and version summary
/stac/collections/{dataset_id} exposes standards-friendly discovery metadata for direct Zarr-opening clients
/zarr/{dataset_id} exposes the raw Zarr store layout when the latest version is Zarr-backed
/ogcapi/collections/{dataset_id}/coverage exposes standards-facing coverage access

Sync¶

GET /sync/{dataset_id}/plan resolves the latest local artifact and source template
sync_engine.plan_sync(...) computes the action, target, and delta range
provider availability metadata clamps unsupported future targets
POST /sync/{dataset_id} returns up_to_date or not_syncable without writes when applicable
otherwise, sync calls the existing artifact creation path
the new version is optionally published under the same stable managed dataset id

Current Public Surface¶

Native FastAPI¶

POST /ingestions
GET /ingestions
GET /ingestions/{ingestion_id}
GET /extent
GET /datasets
GET /datasets/{dataset_id}
GET /datasets/{dataset_id}/download
GET /zarr/{dataset_id}
GET /zarr/{dataset_id}/{relative_path}
POST /sync/{dataset_id}
GET /sync/{dataset_id}/plan

Standards-facing¶

GET /stac
GET /stac/catalog.json
GET /stac/collections/{dataset_id}
GET /ogcapi/collections
GET /ogcapi/collections/{dataset_id}
GET /ogcapi/collections/{dataset_id}/coverage

What Is Still Deferred¶

a final decision on how much version history to expose publicly
richer extent configuration shapes beyond id + bbox + optional metadata
any runtime write API for extents
multi-version publication resolution behind one dataset id
true in-place Zarr append, if storage semantics require it later
upstream dhis2eo improvements so provider download boundaries can respect partial months directly

Short Summary¶

The branch now presents a smaller, dataset-centered public surface:

run ingestions through /ingestions as an execution and admin surface
return datasets, not artifacts
discover managed data under /datasets
discover published Zarr-backed datasets under /stac/catalog.json
access raw Zarr under /zarr/{dataset_id}
sync managed datasets through /sync/{dataset_id}
use /ogcapi for standards-facing query and coverage access

Internal artifacts still exist, but only as a storage and provenance model.