ADR 0002 — Seed data strategy

  • Status: Accepted
  • Date: 2026-05-04
  • Decision drivers: FR-18 (indicator catalogue), FR-19 (sector factor maps), FR-20 (regime taxonomy), the 200 k-node AuraDB Free ceiling, and the goal that anyone cloning the repo can stand up a useful graph in one command.

Context

The graph schema (Phase 0) is empty. Before any ingestion adapter can run, the graph needs a static spine: the GICS sector hierarchy, the macro regime taxonomy, the indicator catalogue, the sector-factor map, and a starting universe of companies. These are not derived from any external feed in real time — they are taxonomies and curated relationships that change rarely. They belong in version control.

Two open questions:

  1. Where do these seeds live? In the repo (YAML/CSV) or in a managed external service?
  2. Where do per-company facts that do drift (CIK, share class details, sector reclassifications) come from?

Decisions

1. Static seeds live in schema/seed/ as YAML + CSV

Five files, plain text, version-controlled, reviewable:

| File | Purpose | Rows |
|---|---|---|
| gics_sectors.yaml | 11 GICS top-level sectors plus the industry-groups / industries / sub-industries referenced by sector_maps | ~58 |
| regimes.yaml | 6 macro regime archetypes plus the 2 currently-active hybrid regimes | 7 |
| indicators.yaml | The factor universe: ~50 indicators across 8 categories | 54 |
| sp100_tickers.csv | S&P 100 bootstrap: ticker → name → top-level sector. CIK is not in the file | 100 |
| sector_maps.yaml | 7 priority sectors × ~10 edges each: indicator → sector with edge type, sign, magnitude, lag, regime gate | 68 edges |

Reasons: YAML diffs read cleanly in PR review; CSV is the obvious format for a flat ticker list; everything is plain text so the source of any value is git blame-able. Alternative considered: a managed Notion / Airtable mirror — rejected because it adds an external dependency and breaks reproducibility.

2. CIK is resolved at load time, not stored in the seed

The S&P 100 CSV intentionally omits CIK. CIKs are resolved against EDGAR's company_tickers.json by the ingestion adapter (Phase 2). Reasons: CIK assignments occasionally change with corporate actions, and embedding them in the seed creates silent staleness. The ticker is the durable identity in the seed; the loader treats CIK as a property to be filled in later.
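A minimal sketch of that lookup, assuming the adapter has already downloaded and parsed company_tickers.json. The SEC publishes it as an object keyed by row index whose values carry cik_str, ticker, and title. The function names build_cik_index and resolve_cik and the zero-padding to ten digits are illustrative assumptions, not the adapter's confirmed behaviour.

```python
from typing import Dict, Mapping, Optional


def build_cik_index(company_tickers: Mapping[str, Mapping[str, object]]) -> Dict[str, str]:
    """Map upper-cased ticker -> 10-digit zero-padded CIK string."""
    index: Dict[str, str] = {}
    for row in company_tickers.values():
        ticker = str(row["ticker"]).upper()
        index[ticker] = str(row["cik_str"]).zfill(10)
    return index


def resolve_cik(ticker: str, index: Mapping[str, str]) -> Optional[str]:
    """Return the CIK for a seed ticker, or None so the property stays unset."""
    return index.get(ticker.upper())


# Inline sample in the same shape as the EDGAR file (two rows only).
sample = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 789019, "ticker": "MSFT", "title": "MICROSOFT CORP"},
}
index = build_cik_index(sample)
```

Returning None for an unknown ticker, rather than raising, matches the idea that CIK is a property filled in later and never part of the seed identity.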

3. Sector edges use a synthetic key for idempotent MERGE

sector_maps.yaml allows the same (indicator, sector, edge_type) tuple to appear twice with different regime_conditional lists and opposite signs. For example, Fed Funds Rate → Financials is + in REFLATION/GOLDILOCKS but − in SLOWDOWN/CRISIS. Because a Neo4j relationship MERGE matches on (start, type, end, props-in-MERGE-clause), we encode the natural composite key, f"{indicator}|{sector}|{type}|{regimes}|{sign}", into a key property and MERGE on that. Re-running seed apply updates rationale/magnitude/lag without duplicating edges.
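The key construction can be sketched as follows. The IDs (FED_FUNDS, FINANCIALS, RATE_SENSITIVITY), the node labels, the SECTOR_EDGE relationship type, and the decision to sort the regime list before joining are all assumptions made for illustration; only the key layout comes from this ADR.

```python
from typing import Iterable


def edge_key(indicator: str, sector: str, edge_type: str,
             regimes: Iterable[str], sign: str) -> str:
    """Build the synthetic key used to MERGE one sector-map edge."""
    # Assumption: regimes are sorted so YAML list order cannot change the key.
    regime_part = ",".join(sorted(regimes))
    return f"{indicator}|{sector}|{edge_type}|{regime_part}|{sign}"


# Hypothetical Cypher; labels, relationship type, and property names are illustrative.
# MERGE matches on the key alone, then SET refreshes the mutable properties.
MERGE_EDGE = """
MATCH (i:Indicator {id: $indicator}), (s:Sector {id: $sector})
MERGE (i)-[r:SECTOR_EDGE {key: $key}]->(s)
SET r.rationale = $rationale, r.magnitude = $magnitude, r.lag = $lag
"""

pos = edge_key("FED_FUNDS", "FINANCIALS", "RATE_SENSITIVITY",
               ["REFLATION", "GOLDILOCKS"], "+")
neg = edge_key("FED_FUNDS", "FINANCIALS", "RATE_SENSITIVITY",
               ["SLOWDOWN", "CRISIS"], "-")
```

The two rows share indicator, sector, and edge type but produce different keys, so each one maps to its own relationship and a second seed apply matches the existing edges instead of creating new ones.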

4. The seed loader has a validate mode that does not require Neo4j

market-view seed validate parses every file, cross-references every FK (sector parent chain; indicator IDs in sector_maps; regime IDs in co_active_with and regime_conditional; ticker sectors against the GICS top-level), checks every enum, and exits non-zero on any error. CI runs this on every push. This catches typos before they hit the database.
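One of those cross-reference checks might look roughly like this. It is a sketch over already-parsed data, so it needs neither Neo4j nor a YAML library; the field names (indicator, sector, sign, regime_conditional), the allowed sign values, and the helper names are assumptions, not the loader's actual schema.

```python
from typing import Any, Dict, Iterable, List, Set

ALLOWED_SIGNS: Set[str] = {"+", "-"}  # assumed enum values


def validate_sector_maps(
    edges: Iterable[Dict[str, Any]],
    indicator_ids: Set[str],
    sector_ids: Set[str],
    regime_ids: Set[str],
) -> List[str]:
    """Return one message per broken reference or bad enum; empty means valid."""
    errors: List[str] = []
    for n, edge in enumerate(edges, start=1):
        if edge["indicator"] not in indicator_ids:
            errors.append(f"edge {n}: unknown indicator {edge['indicator']!r}")
        if edge["sector"] not in sector_ids:
            errors.append(f"edge {n}: unknown sector {edge['sector']!r}")
        for regime in edge.get("regime_conditional", []):
            if regime not in regime_ids:
                errors.append(f"edge {n}: unknown regime {regime!r}")
        if edge["sign"] not in ALLOWED_SIGNS:
            errors.append(f"edge {n}: invalid sign {edge['sign']!r}")
    return errors


def exit_code(errors: List[str]) -> int:
    """Non-zero when anything failed, which is what lets CI block the push."""
    return 1 if errors else 0


edges = [
    {"indicator": "FED_FUNDS", "sector": "FINANCIALS", "sign": "+",
     "regime_conditional": ["REFLATION"]},
    {"indicator": "FED_FUNDZ", "sector": "FINANCIALS", "sign": "-",
     "regime_conditional": ["CRISIS"]},  # deliberate typo in the indicator ID
]
errors = validate_sector_maps(
    edges, {"FED_FUNDS"}, {"FINANCIALS"}, {"REFLATION", "CRISIS"}
)
```

Collecting every error before exiting, instead of stopping at the first, lets a single CI run report all the typos in a PR.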

5. The seed is bounded by the AuraDB Free ceiling

Total nodes contributed by the seed: 11 sectors + ≈47 sub-sectors + 7 regimes + 54 indicators + 100 companies ≈ 220 nodes. Edges: 100 BELONGS_TO + ≈47 CHILD_OF + 2 CO_ACTIVE_WITH + 68 sector-map ≈ 220 edges. That is roughly 0.11% of the AuraDB Free 200 k node cap and 0.05% of the 400 k relationship cap, well under 0.5% of either, leaving room for ingested time-series fact nodes, events, and documents in later phases.
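The budget arithmetic can be checked in a few lines. The counts are the ones quoted above (with the approximate sub-sector figure taken as 47); the dictionary keys are just labels for this sketch.

```python
# Node and edge counts as quoted in this ADR.
nodes = {"sectors": 11, "sub_sectors": 47, "regimes": 7,
         "indicators": 54, "companies": 100}
edges = {"BELONGS_TO": 100, "CHILD_OF": 47, "CO_ACTIVE_WITH": 2,
         "sector_map": 68}

NODE_CAP, EDGE_CAP = 200_000, 400_000  # AuraDB Free caps cited above

total_nodes = sum(nodes.values())         # 219
total_edges = sum(edges.values())         # 217
node_pct = 100 * total_nodes / NODE_CAP   # about 0.11 %
edge_pct = 100 * total_edges / EDGE_CAP   # about 0.05 %
```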

Consequences

  • Adding a new indicator is a one-line YAML change, plus any sector_maps rows that reference it. PR review is the validation surface.
  • Renaming an indicator ID requires a coordinated change across indicators.yaml and every sector_maps row that references it. The validator catches stale references; we do not yet have a renamer.
  • Sector scope is intentionally limited to seven sectors in sector_maps.yaml. Adding the remaining four (Materials, Consumer Staples, Utilities, Real Estate) is a Phase-2 follow-up and tracked as a TODO.
  • The seed is stateful in the loader's sense, not the schema's sense. Deleting a row from a seed file does not delete the corresponding node from Neo4j on the next seed apply, because MERGE only creates or updates and never deletes. Removing entities requires a one-shot deletion script, which we will write only when needed.
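If that deletion script is ever needed, its core could be a set difference between the IDs in the seed files and the IDs already in the graph. The sketch below is hypothetical: the Indicator label, the id property, the example IDs, and the stale_ids helper are assumptions, and no such script exists yet.

```python
from typing import Iterable, List


def stale_ids(seed_ids: Iterable[str], graph_ids: Iterable[str]) -> List[str]:
    """IDs present in the graph but no longer present in the seed files."""
    return sorted(set(graph_ids) - set(seed_ids))


# Hypothetical Cypher; DETACH DELETE also removes the node's relationships.
DELETE_STALE = """
MATCH (n:Indicator) WHERE n.id IN $ids
DETACH DELETE n
"""

to_delete = stale_ids(
    seed_ids=["CPI_YOY", "FED_FUNDS"],
    graph_ids=["CPI_YOY", "FED_FUNDS", "OLD_PMI"],
)
```

Keeping the diff separate from the delete statement makes it easy to print the candidate list for review before anything is removed.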

Open questions deferred

  • Whether to extend gics_sectors.yaml to the full GICS taxonomy (~150 sub-industries). For now we only seed the levels that something else references.
  • Whether the S&P 100 list should be auto-refreshed against current OEX membership via a scheduled workflow. Yes eventually; out of scope for Phase 1.
  • Sub-industry assignment per ticker. Currently flat to top-level sector; per-company sub-industry can be enriched from EDGAR's sicDescription field at ingest time.