# Research & Decisions
Canonical decision log for the AI Workloads Platform. Each entry records the decision, rationale, and alternatives considered.
## R1: Provider Usage API Patterns
Decision: Use cursor-based incremental polling with per-provider adapters behind a shared Connector Protocol.
Rationale: Each provider exposes usage data differently:
| Provider | API | Token handling | Key type |
|---|---|---|---|
| OpenAI | `/v1/organization/usage` | Bucket-based hourly aggregates per model. All input tokens are uncached. | Admin API key (`sk-admin-`) |
| Anthropic | `/v1/messages` logs via admin API | Per-request records with a three-way token split: `input_tokens` (uncached), `cache_creation_input_tokens`, `cache_read_input_tokens` | Admin key with `usage:read` scope |
| OpenRouter | `/api/v1/usage` | Per-request or aggregated usage. Tokens treated as uncached. | Standard API key |
A Connector Protocol with `validate_key()`, `poll_usage()`, and `map_tokens()` methods abstracts these differences. Each connector stores a `sync_cursor` (last-seen timestamp or page token) for incremental retrieval.
**Alternatives considered**
- Unified SDK wrapper: No single SDK covers all three providers' admin/usage APIs.
- Webhook-based ingestion: None of the providers offer usage webhooks. Polling is the only option.
## R2: Idempotency and Deduplication Strategy
Decision: Coarse idempotency hash (`SHA-256(provider:org_id:model:bucket_start_hour)`) with upsert (last-write-wins).
Rationale: Provider APIs return aggregated usage per time bucket. If the same bucket is re-polled (network retry, reconciliation pass), the token counts may differ due to late-arriving data on the provider side. A reject-on-collision strategy would discard legitimate updates. Upsert with last-write-wins naturally handles reconciliation: the latest poll always has the most complete data.
The hash is coarse (per-model per-hour per-org) because providers aggregate at this granularity. Finer hashes would create phantom duplicates when bucket boundaries shift.
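The hash and the last-write-wins upsert can be sketched as follows; the table and column names in the SQL are illustrative, not the real schema:

```python
import hashlib


def idempotency_hash(provider: str, org_id: str, model: str,
                     bucket_start_hour: str) -> str:
    """Coarse per-provider, per-org, per-model, per-hour upsert key."""
    raw = f"{provider}:{org_id}:{model}:{bucket_start_hour}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Last-write-wins upsert keyed on the hash (PostgreSQL; names are illustrative):
UPSERT_SQL = """
INSERT INTO telemetry_buckets (idempotency_hash, input_tokens, output_tokens, polled_at)
VALUES ($1, $2, $3, now())
ON CONFLICT (idempotency_hash) DO UPDATE SET
    input_tokens  = EXCLUDED.input_tokens,
    output_tokens = EXCLUDED.output_tokens,
    polled_at     = EXCLUDED.polled_at
"""
```

Re-polling the same bucket produces the same key, so a later poll simply overwrites the earlier counts.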
**Alternatives considered**
- Per-request deduplication (fine-grained hash): Would require providers to expose stable request IDs, which not all do.
- Reject-on-collision: Would block reconciliation updates and require manual intervention.
## R3: Emissions Calculation Pipeline
Decision: Three-phase calculation (prefill → decode → cached) with versioned CarbonFactors lookup table.
Rationale: The pipeline:

- Tokens → Energy (J): `energy = tokens × energy_per_token_j`. Different rates for prefill (input processing), decode (output generation), and cached (cache reads at ~10% of prefill energy).
- Energy → kWh: `kwh = energy_j / 3_600_000`
- kWh → CO2 (kg): `co2 = kwh × grid_intensity_kg_per_kwh × PUE`
CarbonFactors is a versioned lookup table with model-to-tier mapping via `fnmatch` globs (e.g., `gpt-4*` → `tier_3`, `claude-3-haiku*` → `tier_1`). Each factors version is immutable: new versions are appended, never mutated. Calculations reference the `factors_version` used, enabling reproducibility.
Uncertainty bounds are calculated as ±(uncertainty_pct / 100) × co2_kg to communicate estimation confidence.
**Alternatives considered**
- Real-time hardware telemetry: Not available from API providers - we estimate based on published model architectures and data center PUE values.
- Per-request energy metering: Not offered by any provider. Token-based estimation is the industry standard (IEA, AI carbon footprint papers).
## R4: Receipt Signing and Verification
Decision: Ed25519 (via PyNaCl) with SHA-256 payload hash, key versioning, and public verification endpoint.
Rationale: Ed25519 is fast (signing < 1ms), produces compact signatures (64 bytes), and is widely supported.
Signing flow:
- Serialize receipt payload as canonical JSON (sorted keys, no whitespace)
- SHA-256 hash the payload → `payload_hash`
- Sign `payload_hash` with the Ed25519 private key → `signature`
- Store `signature`, `payload_hash`, `key_version`, and the `public_key` hex on the receipt
Key versioning (a `key_version` integer on each receipt) supports key rotation without invalidating old receipts. The public verification endpoint returns the receipt metadata, signature, and public key so anyone can verify independently using any Ed25519 library.
**Alternatives considered**
- RSA-2048: Slower signing, larger signatures, no advantage for this use case.
- HMAC: Not suitable - verification requires the shared secret, defeating public verifiability.
- Blockchain/on-chain: Over-engineered for this scale; adds cost and complexity without user benefit.
## R5: Billing Period Lifecycle
Decision: State machine (open → closing → closed → failed) with T+48h deferred receipt generation.
Rationale: The billing period lifecycle prevents race conditions between late-arriving telemetry and receipt generation:
| State | Description |
|---|---|
| `open` | Telemetry events accumulate. Normal state during the month. |
| `closing` | Triggered by the `invoice.payment_succeeded` webhook. A 48-hour window opens for reconciliation. |
| `closed` | After T+48h, the system aggregates CO2, retires credits, signs the receipt, and generates the PDF. No further changes to this period. |
| `failed` | Payment failed or insufficient credit inventory. No credits retired; the user is notified. |
Late-arriving telemetry after period close is recorded as a PriorPeriodAdjustment entry linked to the next open period.
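A minimal sketch of the state machine; the allowed transitions are my reading of the table, and the `open → failed` edge is an assumption for the payment-failed case:

```python
from enum import Enum


class PeriodState(str, Enum):
    OPEN = "open"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


# Legal transitions; anything else is rejected. closed and failed are terminal.
TRANSITIONS = {
    PeriodState.OPEN: {PeriodState.CLOSING, PeriodState.FAILED},
    PeriodState.CLOSING: {PeriodState.CLOSED, PeriodState.FAILED},
    PeriodState.CLOSED: set(),
    PeriodState.FAILED: set(),
}


def transition(current: PeriodState, target: PeriodState) -> PeriodState:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions at one choke point is what prevents a late poller from mutating a `closed` period; its data lands in a PriorPeriodAdjustment instead.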
**Alternatives considered**
- Immediate receipt on payment: Would miss 6-24 hours of late-arriving provider data, causing receipt inaccuracy.
- Longer reconciliation window (T+7d): Delays receipt delivery, reduces user trust.
## R6: Background Job Architecture
Decision: ARQ (async Redis queue) with dedicated worker process, not Celery.
Rationale: ARQ is async-native (runs on asyncio), lightweight, and uses Redis as its only dependency. Since the entire application is async (FastAPI + asyncpg + httpx), ARQ fits naturally. Celery would require a synchronous worker or celery-pool-asyncio hacks.
Job types:
| Job | Trigger | Description |
|---|---|---|
| `poll_all_connections` | Hourly cron | Iterates active connections, polls each via its connector, ingests telemetry |
| `trailing_reconciliation` | Daily at 03:00 UTC | Re-polls the last 24h window to catch late data |
| `close_billing_period` | T+48h after `invoice.payment_succeeded` | Aggregates, retires, signs, generates PDF |
| `generate_audit_pack` | Monthly on the 3rd | Bundles all receipts for the prior month into a zip |
Each job is idempotent. Failed jobs retry with exponential backoff (ARQ built-in). Permanent failures (5+ consecutive for a connection) disable the connection and alert.
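A configuration sketch of the worker, assuming ARQ's `cron` helper; the job bodies are elided and `max_tries` is shown as the retry ceiling:

```python
from arq import cron


async def poll_all_connections(ctx: dict) -> None:
    """Iterate active connections and ingest telemetry (body elided)."""


async def trailing_reconciliation(ctx: dict) -> None:
    """Re-poll the last 24h window to catch late data (body elided)."""


class WorkerSettings:
    # ARQ reads schedules and settings from this class.
    cron_jobs = [
        cron(poll_all_connections, minute=0),             # hourly
        cron(trailing_reconciliation, hour=3, minute=0),  # daily at 03:00 UTC
    ]
    # On-demand jobs (e.g. close_billing_period) are registered via functions = [...]
    max_tries = 5  # retry ceiling before a job is marked failed
```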
**Alternatives considered**
- Celery: Synchronous by default, heavier dependency (RabbitMQ or Redis + kombu), doesn't align with async stack.
- In-process BackgroundTasks: Blocks the event loop for CPU-bound work (PDF generation), no retry/scheduling.
- AWS Lambda: Adds deployment complexity, cold starts, separate codebase for workers.
## R7: Authentication Architecture
Decision: Clerk JWT verification via JWKS endpoint, no local user table.
Rationale: Clerk manages user signup, login, MFA, and org membership. The API validates JWTs by fetching Clerk's JWKS (cached with TTL). The JWT contains org_id and user_id claims. The API maps org_id to its own Organization table for billing/connection/telemetry scoping.
No local user table is needed - user profile data stays in Clerk. The API only needs org membership for authorization.
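A sketch of JWKS-based verification using PyJWT; the issuer URL is a placeholder, and the claim names follow the text but in practice depend on the Clerk JWT template:

```python
import jwt  # PyJWT


# Placeholder URL: Clerk publishes JWKS at <frontend-api>/.well-known/jwks.json.
JWKS_URL = "https://example.clerk.accounts.dev/.well-known/jwks.json"
_jwks_client = jwt.PyJWKClient(JWKS_URL)  # keys are fetched lazily on first use


def verify_clerk_token(token: str) -> dict:
    """Validate the JWT signature against Clerk's JWKS and return the claims."""
    signing_key = _jwks_client.get_signing_key_from_jwt(token).key
    # Audience/issuer checks would be configured per deployment.
    claims = jwt.decode(token, signing_key, algorithms=["RS256"])
    return claims  # expected to carry org_id and user_id claims
```

The returned `org_id` claim is then mapped to the local Organization row for scoping; no user record is ever written locally.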
**Alternatives considered**
- Self-hosted auth (e.g., Authlib + local users): More code to maintain, no MFA/SSO out of the box.
- Supabase Auth: Tighter coupling to Supabase; Clerk offers better org/team management.
## R8: PDF Generation
Decision: WeasyPrint + Jinja2 HTML template, rendered in the ARQ worker.
Rationale: WeasyPrint renders HTML/CSS to PDF with full CSS3 support. The receipt template is a Jinja2 HTML file with branded styling. The worker renders the PDF and uploads to S3 (or serves directly). This runs in the worker process because WeasyPrint is CPU-bound and would block the API event loop.
System dependencies: pango, cairo, gdk-pixbuf (installed via apt in Docker).
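A sketch of the render path; the inline template is a stand-in for the real branded Jinja2 file, and WeasyPrint is imported lazily because of its system-library dependencies:

```python
from jinja2 import Environment

# Minimal inline stand-in for the branded receipt template file.
TEMPLATE = """<html><body>
<h1>Emissions Receipt {{ period }}</h1>
<p>Total: {{ co2_kg }} kg CO2</p>
</body></html>"""


def render_receipt_html(receipt: dict) -> str:
    return Environment(autoescape=True).from_string(TEMPLATE).render(**receipt)


def render_receipt_pdf(receipt: dict) -> bytes:
    # Lazy import: WeasyPrint needs pango/cairo, and this CPU-bound call
    # belongs in the ARQ worker process, not the API event loop.
    from weasyprint import HTML

    return HTML(string=render_receipt_html(receipt)).write_pdf()
```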
**Alternatives considered**
- Puppeteer/Playwright: Requires a headless browser - much heavier runtime than WeasyPrint.
- ReportLab: Programmatic PDF construction is tedious for branded layouts. HTML/CSS is faster to iterate.
- External PDF service (e.g., DocRaptor): Adds external dependency, latency, and cost.
## R9: Data Retention and Cleanup
Decision: 24-month retention for telemetry events, permanent retention for receipts and audit packs.
Rationale: Telemetry events are high-volume and grow linearly with usage. 24 months covers typical compliance reporting windows (CSRD annual reports). Receipts and audit packs are legal documents and must be retained permanently (or per org policy).
A monthly cleanup job archives telemetry events older than 24 months to cold storage (S3 Parquet export) then deletes from PostgreSQL. This is a Phase 2 concern - not in the initial launch.
**Alternatives considered**
- No retention limit: Database grows unbounded, query performance degrades.
- 12-month retention: Too short for CSRD annual reporting needs.