Infrastructure Dashboard
The Infrastructure Dashboard is an admin page at /admin/infrastructure that provides a live health view of the platform’s GCP resources. It auto-detects the current environment and polls GCP APIs every 60 seconds.
For background on the underlying infrastructure, see Infrastructure Overview.
At a Glance
| Section | Data source | What it shows |
|---|---|---|
| Cloud Run Services | Cloud Run Admin API v2 | Health status, instance count, traffic %, latest revision |
| GKE Clusters | GKE Container API | Cluster health, node count, ready nodes, master version |
| Tenant Workloads | Kubernetes API (GKE) | Per-tenant namespace, deployment, ready/desired pod count, HTTPRoute hostnames |
| IAM & Service Accounts | GCS manifest | Declared service accounts, their roles, and scope (shared / per-environment / per-tenant) |
| Cloud Tasks Queues | Cloud Tasks API | Queue health, pending task count, oldest task age |
All sections degrade gracefully — a failed GCP API call returns null for that section only; other sections continue to render.
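The per-section isolation described above could be sketched roughly as follows. This is an illustration only: `safeSection`, `fetchCloudRun`, `fetchQueues`, and `buildStatus` are hypothetical names, not taken from the actual route, and a real route would likely also log the failure.

```typescript
// Hypothetical helper: run one section's fetcher and turn any failure into
// null, so a single failing GCP API call cannot break the other sections.
async function safeSection<T>(fetcher: () => Promise<T>): Promise<T | null> {
  try {
    return await fetcher();
  } catch {
    return null;
  }
}

// Illustrative stand-ins for the real per-source fetch functions.
function fetchCloudRun(): Promise<string[]> {
  return Promise.resolve(["example-service"]);
}

function fetchQueues(): Promise<string[]> {
  return Promise.reject(new Error("Cloud Tasks API unavailable"));
}

// Each section is wrapped independently, so Promise.all never rejects here.
async function buildStatus(): Promise<{
  cloudRun: string[] | null;
  queues: string[] | null;
}> {
  const [cloudRun, queues] = await Promise.all([
    safeSection(fetchCloudRun),
    safeSection(fetchQueues),
  ]);
  return { cloudRun, queues };
}
```

In this sketch the failing queue fetch yields `queues: null` while `cloudRun` still carries its data, which is the behaviour the dashboard relies on.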
Architecture
InfrastructurePage (Server Component)
│ resolveEnvironment(process.env)
│ → EnvironmentContext { id, label, projectId, region, manifestKey }
▼
InfrastructureDashboard (Client Component)
│ useInfrastructureStatus(env)
│ → GET /api/infrastructure/status?env={id} (every 60s)
▼
/api/infrastructure/status (Next.js API Route)
│ auth: GOOGLE_FIREBASE_ADMIN_KEY SA
├── GCS: gs://shokunin-480309-tfstate/infra-manifest/{env}.json (declared state)
├── Cloud Run Admin API v2 (live service status)
├── GKE Container API (live cluster status)
└── Cloud Tasks API (live queue stats)
Environment detection
The page detects which GCP environment it is serving by reading NEXT_PUBLIC_FIREBASE_DATABASE_ID at request time (Server Component):
| NEXT_PUBLIC_FIREBASE_DATABASE_ID value | Resolved environment | Manifest key |
|---|---|---|
| "production" | Production | production |
| "staging" | Staging | staging |
| "dev-*" or unset | Dev-shared | dev-shared |
The environment label is displayed in the EnvironmentBanner component at the top of the page.
API route authentication
GET /api/infrastructure/status authenticates to GCP using the GOOGLE_FIREBASE_ADMIN_KEY service account JSON (also used for Firebase Admin SDK operations). It falls back to GOOGLE_SERVICE_ACCOUNT_KEY when the Firebase admin key is absent (local dev).
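A minimal sketch of the key fallback, assuming both variables hold the raw service account JSON as a string (if they are encoded differently, the parsing step would need to change). The helper name `resolveServiceAccountKey` and the `ServiceAccountKey` shape are illustrative, not the route's actual code.

```typescript
// Subset of a service account key file; real key files contain more fields.
interface ServiceAccountKey {
  client_email: string;
  private_key: string;
  project_id?: string;
}

// Hypothetical helper: prefer GOOGLE_FIREBASE_ADMIN_KEY and fall back to
// GOOGLE_SERVICE_ACCOUNT_KEY (for example in local dev).
function resolveServiceAccountKey(
  env: Record<string, string | undefined>,
): ServiceAccountKey {
  // `||` also treats an empty string as "absent".
  const raw = env.GOOGLE_FIREBASE_ADMIN_KEY || env.GOOGLE_SERVICE_ACCOUNT_KEY;
  if (!raw) {
    throw new Error(
      "Set GOOGLE_FIREBASE_ADMIN_KEY or GOOGLE_SERVICE_ACCOUNT_KEY",
    );
  }
  const key = JSON.parse(raw) as Partial<ServiceAccountKey>;
  if (typeof key.client_email !== "string" || typeof key.private_key !== "string") {
    throw new Error("Service account key is missing client_email or private_key");
  }
  return { ...key, client_email: key.client_email, private_key: key.private_key };
}
```

Passing the environment in as an argument, rather than reading `process.env` inside the function, keeps the fallback order easy to unit test.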
The Firebase Admin SA requires three additional IAM roles beyond its Firebase permissions:
| Role | Used for |
|---|---|
| roles/run.viewer | List Cloud Run services and fetch revision/instance metadata |
| roles/container.viewer | Read GKE cluster status and node pool details |
| roles/cloudtasks.viewer | Query Cloud Tasks queue depth and state |
These roles are granted by terraform/platform/modules/iam/main.tf and applied as part of the dev-shared, staging, and production Terraform environments.
GCS Infrastructure Manifest
Every Terraform environment root writes an infra-manifest/{key}.json file to gs://shokunin-480309-tfstate/ on every apply. The dashboard reads this file to populate the “declared” side of each section — the intended resource configuration before live status is overlaid.
| Manifest key | Written by | Contents |
|---|---|---|
| dev-shared | platform/environments/dev-shared/manifest.tf | GKE cluster, Gateway, Provisioner URL/queue, SA emails, secret IDs |
| staging | platform/environments/staging/manifest.tf | Dolt VM, Beads API URL, SA emails |
| production | platform/environments/production/manifest.tf | Dolt VM, Beads API URL, SA emails |
| tenant-{id} | terraform/tenants/manifest.tf | Namespace, HTTPRoute names, agent/opencode endpoint URLs, workshop GSA email |
The GcpInfraManifest TypeScript type in domains/infrastructure/types.ts defines an array-based schema (cloudRunServices[], gkeClusters[], queues[], serviceAccounts[]). The Terraform manifest files currently write a nested object structure that does not match this schema; aligning them to the typed schema is in progress. Until then, declared-resource sections fall back to empty arrays and the dashboard displays live GCP API data only.
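The empty-array fallback might look something like the sketch below. The `GcpInfraManifest` interface here is a simplified stand-in (the element types are placeholders), and `normalizeManifest` is a hypothetical name rather than a function known to exist in manifest-parser.ts.

```typescript
// Simplified stand-in for the real GcpInfraManifest type; element types are
// placeholders for the actual declaration types.
interface GcpInfraManifest {
  cloudRunServices: unknown[];
  gkeClusters: unknown[];
  queues: unknown[];
  serviceAccounts: unknown[];
}

// Anything that is not an array (missing key, nested object, null) becomes [].
const asArray = (value: unknown): unknown[] =>
  Array.isArray(value) ? value : [];

// Hypothetical normalizer: accepts whatever JSON the manifest file contains
// and always returns the array-based shape the dashboard expects.
function normalizeManifest(raw: unknown): GcpInfraManifest {
  const obj: Record<string, unknown> =
    typeof raw === "object" && raw !== null
      ? (raw as Record<string, unknown>)
      : {};
  return {
    cloudRunServices: asArray(obj.cloudRunServices),
    gkeClusters: asArray(obj.gkeClusters),
    queues: asArray(obj.queues),
    serviceAccounts: asArray(obj.serviceAccounts),
  };
}
```

With this approach a manifest written in the current Terraform shape simply produces four empty arrays, and sections start filling in as individual keys are aligned to the schema.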
Dashboard Components
All components live in app/(platform)/admin/infrastructure/components/.
| Component | Props | Description |
|---|---|---|
| EnvironmentBanner | environment, fetchedAt, onRefresh, isRefreshing | Sticky banner showing environment label, last-fetch timestamp, and a manual refresh button |
| CloudRunCard | declaration, status | Animated arc gauge showing instance count; health pill; revision/traffic details |
| GkeClusterCard | declaration, status | Node count, ready node count, Kubernetes master version, raw status |
| TenantWorkloadsTable | workloads | Table of all tenant namespaces with deployment name, ready/desired pod count, and HTTPRoute hostname chips |
| IamSection | serviceAccounts | Collapsible table of declared service accounts, their email addresses, scope badge, and IAM roles |
| CloudTasksCard | declaration, status | Queue health, pending task count, oldest task age |
| ResourceSkeleton | (none) | Loading skeleton variants matching each card’s dimensions |
| StatusPulse | status | Animated pulsing dot for health state (healthy / degraded / unreachable / loading) |
Health states
All live GCP resources report one of four health states:
| State | Meaning |
|---|---|
| healthy | Resource available and operating normally |
| degraded | Resource exists but is experiencing issues (e.g. Cloud Run CONDITION_FAILED) |
| unreachable | Resource could not be reached or the GCP API call failed |
| loading | Status not yet fetched (client-side initial render only) |
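As a rough illustration, the four states can be modelled as a string union. The `sectionHealth` helper below is hypothetical and rests on an assumption the text does not confirm: that a section which came back as null (failed API call) is rendered as unreachable, and a section not yet fetched is rendered as loading.

```typescript
type HealthState = "healthy" | "degraded" | "unreachable" | "loading";

// Hypothetical shape: any live resource status that carries a health field.
interface HasHealth {
  health: HealthState;
}

// Assumed mapping: undefined = not fetched yet, null = API call failed.
function sectionHealth(section: HasHealth | null | undefined): HealthState {
  if (section === undefined) return "loading";
  if (section === null) return "unreachable";
  return section.health;
}
```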
Domain Layer
The Infrastructure Domain (domains/infrastructure/) provides the data layer for the dashboard.
GcpStatusRepository
Provider-agnostic interface (repositories/gcp-status-repository.ts):
interface GcpStatusRepository {
  getStatus(env: EnvironmentContext): Promise<InfrastructureStatus>;
}
The HTTP implementation (repositories/http/gcp-status-repository.ts) calls GET /api/infrastructure/status from the browser. Follow the factory function pattern established in domains/workshop/repositories/ when adding new implementations.
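A factory-style HTTP implementation could be sketched as below. This is not a copy of the pattern in domains/workshop/repositories/, which is not reproduced here. `createHttpGcpStatusRepository` and `FetchLike` are hypothetical names, and `InfrastructureStatus` is reduced to a placeholder type.

```typescript
interface EnvironmentContext {
  id: string;
  label: string;
  projectId: string;
  region: string;
  manifestKey: string;
}

// Placeholder: the real InfrastructureStatus type is richer than this.
type InfrastructureStatus = Record<string, unknown>;

interface GcpStatusRepository {
  getStatus(env: EnvironmentContext): Promise<InfrastructureStatus>;
}

// Minimal fetch-like signature so a fake can be injected in tests.
type FetchLike = (
  url: string,
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Hypothetical factory: returns an object that satisfies the interface and
// closes over the injected fetch implementation.
function createHttpGcpStatusRepository(fetchImpl: FetchLike): GcpStatusRepository {
  return {
    async getStatus(env) {
      const url = `/api/infrastructure/status?env=${encodeURIComponent(env.id)}`;
      const res = await fetchImpl(url);
      if (!res.ok) {
        throw new Error(`Status request failed with HTTP ${res.status}`);
      }
      return (await res.json()) as InfrastructureStatus;
    },
  };
}
```

In the browser the factory would typically be called with `(url) => fetch(url)`; in tests a stub that returns canned JSON is enough.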
useInfrastructureStatus
const { status, loading, error, refresh } = useInfrastructureStatus(env);
Fetches on mount, auto-refreshes every 60 seconds, and clears the interval on unmount. Loading and error state are managed separately so the dashboard can keep showing stale data alongside an error banner.
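The separation of loading and error state can be illustrated with a small framework-free reducer. This is a sketch of the idea, not the hook's actual implementation; `StatusState`, `StatusEvent`, and `reduceStatus` are hypothetical names.

```typescript
interface StatusState<T> {
  status: T | null; // last successfully fetched data, possibly stale
  loading: boolean;
  error: string | null;
}

type StatusEvent<T> =
  | { type: "fetch_started" }
  | { type: "fetch_succeeded"; payload: T }
  | { type: "fetch_failed"; message: string };

function reduceStatus<T>(
  state: StatusState<T>,
  event: StatusEvent<T>,
): StatusState<T> {
  switch (event.type) {
    case "fetch_started":
      // Keep existing data and error visible while the refresh is in flight.
      return { ...state, loading: true };
    case "fetch_succeeded":
      return { status: event.payload, loading: false, error: null };
    case "fetch_failed":
      // Keep the previous status so the UI can show stale data plus an error.
      return { ...state, loading: false, error: event.message };
  }
}
```

Because a failed refresh only sets `error` and leaves `status` untouched, the cards can continue rendering the last good snapshot while the banner reports the failure.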
resolveEnvironment
Pure function (utils/environment-resolver.ts) — accepts a process.env snapshot and returns EnvironmentContext. Fully unit-tested via environment-resolver.test.ts.
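A simplified sketch of the resolution rules, based only on the environment detection table. It returns a reduced context: `projectId` and `region` are left out because the text does not say where they come from, using the manifest key as `id` is an assumption, and so is treating any unrecognized value as dev-shared. The name `resolveEnvironmentSketch` is hypothetical.

```typescript
type ManifestKey = "production" | "staging" | "dev-shared";

// Reduced version of EnvironmentContext (projectId and region omitted).
interface ResolvedEnvironment {
  id: ManifestKey;
  label: string;
  manifestKey: ManifestKey;
}

// Pure function: takes an env snapshot, performs no I/O, returns a new object.
function resolveEnvironmentSketch(
  env: Record<string, string | undefined>,
): ResolvedEnvironment {
  const databaseId = env.NEXT_PUBLIC_FIREBASE_DATABASE_ID;
  if (databaseId === "production") {
    return { id: "production", label: "Production", manifestKey: "production" };
  }
  if (databaseId === "staging") {
    return { id: "staging", label: "Staging", manifestKey: "staging" };
  }
  // Covers "dev-*" and unset; unrecognized values also land here (assumption).
  return { id: "dev-shared", label: "Dev-shared", manifestKey: "dev-shared" };
}
```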
manifest-parser.ts
Pure stateless utility functions for querying a GcpInfraManifest — getCloudRunDeclarations, getServiceAccounts, etc. No I/O, no side effects, fully unit-testable.
Adding a New Dashboard Section
- Add a new declaration type to domains/infrastructure/types.ts
- Add the live status type to the same file
- Add the declaration to GcpInfraManifest and the status to InfrastructureStatus
- Add a fetch function in app/api/infrastructure/status/route.ts (follow the per-source try/catch pattern)
- Add the Terraform manifest field in all three environment manifest.tf files
- Create a new card component in app/(platform)/admin/infrastructure/components/
- Wire it into InfrastructureDashboard