# Infrastructure Overview
This page provides a conceptual overview of the Shokunin Platform’s cloud infrastructure. It is intended to help contributors understand the deployment topology without requiring access to vendor dashboards or credentials.
For operational procedures, deployment steps, or secret management, refer to terraform/README.md in the repository.
## Infrastructure at a Glance
The Shokunin Platform uses two hosting providers:
| Provider | Role |
|---|---|
| Vercel | Hosts the Next.js frontend (serverless, edge-distributed) |
| GCP (Google Cloud Platform) | Provides data persistence, compute, storage, and AI model access |
## Architecture Diagram
The diagram below shows how the major components connect. It is derived from terraform/README.md.
## Component Overview

### Vercel — Frontend Hosting
The Next.js app (v0-shokunin-ai-platform) is deployed to Vercel. Vercel provides:
- Serverless Next.js runtime (App Router, API routes)
- Global CDN for static assets
- Preview deployments for pull requests
The frontend communicates with GCP services via HTTPS — it never talks to Dolt directly.
### GCP Cloud Run — Beads API
The Beads API runs on Cloud Run and bridges HTTPS requests (from Vercel and the bd CLI) to the Dolt MySQL database over the private VPC.
- Containerized; image stored in GCP Artifact Registry
- Authenticates inbound requests via the `X-Beads-API-Key` header
- Egresses to the Dolt VM via the Serverless VPC Access connector
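As a sketch of the header check above (the header name comes from this page; the function shape and key handling are illustrative, not the actual Beads API code):

```typescript
// Illustrative sketch of the inbound API-key check. The `X-Beads-API-Key`
// header name is documented on this page; everything else is an assumption.
function isAuthorized(
  headers: Record<string, string | undefined>,
  expectedKey: string,
): boolean {
  // HTTP header names are case-insensitive; normalize by checking both spellings.
  const presented = headers["x-beads-api-key"] ?? headers["X-Beads-API-Key"];
  return presented !== undefined && presented === expectedKey;
}
```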
The Beads API Cloud Run service exists in the staging and production environments only. The dev-shared environment was decommissioned. Local development uses a Docker Compose beads-api container instead.
### GCP Compute Engine — Dolt VM
The Dolt MySQL server runs on a GCE VM, not on Cloud Run. Dolt is a stateful MySQL-compatible database — it requires a persistent POSIX filesystem that Cloud Run cannot provide.
- Machine type: `e2-micro` (staging), `e2-small` (production)
- No public IP — reachable only from within the VPC
- Startup script configures Dolt as a systemd service on first boot
The Dolt VM and Filestore exist in the staging and production environments only. Local development uses a Docker Compose beads-backend container.
### GCP Filestore — Persistent Storage for Dolt
The Dolt VM mounts a Filestore NFS volume at /var/lib/dolt. This persists the database across VM reboots and re-provisions.
### GCP GKE Autopilot — Workshop Containers
Workshop containers (shokunin-agent + OpenCode) run on a GKE Autopilot cluster managed in the dev-shared environment. GKE Autopilot provisions nodes automatically — no node pool configuration required.
See GKE Autopilot and Workshop Container Ingress below for subdomain routing and the provisioning flow.
### GCP Cloud Run — Workshop Provisioner
The Shokunin Provisioner (infrastructure/shokunin-provisioner/) is a Cloud Run service that dynamically creates Workshop Deployments and Services in tenant Kubernetes namespaces.
- Triggered by `POST /api/workshops/[workshopId]/provision` via a Cloud Tasks queue
- Authenticates inbound Cloud Tasks requests via OIDC token (SA: `shokunin-{env}-provisioner-sa`)
- Creates a Kubernetes Deployment (shokunin-agent + OpenCode containers) and Service (ports 8090/4096) per workshop
- Updates Firestore provisioning state at `workshops/{workshopId}/provisioning/state`
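The provisioning-state document path can be sketched as a small helper (the path format is from this page; the helper name and payload shape are assumptions):

```typescript
// The `workshops/{workshopId}/provisioning/state` path format is documented on
// this page; the helper and the update payload shape are hypothetical.
type ProvisioningStatus = "queued" | "running" | "succeeded" | "failed";

function provisioningStatePath(workshopId: string): string {
  return `workshops/${workshopId}/provisioning/state`;
}

function statusUpdate(status: ProvisioningStatus) {
  // Hypothetical payload; the real document fields are not specified here.
  return { status, updatedAt: new Date().toISOString() };
}
```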
### Firebase / Firestore — Application Data
The platform uses Firebase services for application data:
| Service | Description |
|---|---|
| Firestore | Primary application database. Stores projects, workshops, agents, workflow nodes/edges, audit logs, user profiles, and provisioning state. |
| Firebase Auth | Authentication provider (planned: will be wired through the shokunin-auth adapter). |
The Next.js app connects to Firestore and Firebase Auth directly via the Firebase SDK (no backend proxy).
### GCP Secret Manager — Secrets
All platform secrets (API keys, database passwords, service account keys) are stored in GCP Secret Manager. Terraform is responsible for distributing them outward to Vercel, GitHub Actions, and Cloud Run — no secrets are set manually in those platforms.
See Secrets & Identity Management for the full design: namespace architecture, identity map, secret flow pipeline, and developer onboarding.
### GCP Artifact Registry — Container Images
Docker images for all services are stored in Artifact Registry. CI/CD (GitHub Actions) pushes new images on merge; GKE and Cloud Run pull from this registry.
| Image | Service |
|---|---|
| `workshop:latest` | Workshop container (shokunin-agent + OpenCode via supervisord) |
| `shokunin-provisioner:latest` | Workshop Provisioner Cloud Run service |
| `beads-api:latest` | Beads task-tracking API |
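Artifact Registry image references follow Google's standard `{region}-docker.pkg.dev/{project}/{repository}/{image}:{tag}` layout. A small helper, using placeholder values rather than the platform's real region/project/repository names:

```typescript
// Standard Artifact Registry Docker image-reference layout. All concrete
// values used with this helper are placeholders, not the platform's real ones.
function imageRef(
  region: string,
  project: string,
  repository: string,
  image: string,
  tag = "latest",
): string {
  return `${region}-docker.pkg.dev/${project}/${repository}/${image}:${tag}`;
}
```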
### GCP IAM — Access Control

Service accounts and IAM roles control access between GCP services. Key accounts:

- `shokunin-dev-platform-sa` — runs Terraform for all dev environments
- `shokunin-dev-gha-sa` — GitHub Actions CI/CD via Workload Identity Federation (no stored keys)
- `shokunin-dev-provisioner-sa` — Workshop Provisioner identity; receives Cloud Tasks OIDC tokens
- `shokunin-dev-firebase-admin` — Firebase Admin SDK; also holds `roles/run.viewer`, `roles/container.viewer`, and `roles/cloudtasks.viewer` for the Infrastructure Dashboard API route
- `shokunin-dev-vercel-caller-sa` — authenticates Vercel runtime → GCP APIs
See Secrets & Identity Management for the complete identity map and IAM model.
## Connectivity Summary

```
Users
  │
  ▼
Vercel (Next.js app)
  │ HTTPS
  ├──► GCP Cloud Run: Beads API ──MySQL (VPC)──► Dolt VM ──NFS──► Filestore
  │       (staging / production only)
  │
  ├──► Cloud Tasks queue → Shokunin Provisioner (Cloud Run)
  │          │
  │          └──► GKE: creates Workshop Deployment + Service
  │
  ├──► Firebase (Firestore SDK)
  │
  └──► Firebase (Auth SDK)

GKE Autopilot (dev-shared)
  └── Workshop Pods
        ├── shokunin-agent :8090 ◄── GKE Gateway → agent.<tenant>.<domain>
        └── opencode :4096 ◄── GKE Gateway → opencode.<tenant>.<domain>
```
All GCP resources are managed with Terraform in terraform/. The infrastructure is split across reusable modules and environment-scoped root configurations.
## Multi-tier environment structure
The platform uses three shared environment tiers plus per-developer sandboxes and per-tenant roots:
| Tier | Root | State prefix | Purpose |
|---|---|---|---|
| Dev-shared | platform/environments/dev-shared/ | platform/dev-shared | VPC, Artifact Registry, GKE Autopilot, GKE Gateway, Workshop Provisioner — deployed once, team-owned |
| Staging | platform/environments/staging/ | platform/staging | Full Beads stack (Dolt VM, Filestore, Cloud Run Beads API) for preview deployments |
| Production | platform/environments/production/ | platform/production | Full Beads stack for production |
| Developer sandbox | platform/environments/dev/ | platform/dev/<handle> | Per-developer Firestore database and Secret Manager access |
| Tenant | terraform/tenants/ | tenants/<tenant-id> | Per-tenant GKE namespace, HTTPRoutes, and Workshop workload |
## Module structure
Resources are organised into single-purpose modules under platform/modules/:
| Module | What it manages |
|---|---|
| `networking/` | VPC, subnet, Serverless VPC Access connector, firewall rules |
| `artifact-registry/` | Docker image registry |
| `storage/` | Filestore NFS instance (staging/production only) |
| `secret-manager/` | Secret Manager secrets for platform credentials |
| `iam/` | Service accounts and all IAM bindings |
| `bastion/` | Dolt VM and its reserved internal IP (staging/production only) |
| `cloud-run/` | Beads API Cloud Run service (staging/production only) |
| `gke-autopilot/` | GKE Autopilot cluster |
| `gke-gateway/` | Shared GKE Gateway resource (one per cluster) |
| `vercel-env/` | Pushes configuration and secrets to Vercel environment variables |
## Wrapper scripts

Always use `./terraform/scripts/tf` or `./terraform/scripts/tf-tenant` — they handle SA impersonation and config-file wiring automatically:

```shell
# Shared dev environment
./terraform/scripts/tf dev-shared plan
./terraform/scripts/tf dev-shared apply

# Personal sandbox
./terraform/scripts/tf dev <your-handle> init
./terraform/scripts/tf dev <your-handle> plan

# Tenant infrastructure
./terraform/scripts/tf-tenant <tenant-id> plan
./terraform/scripts/tf-tenant <tenant-id> apply
```

Always run `plan` and review the output before `apply`. Never run `terraform` directly — the wrapper script ensures the correct service account is used and configuration files are wired correctly.
For full setup instructions including how to request infrastructure access, see terraform/README.md.
## GCS infrastructure manifest

Every Terraform root writes an `infra-manifest/{key}.json` file to `gs://shokunin-480309-tfstate/` on every apply. This allows runtime services and the Infrastructure Dashboard to discover resource names, SA emails, and endpoints without re-reading Terraform state.
| Manifest key | Written by | Contains |
|---|---|---|
| `dev-shared` | `platform/environments/dev-shared/manifest.tf` | GKE cluster, Gateway, Provisioner URL/queue, SA emails, secret IDs |
| `staging` | `platform/environments/staging/manifest.tf` | Dolt VM, Beads API URL, SA emails |
| `production` | `platform/environments/production/manifest.tf` | Dolt VM, Beads API URL, SA emails |
| `tenant-{id}` | `terraform/tenants/manifest.tf` | Namespace, HTTPRoute names, agent/opencode endpoint URLs, workshop GSA email |
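Given a manifest key from the table above, the object URI is mechanical to derive (the bucket name and key layout are from this page; the helper itself is illustrative):

```typescript
// Bucket name and `infra-manifest/{key}.json` layout are documented on this
// page; this helper is a hypothetical convenience, not platform code.
const MANIFEST_BUCKET = "shokunin-480309-tfstate";

function manifestObjectUri(key: string): string {
  return `gs://${MANIFEST_BUCKET}/infra-manifest/${key}.json`;
}
```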
The GcpInfraManifest TypeScript type in domains/infrastructure/types.ts defines the intended array-based schema for manifests consumed by the Infrastructure Dashboard. The Terraform manifest files currently write a flat nested structure — aligning them to the typed schema is in progress. Until then, the dashboard’s declared-resource sections fall back to live GCP API data only.
## GKE Autopilot and Workshop Container Ingress
Workshop containers (shokunin-agent + OpenCode) run on GKE Autopilot and are exposed externally via a shared GKE Gateway (GKE Gateway API, not Ingress).
### Subdomain routing
Each tenant gets two public subdomains routed through the shared Gateway:
| Subdomain | Routes to | Port |
|---|---|---|
| `agent.<tenant>.<domain>` | Workshop service (shokunin-agent REST API) | 8090 |
| `opencode.<tenant>.<domain>` | Workshop service (OpenCode web API) | 4096 |
Routing is hostname-based — no path routing. TLS terminates at the Gateway (Certificate Manager managed certificates). SSE/streaming responses are not buffered (BackendLBPolicy timeout: 3600 s, x-accel-buffering: no).
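The hostname-to-port mapping can be sketched as a tiny routing function (the prefixes and ports are from the table above; the function itself is illustrative, not the Gateway's actual implementation):

```typescript
// Hostname-based routing sketch: the subdomain prefix selects the backend
// port. Prefixes/ports come from this page; the function is hypothetical.
function backendPortForHost(host: string): number | null {
  if (host.startsWith("agent.")) return 8090;    // shokunin-agent REST API
  if (host.startsWith("opencode.")) return 4096; // OpenCode web API
  return null; // unknown hostname — no route
}
```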
### agentUrl in Firestore

The `agentUrl` field stored on the Workshop Firestore document holds the external Gateway URL — not an in-cluster DNS address. Example: `https://agent.acme-corp.dev.shokunin.app`
This URL is written by the Next.js provision API route at provisioning time and used by the Agent Control Panel to target SSE connections, health polling, and deep links.
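The URL shape can be sketched as follows (the hostname pattern is from this page; the helper and its inputs are illustrative):

```typescript
// Derives the external Gateway URLs for a tenant. The hostname pattern
// (agent./opencode. + tenant + domain) is documented on this page; the
// helper itself is hypothetical.
function workshopUrls(tenant: string, domain: string) {
  return {
    agentUrl: `https://agent.${tenant}.${domain}`,
    opencodeUrl: `https://opencode.${tenant}.${domain}`,
  };
}
```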
Within a workshop pod, the two processes communicate directly via `localhost:4096` (OpenCode) and `localhost:8090` (shokunin-agent), since both run in the same pod.
### Workshop provisioning flow

```
User clicks "Provision" in Agent Control Panel
  │
  ▼
POST /api/workshops/[workshopId]/provision
  │  idempotency check (canStartProvisioning)
  │  writes status = "queued" to Firestore
  │  writes agentUrl to Workshop document
  ▼
Cloud Tasks queue: shokunin-{env}-workshop-provisioning
  │  OIDC-authenticated HTTP task (provisioner-sa)
  ▼
Shokunin Provisioner (Cloud Run)
  │  creates K8s Deployment (shokunin-agent + opencode containers)
  │  creates K8s Service (ports 8090, 4096)
  │  updates Firestore status → "succeeded"
  ▼
GKE Gateway routes traffic to the new pod
```
Provisioning states: `null` → `queued` → `running` → `succeeded` / `failed`. The API returns 200 (no-op) when the workshop is already queued or running, and 202 when newly queued.
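A minimal sketch of this idempotency rule (`canStartProvisioning` is named on this page, but its implementation here is an assumption; treating `failed` and `succeeded` as re-queueable is also an assumption):

```typescript
// Sketch of the provisioning idempotency check. States and status codes are
// from this page; the exact decision logic is a hypothetical reconstruction.
type ProvisioningState = null | "queued" | "running" | "succeeded" | "failed";

function canStartProvisioning(state: ProvisioningState): boolean {
  // Only refuse when a run is already queued or in flight.
  return state !== "queued" && state !== "running";
}

function provisionResponseCode(state: ProvisioningState): 200 | 202 {
  // 200 = no-op (already queued/running); 202 = newly queued.
  return canStartProvisioning(state) ? 202 : 200;
}
```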
```
platform/environments/dev-shared/
├── module "gke"          ← GKE Autopilot cluster (platform/modules/gke-autopilot/)
├── module "gke_gateway"  ← Shared GKE Gateway resource (platform/modules/gke-gateway/)
└── provisioner.tf        ← Cloud Tasks queue + Provisioner Cloud Run service

terraform/tenants/
├── module "namespace"       ← GKE namespace, KSA, GSA, resource quotas
├── module "gateway_routes"  ← Pre-created HTTPRoutes (agent.* + opencode.*)
└── module "workload"        ← Workshop Deployment + Service + BEADS_DOLT_* env vars
```
See terraform/AGENTS.md for Terraform conventions and module structure.
Each tenant’s Kubernetes resources are managed by a dedicated Terraform root at terraform/tenants/, applied independently per tenant.
### What it provisions
- GKE namespace — Kubernetes namespace `tenant-{id}`, Kubernetes Service Account (KSA), GCP Service Account (GSA) bound via Workload Identity, LimitRange, ResourceQuota
- Gateway routes — pre-creates HTTPRoute resources for `agent.{tenant}.{domain}` and `opencode.{tenant}.{domain}` before any workshop is provisioned, so DNS and TLS certificate issuance can begin immediately
- Workshop workload — Workshop Deployment and Service; injects `BEADS_DOLT_*` environment variables when `beads_dolt_host` is set in the tenant config
### Applying tenant infrastructure

```shell
./terraform/scripts/tf-tenant <tenant-id> plan
./terraform/scripts/tf-tenant <tenant-id> apply
```
Tenant configuration lives at `terraform/config/tenants/<tenant-id>.tfvars`.
### `BEADS_DOLT_*` environment variables
When a tenant config sets beads_dolt_host, the tenant-workload module injects five environment variables into the Workshop container:
| Variable | Purpose |
|---|---|
| `BEADS_DOLT_HOST` | MySQL host of the tenant’s Dolt instance |
| `BEADS_DOLT_PORT` | MySQL port (default: 3306) |
| `BEADS_DOLT_USER` | MySQL user |
| `BEADS_DOLT_PASSWORD` | MySQL password |
| `BEADS_DOLT_DATABASE` | Database name (e.g. `beads_staging`) |
These allow the shokunin-agent and bd CLI inside the workshop container to connect to the team’s shared Beads Dolt database.
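The conditional injection can be sketched as follows (the variable names, the `beads_dolt_host` trigger field, and the port default are from this page; the remaining config field names and empty-string defaults are assumptions):

```typescript
// Sketch of the tenant-workload env-var injection. Variable names and the
// 3306 default come from this page; the config shape beyond beads_dolt_host
// is hypothetical.
interface TenantBeadsConfig {
  beads_dolt_host?: string;
  beads_dolt_port?: number;
  beads_dolt_user?: string;
  beads_dolt_password?: string;
  beads_dolt_database?: string;
}

function beadsEnv(cfg: TenantBeadsConfig): Record<string, string> {
  if (!cfg.beads_dolt_host) return {}; // nothing injected without a host
  return {
    BEADS_DOLT_HOST: cfg.beads_dolt_host,
    BEADS_DOLT_PORT: String(cfg.beads_dolt_port ?? 3306),
    BEADS_DOLT_USER: cfg.beads_dolt_user ?? "",
    BEADS_DOLT_PASSWORD: cfg.beads_dolt_password ?? "",
    BEADS_DOLT_DATABASE: cfg.beads_dolt_database ?? "",
  };
}
```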
## Local vs. staging vs. production
| Component | Local (Docker Compose) | Staging | Production |
|---|---|---|---|
| Dolt/Beads | beads-backend container on localhost:3307 | GCP Dolt VM + Filestore | GCP Dolt VM + Filestore |
| Beads API | beads-api container on localhost:8080 | GCP Cloud Run | GCP Cloud Run |
| Next.js | bun run dev on localhost:3000 | Vercel (preview deployments) | Vercel (production) |
| Firestore | GCP Firestore dev-<handle> | GCP Firestore staging | GCP Firestore production |
| Workshop containers | Not applicable locally | GKE Autopilot (dev-shared cluster) | GKE Autopilot (dev-shared cluster) |
| Secrets | .env (populated by scripts/env-sync) | GCP Secret Manager | GCP Secret Manager |
This overview is intentionally high-level. For the full secrets and IAM design, see Secrets & Identity Management. For the live Infrastructure Dashboard, see Infrastructure Dashboard. For Terraform deployment instructions and operational runbooks, refer to terraform/README.md.