Infrastructure Overview

This page provides a conceptual overview of the Shokunin Platform’s cloud infrastructure. It is intended to help contributors understand the deployment topology without requiring access to vendor dashboards or credentials. For operational procedures, deployment steps, or secret management, refer to terraform/README.md in the repository.

Infrastructure at a Glance

The Shokunin Platform uses two hosting providers:
  • Vercel — hosts the Next.js frontend (serverless, edge-distributed)
  • GCP (Google Cloud Platform) — provides data persistence, compute, storage, and AI model access

Architecture Diagram

The diagram below shows how the major components connect; it is derived from terraform/README.md.

[Image: Platform Infrastructure Diagram]

Component Overview

Vercel — Frontend Hosting

The Next.js app (v0-shokunin-ai-platform) is deployed to Vercel. Vercel provides:
  • Serverless Next.js runtime (App Router, API routes)
  • Global CDN for static assets
  • Preview deployments for pull requests
The frontend communicates with GCP services via HTTPS — it never talks to Dolt directly.

GCP Cloud Run — Beads API

The Beads API runs on Cloud Run and bridges HTTPS requests (from Vercel and the bd CLI) to the Dolt MySQL database over the private VPC.
  • Containerized; image stored in GCP Artifact Registry
  • Authenticates inbound requests via X-Beads-API-Key header
  • Egresses to the Dolt VM via the Serverless VPC Access connector
The Beads API Cloud Run service exists in the staging and production environments only; the earlier dev-shared instance was decommissioned. Local development uses a Docker Compose beads-api container instead.
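As a minimal sketch of what an authenticated call looks like from a client such as the Next.js app or the bd CLI: the X-Beads-API-Key header name comes from this page, but the base URL and the /issues path below are hypothetical placeholders, not the real API surface.

```typescript
// Build a request carrying the Beads API key header. The service validates
// the key before bridging the request to Dolt over the private VPC.
// NOTE: the base URL and "/issues" path are illustrative assumptions.
function buildBeadsRequest(baseUrl: string, apiKey: string, path: string): Request {
  return new Request(new URL(path, baseUrl), {
    headers: { "X-Beads-API-Key": apiKey },
  });
}

const req = buildBeadsRequest("https://beads-api.example.run.app", "my-api-key", "/issues");
```

Requests without a valid key are rejected at the Cloud Run service, so the Dolt VM is never exposed to unauthenticated traffic.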

GCP Compute Engine — Dolt VM

The Dolt MySQL server runs on a GCE VM, not on Cloud Run. Dolt is a stateful MySQL-compatible database — it requires a persistent POSIX filesystem that Cloud Run cannot provide.
  • Machine type: e2-micro (staging), e2-small (production)
  • No public IP — reachable only from within the VPC
  • Startup script configures Dolt as a systemd service on first boot
The Dolt VM and Filestore exist in the staging and production environments only. Local development uses a Docker Compose beads-backend container.

GCP Filestore — Persistent Storage for Dolt

The Dolt VM mounts a Filestore NFS volume at /var/lib/dolt. This persists the database across VM reboots and re-provisions.

GCP GKE Autopilot — Workshop Containers

Workshop containers (shokunin-agent + OpenCode) run on a GKE Autopilot cluster managed in the dev-shared environment. GKE Autopilot provisions nodes automatically — no node pool configuration required. See GKE Autopilot and Workshop Container Ingress below for subdomain routing and the provisioning flow.

GCP Cloud Run — Workshop Provisioner

The Shokunin Provisioner (infrastructure/shokunin-provisioner/) is a Cloud Run service that dynamically creates Workshop Deployments and Services in tenant Kubernetes namespaces.
  • Triggered by POST /api/workshops/[workshopId]/provision via a Cloud Tasks queue
  • Authenticates inbound Cloud Tasks requests via OIDC token (SA: shokunin-{env}-provisioner-sa)
  • Creates a Kubernetes Deployment (shokunin-agent + OpenCode containers) and Service (ports 8090/4096) per workshop
  • Updates Firestore provisioning state: workshops/{workshopId}/provisioning/state
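A sketch of the Firestore write in the last step: the document path workshops/{workshopId}/provisioning/state comes from this page, while the payload shape (status plus an updatedAt timestamp) is an illustrative assumption, not the real schema.

```typescript
// Statuses used by the provisioning flow described on this page.
type ProvisioningStatus = "queued" | "running" | "succeeded" | "failed";

// Firestore document path the provisioner updates (from this page).
function provisioningStatePath(workshopId: string): string {
  return `workshops/${workshopId}/provisioning/state`;
}

// In the real service this would be a Firestore SDK set/update call;
// here we only build the write for illustration.
function stateUpdate(workshopId: string, status: ProvisioningStatus) {
  return { path: provisioningStatePath(workshopId), data: { status, updatedAt: Date.now() } };
}
```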

Firebase / Firestore — Application Data

The platform uses Firebase services for application data:
  • Firestore — primary application database. Stores projects, workshops, agents, workflow nodes/edges, audit logs, user profiles, and provisioning state.
  • Firebase Auth — authentication provider (planned: will be wired through the shokunin-auth adapter).
The Next.js app connects to Firestore and Firebase Auth directly via the Firebase SDK (no backend proxy).

GCP Secret Manager — Secrets

All platform secrets (API keys, database passwords, service account keys) are stored in GCP Secret Manager. Terraform is responsible for distributing them outward to Vercel, GitHub Actions, and Cloud Run — no secrets are set manually in those platforms. See Secrets & Identity Management for the full design: namespace architecture, identity map, secret flow pipeline, and developer onboarding.

GCP Artifact Registry — Container Images

Docker images for all services are stored in Artifact Registry. CI/CD (GitHub Actions) pushes new images on merge; GKE and Cloud Run pull from this registry.
  • workshop:latest — Workshop container (shokunin-agent + OpenCode via supervisord)
  • shokunin-provisioner:latest — Workshop Provisioner Cloud Run service
  • beads-api:latest — Beads task-tracking API

GCP IAM — Access Control

Service accounts and IAM roles control access between GCP services. Key accounts:
  • shokunin-dev-platform-sa — runs Terraform for all dev environments
  • shokunin-dev-gha-sa — GitHub Actions CI/CD via Workload Identity Federation (no stored keys)
  • shokunin-dev-provisioner-sa — Workshop Provisioner identity; receives Cloud Tasks OIDC tokens
  • shokunin-dev-firebase-admin — Firebase Admin SDK; also holds roles/run.viewer, roles/container.viewer, roles/cloudtasks.viewer for the Infrastructure Dashboard API route
  • shokunin-dev-vercel-caller-sa — authenticates Vercel runtime → GCP APIs
See Secrets & Identity Management for the complete identity map and IAM model.

Connectivity Summary

Users
  │
  ▼
Vercel (Next.js app)
  │ HTTPS
  ├──► GCP Cloud Run: Beads API ──MySQL (VPC)──► Dolt VM ──NFS──► Filestore
  │    (staging / production only)

  ├──► Cloud Tasks queue → Shokunin Provisioner (Cloud Run)
  │                               │
  │                               └──► GKE: creates Workshop Deployment + Service

  ├──► Firebase (Firestore SDK)

  └──► Firebase (Auth SDK)

GKE Autopilot (dev-shared)
  └── Workshop Pods
      ├── shokunin-agent :8090  ◄── GKE Gateway → agent.<tenant>.<domain>
      └── opencode       :4096  ◄── GKE Gateway → opencode.<tenant>.<domain>

Terraform

All GCP resources are managed with Terraform in terraform/. The infrastructure is split across reusable modules and environment-scoped root configurations.

Multi-tier environment structure

The platform uses three shared environment tiers plus per-developer sandboxes and per-tenant roots:
  • Dev-shared — root platform/environments/dev-shared/ (state prefix platform/dev-shared). VPC, Artifact Registry, GKE Autopilot, GKE Gateway, Workshop Provisioner; deployed once, team-owned.
  • Staging — root platform/environments/staging/ (state prefix platform/staging). Full Beads stack (Dolt VM, Filestore, Cloud Run Beads API) for preview deployments.
  • Production — root platform/environments/production/ (state prefix platform/production). Full Beads stack for production.
  • Developer sandbox — root platform/environments/dev/ (state prefix platform/dev/<handle>). Per-developer Firestore database and Secret Manager access.
  • Tenant — root terraform/tenants/ (state prefix tenants/<tenant-id>). Per-tenant GKE namespace, HTTPRoutes, and Workshop workload.

Module structure

Resources are organised into single-purpose modules under platform/modules/:
  • networking/ — VPC, subnet, Serverless VPC Access connector, firewall rules
  • artifact-registry/ — Docker image registry
  • storage/ — Filestore NFS instance (staging/production only)
  • secret-manager/ — Secret Manager secrets for platform credentials
  • iam/ — service accounts and all IAM bindings
  • bastion/ — Dolt VM and its reserved internal IP (staging/production only)
  • cloud-run/ — Beads API Cloud Run service (staging/production only)
  • gke-autopilot/ — GKE Autopilot cluster
  • gke-gateway/ — shared GKE Gateway resource (one per cluster)
  • vercel-env/ — pushes configuration and secrets to Vercel environment variables

Wrapper scripts

Always use ./terraform/scripts/tf or ./terraform/scripts/tf-tenant — they handle SA impersonation and config file wiring automatically:
# Shared dev environment
./terraform/scripts/tf dev-shared plan
./terraform/scripts/tf dev-shared apply

# Personal sandbox
./terraform/scripts/tf dev <your-handle> init
./terraform/scripts/tf dev <your-handle> plan

# Tenant infrastructure
./terraform/scripts/tf-tenant <tenant-id> plan
./terraform/scripts/tf-tenant <tenant-id> apply
Always run plan and review the output before apply. Never run terraform directly — the wrapper script ensures the correct service account is used and configuration files are wired correctly.
For full setup instructions including how to request infrastructure access, see terraform/README.md.

GCS infrastructure manifest

Every Terraform root writes an infra-manifest/{key}.json file to gs://shokunin-480309-tfstate/ on every apply. This allows runtime services and the Infrastructure Dashboard to discover resource names, SA emails, and endpoints without re-reading Terraform state.
  • dev-shared — written by platform/environments/dev-shared/manifest.tf: GKE cluster, Gateway, Provisioner URL/queue, SA emails, secret IDs
  • staging — written by platform/environments/staging/manifest.tf: Dolt VM, Beads API URL, SA emails
  • production — written by platform/environments/production/manifest.tf: Dolt VM, Beads API URL, SA emails
  • tenant-{id} — written by terraform/tenants/manifest.tf: namespace, HTTPRoute names, agent/opencode endpoint URLs, workshop GSA email
The GcpInfraManifest TypeScript type in domains/infrastructure/types.ts defines the intended array-based schema for manifests consumed by the Infrastructure Dashboard. The Terraform manifest files currently write a different, nested object structure; aligning them with the typed schema is in progress. Until then, the dashboard’s declared-resource sections fall back to live GCP API data only.
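As a sketch of how a runtime service locates a manifest object: the bucket name and the infra-manifest/{key}.json layout come from this page, but the helper itself is hypothetical, and actually reading the object would go through a GCS client (omitted here).

```typescript
// Bucket name from this page (the Terraform state bucket).
const MANIFEST_BUCKET = "shokunin-480309-tfstate";

// Derive the GCS object path for a given manifest key, e.g. "dev-shared",
// "staging", "production", or "tenant-{id}".
function manifestObjectPath(key: string): string {
  return `gs://${MANIFEST_BUCKET}/infra-manifest/${key}.json`;
}

manifestObjectPath("dev-shared");
// → "gs://shokunin-480309-tfstate/infra-manifest/dev-shared.json"
```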

GKE Autopilot and Workshop Container Ingress

Workshop containers (shokunin-agent + OpenCode) run on GKE Autopilot and are exposed externally via a shared GKE Gateway (GKE Gateway API, not Ingress).

Subdomain routing

Each tenant gets two public subdomains routed through the shared Gateway:
  • agent.<tenant>.<domain> — routes to the Workshop service (shokunin-agent REST API), port 8090
  • opencode.<tenant>.<domain> — routes to the Workshop service (OpenCode web API), port 4096
Routing is hostname-based — no path routing. TLS terminates at the Gateway (Certificate Manager managed certificates). SSE/streaming responses are not buffered (BackendLBPolicy timeout: 3600 s, x-accel-buffering: no).
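The hostname rule can be illustrated as a small lookup. This mirrors the routing table for clarity only; the real matching is done by HTTPRoute resources at the Gateway, not by application code.

```typescript
// Map a request hostname to the Workshop container port it is routed to.
// Matching is on the subdomain prefix, never on the URL path.
function portForHostname(hostname: string): number | null {
  if (hostname.startsWith("agent.")) return 8090;    // shokunin-agent REST API
  if (hostname.startsWith("opencode.")) return 4096; // OpenCode web API
  return null; // no route for other subdomains
}
```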

agentUrl in Firestore

The agentUrl field stored on the Workshop Firestore document holds the external Gateway URL — not an in-cluster DNS address. Example:
https://agent.acme-corp.dev.shokunin.app
This URL is written by the Next.js provision API route at provisioning time and used by the Agent Control Panel to target SSE connections, health polling, and deep links.
In-cluster services communicate directly via localhost:4096 (OpenCode) and localhost:8090 (shokunin-agent) since both processes run in the same pod.
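A sketch of assembling the external agentUrl, matching the example above. buildAgentUrl is a hypothetical helper for illustration, not the provision API route's actual code.

```typescript
// Build the external Gateway URL stored on the Workshop document.
// Tenant and domain segments follow the agent.<tenant>.<domain> scheme above.
function buildAgentUrl(tenantId: string, domain: string): string {
  return `https://agent.${tenantId}.${domain}`;
}

buildAgentUrl("acme-corp", "dev.shokunin.app");
// → "https://agent.acme-corp.dev.shokunin.app"
```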

Workshop provisioning flow

User clicks "Provision" in Agent Control Panel
  │
  ▼
POST /api/workshops/[workshopId]/provision
  │  idempotency check (canStartProvisioning)
  │  writes status = "queued" to Firestore
  │  writes agentUrl to Workshop document
  ▼
Cloud Tasks queue: shokunin-{env}-workshop-provisioning
  │  OIDC-authenticated HTTP task (provisioner-sa)
  ▼
Shokunin Provisioner (Cloud Run)
  │  creates K8s Deployment (shokunin-agent + opencode containers)
  │  creates K8s Service (ports 8090, 4096)
  │  updates Firestore status → "succeeded"
  ▼
GKE Gateway routes traffic to the new pod

Provisioning states: null → queued → running → succeeded / failed. The API returns 200 (no-op) when already queued or running, and 202 when newly queued.
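The idempotency rule can be sketched as follows. canStartProvisioning is named on this page, but its body here is an assumption derived only from the documented status-code behaviour.

```typescript
// States from the provisioning lifecycle described above.
type ProvisioningState = null | "queued" | "running" | "succeeded" | "failed";

// A provision request is a no-op while one is already in flight.
function canStartProvisioning(state: ProvisioningState): boolean {
  return state !== "queued" && state !== "running";
}

// 202 when newly queued, 200 (no-op) when already queued or running.
function provisionResponseCode(state: ProvisioningState): number {
  return canStartProvisioning(state) ? 202 : 200;
}
```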

Terraform module topology

platform/environments/dev-shared/
  ├── module "gke"              ← GKE Autopilot cluster (platform/modules/gke-autopilot/)
  ├── module "gke_gateway"      ← Shared GKE Gateway resource (platform/modules/gke-gateway/)
  └── provisioner.tf            ← Cloud Tasks queue + Provisioner Cloud Run service

terraform/tenants/
  ├── module "namespace"        ← GKE namespace, KSA, GSA, resource quotas
  ├── module "gateway_routes"   ← Pre-created HTTPRoutes (agent.* + opencode.*)
  └── module "workload"         ← Workshop Deployment + Service + BEADS_DOLT_* env vars
See terraform/AGENTS.md for Terraform conventions and module structure.

Tenant Terraform

Each tenant’s Kubernetes resources are managed by a dedicated Terraform root at terraform/tenants/, applied independently per tenant.

What it provisions

  1. GKE namespace — Kubernetes namespace tenant-{id}, Kubernetes Service Account (KSA), GCP Service Account (GSA) bound via Workload Identity, LimitRange, ResourceQuota
  2. Gateway routes — Pre-creates HTTPRoute resources for agent.{tenant}.{domain} and opencode.{tenant}.{domain} before any workshop is provisioned, so DNS and TLS certificate issuance can begin immediately
  3. Workshop workload — Workshop Deployment and Service; injects BEADS_DOLT_* environment variables when beads_dolt_host is set in the tenant config

Applying tenant infrastructure

./terraform/scripts/tf-tenant <tenant-id> plan
./terraform/scripts/tf-tenant <tenant-id> apply
Tenant configuration lives at terraform/config/tenants/<tenant-id>.tfvars.

BEADS_DOLT_* environment variables

When a tenant config sets beads_dolt_host, the tenant-workload module injects five environment variables into the Workshop container:
  • BEADS_DOLT_HOST — MySQL host of the tenant’s Dolt instance
  • BEADS_DOLT_PORT — MySQL port (default: 3306)
  • BEADS_DOLT_USER — MySQL user
  • BEADS_DOLT_PASSWORD — MySQL password
  • BEADS_DOLT_DATABASE — database name (e.g. beads_staging)
These allow the shokunin-agent and bd CLI inside the workshop container to connect to the team’s shared Beads Dolt database.
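As a sketch, the five variables above assemble into a MySQL connection config with the documented 3306 default. The DoltConfig shape is a generic illustration, not the agent's actual internals.

```typescript
interface DoltConfig {
  host: string;
  port: number;
  user: string;
  password: string;
  database: string;
}

// Build a connection config from BEADS_DOLT_* variables. They are only
// injected when beads_dolt_host is set in the tenant config, so a missing
// host means Beads connectivity is simply not configured.
function doltConfigFromEnv(env: Record<string, string | undefined>): DoltConfig | null {
  if (!env.BEADS_DOLT_HOST) return null;
  return {
    host: env.BEADS_DOLT_HOST,
    port: Number(env.BEADS_DOLT_PORT ?? "3306"), // documented default
    user: env.BEADS_DOLT_USER ?? "",
    password: env.BEADS_DOLT_PASSWORD ?? "",
    database: env.BEADS_DOLT_DATABASE ?? "",
  };
}
```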

Local vs. staging vs. production

  • Dolt/Beads — local: beads-backend container on localhost:3307; staging/production: GCP Dolt VM + Filestore
  • Beads API — local: beads-api container on localhost:8080; staging/production: GCP Cloud Run
  • Next.js — local: bun run dev on localhost:3000; staging: Vercel (preview deployments); production: Vercel (production)
  • Firestore — local: GCP Firestore dev-<handle>; staging: GCP Firestore staging; production: GCP Firestore production
  • Workshop containers — not applicable locally; staging/production: GKE Autopilot (dev-shared cluster)
  • Secrets — local: .env (populated by scripts/env-sync); staging/production: GCP Secret Manager

This overview is intentionally high-level. For the full secrets and IAM design, see Secrets & Identity Management. For the live Infrastructure Dashboard, see Infrastructure Dashboard. For Terraform deployment instructions and operational runbooks, refer to terraform/README.md.