Alethia Labs

Internal Architecture

Main loop, heartbeat mechanism, graceful shutdown, and the core shared package.

Runner is a single Go binary with a straightforward architecture: a main poll loop, a heartbeat goroutine, and shared libraries from the core package.

Main Loop

runner start
  ├── Register with Alethia API
  ├── Start heartbeat goroutine (every 30s)
  └── Poll loop (every 10s)
       ├── POST /api/jobs/claim
       ├── If job claimed:
       │    ├── Assume cloud credentials
       │    ├── Execute job (Terraform, ArgoCD, etc.)
       │    ├── Stream logs
       │    └── Report result
       └── If no job: sleep 10s, repeat

The poll loop runs on the main goroutine. The heartbeat runs as a background goroutine. Both share the Runner configuration.

Configuration

Runner receives its configuration via environment variables:

Variable                      Description
ALETHIA_WEB_ORIGIN            Base URL of the Alethia instance
ALETHIA_WORKER_ID             Unique Runner identifier (assigned at registration)
ALETHIA_WORKER_TOKEN          Authentication token for API calls
ALETHIA_WORKER_MODE           cloud-hosted or self-hosted
SUPABASE_S3_ENDPOINT          S3 endpoint for Terraform state
SUPABASE_S3_REGION            S3 region
SUPABASE_STORAGE_KEY_ID       S3 access key
SUPABASE_STORAGE_SECRET_KEY   S3 secret key
INFRACOST_API_KEY             Optional, for cost estimation
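
As an illustration of how these variables might be read at startup, here is a small Go sketch. The Config struct, loadConfig, and loadConfigFromEnv are hypothetical names, and treating every variable except INFRACOST_API_KEY as required is an assumption; the source only states that the Infracost key is optional.

```go
package main

import (
	"fmt"
	"os"
)

// Config mirrors the variables in the table above (hypothetical struct).
type Config struct {
	WebOrigin, WorkerID, WorkerToken, WorkerMode string
	S3Endpoint, S3Region, S3KeyID, S3SecretKey   string
	InfracostAPIKey                              string // optional
}

// loadConfig reads settings through getenv so it can be tested without
// touching the real process environment.
func loadConfig(getenv func(string) string) (Config, error) {
	var missing []string
	need := func(key string) string {
		v := getenv(key)
		if v == "" {
			missing = append(missing, key)
		}
		return v
	}
	cfg := Config{
		WebOrigin:       need("ALETHIA_WEB_ORIGIN"),
		WorkerID:        need("ALETHIA_WORKER_ID"),
		WorkerToken:     need("ALETHIA_WORKER_TOKEN"),
		WorkerMode:      need("ALETHIA_WORKER_MODE"),
		S3Endpoint:      need("SUPABASE_S3_ENDPOINT"),
		S3Region:        need("SUPABASE_S3_REGION"),
		S3KeyID:         need("SUPABASE_STORAGE_KEY_ID"),
		S3SecretKey:     need("SUPABASE_STORAGE_SECRET_KEY"),
		InfracostAPIKey: getenv("INFRACOST_API_KEY"),
	}
	if len(missing) > 0 {
		return Config{}, fmt.Errorf("missing required environment variables: %v", missing)
	}
	if cfg.WorkerMode != "cloud-hosted" && cfg.WorkerMode != "self-hosted" {
		return Config{}, fmt.Errorf("ALETHIA_WORKER_MODE must be cloud-hosted or self-hosted, got %q", cfg.WorkerMode)
	}
	return cfg, nil
}

// loadConfigFromEnv reads from the real process environment.
func loadConfigFromEnv() (Config, error) {
	return loadConfig(os.Getenv)
}
```

Collecting all missing keys before returning means a misconfigured Runner reports every problem in one error instead of one per restart.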

Heartbeat Mechanism

The heartbeat goroutine sends POST /api/workers/heartbeat every 30 seconds with:

  • Worker ID
  • Current status (ONLINE or DRAINING; OFFLINE on the final heartbeat during shutdown)
  • Currently executing job ID (if any)
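
A heartbeat body carrying these three fields could be encoded as shown below. The JSON field names (workerId, status, jobId) and the Heartbeat type are assumptions for illustration; the actual wire format is not specified here.

```go
package main

import "encoding/json"

// Heartbeat is a hypothetical request body for POST /api/workers/heartbeat.
type Heartbeat struct {
	WorkerID string `json:"workerId"`
	Status   string `json:"status"`          // e.g. ONLINE or DRAINING
	JobID    string `json:"jobId,omitempty"` // omitted when no job is running
}

// encodeHeartbeat serializes one heartbeat; pass an empty jobID when idle.
func encodeHeartbeat(workerID, status, jobID string) ([]byte, error) {
	return json.Marshal(Heartbeat{WorkerID: workerID, Status: status, JobID: jobID})
}
```

With an empty job ID, encodeHeartbeat("w-1", "ONLINE", "") produces {"workerId":"w-1","status":"ONLINE"}.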

Alethia uses the heartbeat to:

  • Mark Runners as ONLINE (last heartbeat within 60s) or OFFLINE (no heartbeat for more than 60s)
  • Detect dead Runners and requeue their jobs
  • Track which Runner is running which job

If a Runner crashes mid-job:

  1. Heartbeat stops arriving
  2. After 60 seconds, Alethia marks the Runner OFFLINE
  3. If the job's logs receive no updates for 5 minutes or more, Alethia marks the job FAILED
  4. The user can retry the job; Terraform state is preserved in S3

Graceful Shutdown

When Runner receives SIGINT or SIGTERM:

  1. Sets its status to DRAINING and stops accepting new jobs
  2. Sends a DRAINING heartbeat to Alethia
  3. If a job is in progress, waits for it to complete (up to a 10-minute grace period)
  4. Sends a final heartbeat with OFFLINE status
  5. Exits

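ECS Fargate sends SIGTERM when stopping a task, so Runners on Fargate attempt a graceful shutdown whenever a task is stopped. ECS follows up with SIGKILL once the task's stopTimeout elapses, so a job only gets the full grace period if that timeout allows it.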

API Client

The WorkerAPIClient handles all communication with Alethia:

Method                  Endpoint                            Purpose
ClaimJob()              POST /api/jobs/claim                Atomic job claiming
SendHeartbeat()         POST /api/workers/heartbeat         Keepalive
UpdateJobStatus()       PUT /api/jobs/{id}/status           Status transitions
SendLogs()              POST /api/jobs/{id}/logs            Log batch delivery
UploadPlanArtifact()    POST /api/jobs/{id}/plan-artifact   Plan JSON upload
DownloadPlanArtifact()  GET /api/jobs/{id}/plan-artifact    Plan JSON download

All requests include X-Worker-ID and X-Worker-Token headers for authentication.

Shared core Package

Runner and the Alethia CLI both import from packages/core/:

Package        Purpose
provisioner/   Bootstrap, deploy, destroy orchestration
terraform/     Terraform CLI wrapper (init, plan, apply, destroy)
cloud/aws/     AWS SDK operations (STS, resource discovery)
cloud/gcp/     GCP SDK operations (WIF, resource discovery)
cloud/azure/   Azure SDK operations (federated auth, resource discovery)
argocd/        ArgoCD installation and configuration
helm/          Helm chart operations
git/           Git clone, branch, commit, push
k8s/           kubectl operations
infracost/     Infracost CLI wrapper and output parsing

This shared package ensures both the CLI and Runner use the same provisioning logic.
