# Troubleshooting — Locus Build

> Companion guide to [`SKILL.md`](./SKILL.md). Covers platform architecture and common issues.

## When To Load

Load this file only for architecture questions, stuck deployments, routing issues, or common failure modes.

## Table of Contents

- [Platform Architecture](#platform-architecture)
- [Debugging 401 Unauthorized Errors](#debugging-401-unauthorized-errors)
- [Error Pattern Recognition](#error-pattern-recognition)
- [Common Issues](#common-issues)
- [Error Code Recovery](#error-code-recovery)
- [Common Mistakes](#common-mistakes)
- [Reporting Bugs](#reporting-bugs)

## Platform Architecture

```
Internet
    ↓
*.buildwithlocus.com (load balancer + wildcard SSL)
    ↓
Edge Router (matches svc-{id} subdomain)
    ↓
Internal DNS (service-{id}.locus.local)
    ↓
Container (on port 8080)
```

**Key Components:**
- **Control Plane:** REST API with state storage
- **Build Pipeline:** Containerized builds with image registry
- **Orchestration:** Workflow engine with status tracking
- **Runtime:** Managed containers with service discovery (port 8080)
- **Routing:** Edge router with auto-subdomain and custom domain support
- **Git Server:** Git HTTP backend for push deployments
- **Cleanup:** Scheduled background jobs for lifecycle management (failed deployments are stopped after 24h; addon deletions have a 7-day grace period)

## Debugging 401 Unauthorized Errors

**This is the most common issue agents encounter.** When you get a 401, do NOT assume complex permission systems or account enablement tiers. Follow this flowchart:

```
Got 401 "Unauthorized"
    │
    ▼
Step 1: Test token validity
    curl -s $BASE_URL/auth/whoami -H "Authorization: Bearer $TOKEN"
    │
    ├── 401 → Token is expired or invalid
    │         Get a fresh token:
    │         TOKEN=$(curl -s -X POST $BASE_URL/auth/exchange \
    │           -H "Content-Type: application/json" \
    │           -d '{"apiKey":"YOUR_API_KEY"}' | jq -r '.token')
    │         Then retry the original operation.
    │
    └── 200 (returns user info) → Token is valid
              │
              ▼
         Step 2: Check your request
              - Correct HTTP method? (POST vs GET)
              - Correct URL path? (typos, missing /v1 prefix)
              - Correct base URL? (beta vs production)
              - Authorization header present and formatted?
                "Authorization: Bearer $TOKEN" (not "Bearer: $TOKEN")
```
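
The flowchart above can be sketched as a small shell helper. This is a minimal sketch, not an official client: the function names `next_step_for_401` and `check_token` are hypothetical, and it assumes `BASE_URL` and `TOKEN` are already set in the environment.

```bash
# Hypothetical helper: map the HTTP status returned by /auth/whoami
# to the next step in the flowchart. Pure function, no network access.
next_step_for_401() {
  case "$1" in
    401) echo "refresh-token" ;;      # token expired or invalid: exchange the API key again
    200) echo "check-request" ;;      # token is valid: inspect method, path, base URL, header
    *)   echo "unexpected-status" ;;  # anything else is outside the flowchart
  esac
}

# Hypothetical wrapper: run Step 1 and print the suggested next step.
# Assumes BASE_URL and TOKEN are set; requires curl and network access.
check_token() {
  local status
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    "$BASE_URL/auth/whoami" -H "Authorization: Bearer $TOKEN")
  next_step_for_401 "$status"
}
```

Calling `check_token` right after a 401 tells you whether to refresh the token or to re-read the failing request. Keeping the decision in a pure function makes it easy to check without touching the API.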

**Key facts about Locus authentication:**
- There is **no two-tier permission system**. Every authenticated endpoint uses the same auth check.
- There is **no manual enablement** step beyond having a valid API key.
- If you can list projects, you can create environments and services — it's the same auth.
- The only gate beyond auth is **billing** (402, not 401) for service creation ($0.25 per service).
- JWTs expire after 30 days. Get a fresh token at the start of every debugging session.

## Error Pattern Recognition

Quick reference for diagnosing errors by pattern:

| Pattern | Likely cause | First step |
|---------|-------------|------------|
| **All endpoints return 401** | Token expired/invalid | `/auth/whoami` → refresh token |
| **All endpoints suddenly fail** | Token expired mid-session | Get fresh token, retry |
| **Some endpoints 401, others work** | Mixed base URLs (beta/prod) | Verify consistent `$BASE_URL` |
| **Service creation returns 402** | Insufficient credits | Check `GET /v1/billing/balance` |
| **Specific endpoint 404** | Wrong URL path or resource deleted | Check URL syntax, verify resource exists |
| **Deployment stuck** | Build/infra issue, not auth | Check deployment status and logs |
| **503 after deploy reaches healthy** | Service discovery delay (normal) | Wait 60 seconds, retry |

## Common Issues

**Deployment stuck in `queued`:**
- Check deployment record exists: `GET /v1/deployments/{id}`
- Verify the Step Functions state machine exists and is ACTIVE for this environment (the control plane must have a valid `STATE_MACHINE_ARN`)
- Check the control plane task role has `states:StartExecution` permission on the state machine ARN
- Check control plane logs for errors when the deployment was created — a `StartExecution` failure is logged but may not surface in the API response
- If the state machine was recently created/updated, ensure the definition references the correct DynamoDB table for the environment

**Auto-subdomain returns 503 (most common issue after first deploy):**
- **If within 60 seconds of `healthy` status:** This is the service discovery registration delay, **not a bug**. Wait 60 seconds and retry.
- Verify the subdomain uses hyphens, not underscores: `https://svc-abc123.buildwithlocus.com` (not `svc_abc123`)
- Ensure container is listening on port 8080 (and responds to HTTP requests if `healthCheckPath` is set)
- Check service status: `GET /v1/services/{id}` (must be `healthy`)
- Check runtime instances: `GET /v1/services/{id}?include=runtime`, then verify `runtime_instances.runningCount` > 0
- If 503 persists after 2+ minutes, the issue is likely with the container itself — check logs

**Build args not applied after redeploy:**
- `buildArgs` are only applied during **fresh builds** (new deployments from source). A `redeploy` (`POST /v1/services/{id}/redeploy`) skips the build phase and reuses the last successful image, so build arg changes won't take effect.
- **Fix:** Push a new commit to trigger a fresh build, or create a new deployment via `POST /v1/deployments`.
- See the [Redeploy vs. Fresh Deploy](./deployment-workflows.md#redeploy-vs-fresh-deploy) comparison table for details.

**Build failed:**
- Check `lastLogs` on the deployment: `GET /v1/deployments/{id}` — the response includes the last 20 log lines from the build or runtime phase
- If `lastLogs` doesn't show the root cause, fetch the full log output: `GET /v1/deployments/{id}/logs`
- Search for errors across the full log: `GET /v1/deployments/{id}/logs/search?pattern=?ERROR%20?FATAL%20?Exception&since=1h`
- Common build failures:
  - **Missing Dockerfile**: No `Dockerfile` in the repo root or the service's `rootDir`
  - **Private repo access denied**: GitHub App not installed or repo not granted — check `GET /v1/github/repo-access?repo=owner/repo`
  - **Dependency install failed**: `npm install`, `pip install`, or `go mod download` errors — check dependency versions and lockfiles
  - **Docker build error**: Syntax errors in Dockerfile, missing base image, or build-time failures
  - **Health check timeout** (only when `healthCheckPath` is set): Container starts but the health check endpoint doesn't return HTTP 200 within the timeout. Ensure the endpoint exists and the app listens on port 8080
  - **Port mismatch**: App not listening on port 8080 — Locus injects `PORT=8080` and routes to that port
  - **Architecture mismatch** (`exec format error`): Locus runs on ARM64 (AWS Graviton). Pre-built images must be built for `linux/arm64`. Build with `docker build --platform linux/arm64` if building on x86. Images built from source (GitHub or git push) are handled automatically

**Rollback says "No previous healthy deployment with an image found":**
- Rollback requires a *previous* healthy deployment with a persisted `imageUri`
- The very first deployment can't be rolled back (nothing to roll back to)
- `imageUri` is only persisted when a deployment completes the full pipeline to `healthy` — if all previous deployments failed, rollback won't work
- For git-push deploys: the image is persisted on success, so rollback works after the first successful deploy
- **Fix:** If rollback isn't available, trigger a fresh deployment (`POST /v1/deployments`) instead

**Restart fails with "no running ECS instances":**
- Deployment shows `healthy` but ECS tasks haven't started yet. `healthy` means the pipeline completed — ECS task startup is asynchronous
- Use `GET /v1/services/{id}?include=runtime` to check `runtime_instances.runningCount`
- If `runtime_instances.status` is `not_deployed`, wait 30-60 seconds and check again
- If tasks still aren't running after 60s, trigger a fresh deployment instead of restart
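
The wait-and-recheck steps above can be sketched as a small polling loop. Both function names are hypothetical. The check assumes `BASE_URL` already ends in `/v1`, that `TOKEN` is set, and that `runtime_instances.runningCount` sits at that path in the JSON response.

```bash
# Hypothetical generic helper: run a check command up to max_attempts times,
# sleeping delay_seconds between failed attempts.
# Returns 0 as soon as the check succeeds, 1 if every attempt fails.
poll_until() {
  local max_attempts="$1"
  local delay_seconds="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final attempt.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical check: succeeds once at least one runtime instance is running.
# Assumptions: BASE_URL ends in /v1, TOKEN is set, curl and jq are installed.
has_running_instances() {
  local count
  count=$(curl -s "$BASE_URL/services/$1?include=runtime" \
    -H "Authorization: Bearer $TOKEN" | jq -r '.runtime_instances.runningCount // 0')
  [ "$count" -gt 0 ] 2>/dev/null
}

# Usage: six attempts, ten seconds apart, before falling back to a fresh deployment.
#   poll_until 6 10 has_running_instances "$SERVICE_ID" || echo "trigger a fresh deployment"
```

Passing the check as a command keeps the loop reusable for other wait conditions, such as an addon reaching `available`.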

**Pre-built image fails health checks (port mismatch or architecture):**
- Most common cause: the image listens on port 80 (nginx, httpd defaults) but Locus routes to port 8080
- Locus injects `PORT=8080` — make sure your image reads `$PORT` or is explicitly configured for 8080
- For nginx: add `ENV PORT=8080` and `EXPOSE 8080` to the Dockerfile, and use an `envsubst` template in the nginx config that reads `$PORT`
- Second common cause: image built for `linux/amd64` but Locus runs on ARM64 (Graviton). Error will show `exec format error` in logs
- Fix: rebuild with `docker build --platform linux/arm64` or use a multi-arch base image
- Check logs: `GET /v1/deployments/{id}` — `lastLogs` will show the container output or health check timeout
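
Before pushing a pre-built image, its platform can be checked locally. This is a sketch under assumptions: the function names are hypothetical, and the wrapper needs the Docker CLI with the image already present on the machine.

```bash
# Hypothetical helper: succeed only for the platform Locus runs on.
is_arm64_platform() {
  [ "$1" = "linux/arm64" ]
}

# Hypothetical wrapper: inspect a local image and report whether it matches.
# Returns 0 on match, 1 on mismatch, 2 if the image could not be inspected.
check_image_platform() {
  local image="$1"
  local platform
  platform=$(docker image inspect --format '{{.Os}}/{{.Architecture}}' "$image") || return 2
  if is_arm64_platform "$platform"; then
    echo "ok: $image is $platform"
  else
    echo "mismatch: $image is $platform, expected linux/arm64"
    return 1
  fi
}

# Usage (the image name is a placeholder):
#   check_image_platform my-registry/my-app:latest
```

A mismatch here predicts the `exec format error` described above, so it is cheaper to catch it before deploying.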

**Deployment stuck in `deploying` with running containers (health check flapping):**
- Only applies when `healthCheckPath` is set. Container starts but the health check endpoint alternates between 200 and errors — ECS keeps replacing tasks
- Check logs for startup errors: `GET /v1/services/{id}/logs`
- Common causes:
  - App takes too long to start (cold JVM, large dependency loading) — increase `memory` or optimize startup
  - Health endpoint reports 200 before the app is fully initialized and then starts failing (e.g., `/` returns 200 while the code behind the health route crashes)
  - App runs out of memory during request handling — check for `OOMKilled` in logs
- If deployment is stuck for >10 minutes, cancel it: `POST /v1/deployments/{id}/cancel`

**Container keeps restarting / crash loop:**
- Check `lastLogs` on the failed deployment: `GET /v1/deployments/{id}`
- Common causes:
  - **OOM (Out of Memory):** Logs show `OOMKilled` or `Killed` — increase `runtime.memory` via `PATCH /v1/services/{id}`
  - **Missing environment variables:** App crashes because a required env var is not set — check `GET /v1/variables/service/{id}/resolved`
  - **Bad startCommand:** Syntax error or binary not found in the `startCommand` override — verify the command runs locally first
  - **Missing system dependencies:** Alpine images may need `apk add` for native modules (and `wget` if using `healthCheckPath`)
- After fixing the root cause, trigger a fresh deployment: `POST /v1/deployments`
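
Scanning logs for the signatures named in this guide can be scripted. The helper below is a hypothetical sketch: it matches only `exec format error`, `OOMKilled`, and `Killed`, and makes no claim about how other failures appear in the logs.

```bash
# Hypothetical helper: read log text on stdin and print a likely cause.
# Only the log signatures mentioned in this guide are recognized.
classify_logs() {
  local logs
  logs=$(cat)
  case "$logs" in
    *"exec format error"*) echo "architecture-mismatch" ;;  # image not built for linux/arm64
    *OOMKilled*|*Killed*)  echo "out-of-memory" ;;          # raise runtime.memory
    *)                     echo "unknown" ;;                # read the full logs manually
  esac
}

# Usage (assumes BASE_URL ends in /v1 and TOKEN is set); the raw response
# is piped through as text, so the shape of lastLogs does not matter:
#   curl -s "$BASE_URL/deployments/$DEPLOYMENT_ID" \
#     -H "Authorization: Bearer $TOKEN" | classify_logs
```

An `unknown` result only means none of these signatures appeared. It does not rule out missing environment variables or a bad `startCommand`.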

**Private repo clone failed:**
- Check access: `GET /v1/github/repo-access?repo=owner/repo` — if `accessible: false`, the GitHub App isn't installed or doesn't have access to that specific repo
- Direct the user to connect GitHub at **https://buildwithlocus.com/integrations** (NOT the raw GitHub App install URL)
- Verify installations: `GET /v1/github/installations` — confirms the GitHub App is connected
- If already connected but repo not accessible: the user may need to reconfigure permissions at **https://buildwithlocus.com/integrations** to include the specific repo

**Deployment fails immediately with `from-locusbuild` (no source code):**
- The `repo` field in `from-locusbuild` must be a **real GitHub repository** — Locus clones source code from it
- Using a fake or placeholder repo value (e.g., `"local/my-app"`) will cause the build to fail because there's nothing to clone
- If you have local code and no GitHub repo, do NOT use `from-locusbuild`. Instead, use the manual setup workflow:
  1. Create project + environment
  2. Create services with `source.type: "s3"` and `rootDir` pointing to each service's subdirectory
  3. Provision addons (Postgres/Redis) before the first push
  4. Add the Locus git remote and push: `git push locus main`
- See [git-deploy.md](./git-deploy.md) for the full local code workflow

**Monorepo build fails (nixpacks can't detect project at repo root):**
- Add a `.locusbuild` file at the repo root defining each service's `path` subdirectory
- Use `POST /v1/projects/from-repo` to set up the project from the `.locusbuild` file in one call
- See [monorepo.md](./monorepo.md) for the file format
- If `.locusbuild` exists but a service still fails: ensure the `path` directory contains a valid project (e.g., `package.json`, `go.mod`, `requirements.txt`)

**Connection to addon failed:**
- Poll addon status until `available`: `GET /v1/addons/{id}`
- Check resolved variables include connection string: `GET /v1/variables/service/{id}/resolved`
- Verify service is in same environment as addon

**Custom domain not working (BYOD):**
- Run verify: `POST /v1/domains/{id}/verify` — check `cnameVerified` and `certificateValidated`
- If `cnameVerified: false`: your DNS CNAME doesn't point to Locus yet (check `cnameTarget` in domain details)
- If `certificateValidated: false`: SSL cert not issued yet — ensure validation CNAME records are set
- Confirm domain is attached to service: `GET /v1/domains/{id}`
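
The verify checks above follow a fixed order, which the following sketch encodes. `domain_next_step` is a hypothetical helper. The usage comment assumes `cnameVerified` and `certificateValidated` are top-level fields of the verify response and that `BASE_URL` ends in `/v1`.

```bash
# Hypothetical helper: turn the two verify flags into the next action.
# Arguments are the literal strings "true" or "false".
domain_next_step() {
  local cname_verified="$1"
  local cert_validated="$2"
  if [ "$cname_verified" != "true" ]; then
    echo "point-cname-at-cnameTarget"      # DNS CNAME does not point to Locus yet
  elif [ "$cert_validated" != "true" ]; then
    echo "add-validation-cname-records"    # SSL certificate not issued yet
  else
    echo "confirm-domain-attached"         # both checks pass: verify the service attachment
  fi
}

# Usage (requires curl and jq; field locations are an assumption):
#   resp=$(curl -s -X POST "$BASE_URL/domains/$DOMAIN_ID/verify" \
#     -H "Authorization: Bearer $TOKEN")
#   domain_next_step "$(echo "$resp" | jq -r '.cnameVerified')" \
#                    "$(echo "$resp" | jq -r '.certificateValidated')"
```

The CNAME is checked first because certificate validation cannot be meaningfully diagnosed until DNS points at Locus.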

**Purchased domain not working:**
- Check registration status: `GET /v1/domains/{id}/registration-status`
- If `registering`: wait and poll again (can take up to 15 minutes)
- If `registered` but not routing: domain may need a few minutes for DNS propagation
- Confirm domain is attached to service: `GET /v1/domains/{id}`

## Error Code Recovery

The API returns structured error codes on 401 responses. Use these to determine the exact fix:

| Error Code | Cause | Fix Command |
|------------|-------|-------------|
| `AUTH_MISSING_TOKEN` | No `Authorization: Bearer` header sent | Add `-H "Authorization: Bearer $TOKEN"` to your request |
| `AUTH_TOKEN_EXPIRED` | JWT expired (tokens last 30 days) | `TOKEN=$(curl -s -X POST $BASE_URL/auth/exchange -H "Content-Type: application/json" -d '{"apiKey":"YOUR_API_KEY"}' \| jq -r '.token')` |
| `AUTH_TOKEN_INVALID` | Token is malformed or from wrong environment (beta vs prod) | Re-exchange your API key. Verify `$BASE_URL` matches your key's environment. |
| `AUTH_SERVICE_ERROR` | Authentication service temporarily unavailable | Wait 5 seconds and retry the same request |
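
The table above maps directly onto a lookup function. This is a hypothetical sketch: `auth_error_action` is not part of any Locus tooling, and it takes the error code as an argument because this guide does not specify which response field carries it.

```bash
# Hypothetical helper: map a structured 401 error code to a recovery action.
auth_error_action() {
  case "$1" in
    AUTH_MISSING_TOKEN) echo "add-authorization-header" ;;
    AUTH_TOKEN_EXPIRED) echo "exchange-api-key" ;;
    AUTH_TOKEN_INVALID) echo "exchange-api-key-and-verify-base-url" ;;
    AUTH_SERVICE_ERROR) echo "wait-5s-and-retry" ;;
    *)                  echo "unknown-code" ;;
  esac
}
```

A retry wrapper can branch on the printed action, refreshing the token for the two token codes and sleeping five seconds for `AUTH_SERVICE_ERROR`.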

**Non-401 errors:**

| HTTP Status | Cause | Fix |
|-------------|-------|-----|
| 402 | Insufficient credits for service creation | Add credits: `POST /v1/billing/pay` with `{"amount": 1, "apiKey": "your_key"}` |
| 404 | Resource not found | Verify the resource ID exists with a GET/list call |
| 503 after `healthy` | Service discovery registration delay | Wait 60 seconds after deployment reaches `healthy`, then retry |

For the full error decision tree and recovery scripts, see [agent-quickstart.md](./agent-quickstart.md#error-recovery-cheatsheet).

## Common Mistakes

**Wrong port:** Every Locus container must listen on port 8080. The platform injects `PORT=8080` automatically. If your app listens on a different port (e.g., 3000, 80), it won't receive traffic. Make sure your app reads the `PORT` environment variable.
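
For an app started from a shell entrypoint, reading `PORT` can look like the sketch below. `listen_port` is a hypothetical helper, and the server command in the usage comment is only an illustration.

```bash
# Hypothetical helper: print the port the app should listen on.
# Uses PORT when it is set and non-empty, otherwise falls back to 8080.
listen_port() {
  echo "${PORT:-8080}"
}

# Usage in an entrypoint script (illustrative server command):
#   exec python3 -m http.server "$(listen_port)"
```

The fallback keeps local runs working when `PORT` is unset, while on Locus the injected `PORT=8080` takes effect.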

**Wrong architecture for pre-built images:** Locus runs on ARM64 (AWS Graviton). Pre-built images must be built for `linux/arm64`. Build with `docker build --platform linux/arm64` on non-ARM machines. You'll see `exec format error` in logs if the architecture is wrong.

**Redis key collisions:** Each Redis addon gets its own database number, but if multiple services share the same Redis addon, use key prefixes (e.g., `auth:sessions:`, `cache:pages:`) to avoid collisions.
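
One way to keep prefixes consistent is to build every key through a single helper. The sketch below is hypothetical: `redis_key` and the `REDIS_URL` variable name in the usage comment are assumptions, not names defined by Locus.

```bash
# Hypothetical helper: build a namespaced Redis key as service:kind:id.
redis_key() {
  local service="$1"
  local kind="$2"
  local id="$3"
  echo "${service}:${kind}:${id}"
}

# Usage (assumes redis-cli is installed and REDIS_URL holds the connection string):
#   redis-cli -u "$REDIS_URL" SET "$(redis_key auth sessions u42)" "session-data"
```

Here `redis_key auth sessions u42` prints `auth:sessions:u42`, which matches the `auth:sessions:` prefix style shown above.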

**Forgot to redeploy after env var / addon changes:** Environment variables are injected at deploy time. After changing variables (`PUT /v1/variables/service/{id}`) or provisioning a new addon, you must trigger a new deployment for the service to pick up the changes.

**buildArgs not applied on redeploy:** `POST /v1/services/{id}/redeploy` reuses the last successful image and skips the build phase. `buildArgs` changes only take effect on fresh builds. Push a new commit or create a new deployment via `POST /v1/deployments` to apply build arg changes.

## Reporting Bugs

If you encounter a platform bug or unexpected behavior, file a bug report via the API:

```bash
# Create a bug report
curl -X POST https://api.buildwithlocus.com/v1/bug-reports \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Deployment stuck in queued state",
    "description": "Deployment deploy_abc123 has been queued for over 10 minutes with no progress.",
    "deploymentId": "deploy_abc123",
    "severity": "high"
  }'
```

Include `serviceId` and/or `deploymentId` when the bug relates to a specific resource. Use `metadata` for any extra context (logs, error messages, reproduction steps).

For platform status and updates, see: https://buildwithlocus.com
