Incident Response

This page covers how Dagy captures exceptions, provides diagnostic data, and supports incident investigation.

Exception Trace Capture

When an unhandled exception occurs during API request processing, Dagy automatically captures a full exception trace and writes it to S3.

Trace Format

Each trace is a JSON file stored in S3 with the following structure:

{
  "timestamp": "2026-01-15T10:30:00.123456Z",
  "request_id": "abc-123-def",
  "function_name": "dagy-api-prod",
  "exception_type": "ValueError",
  "exception_message": "Invalid flow spec",
  "traceback": "Traceback (most recent call last):\n  ...",
  "http_method": "POST",
  "http_path": "/runs",
  "org_id": "org_123",
  "event": { ... }
}
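As a rough illustration of working with this format, the sketch below reads a trace document and condenses it to a one-line summary. The helper name `summarize_trace` and the choice to treat every field shown above as required are assumptions made for this example; they are not part of Dagy.

```python
import json

# Top-level fields shown in the example trace above.
TRACE_FIELDS = (
    "timestamp", "request_id", "function_name", "exception_type",
    "exception_message", "traceback", "http_method", "http_path",
    "org_id", "event",
)


def summarize_trace(raw: str) -> str:
    """Return a one-line summary of a trace document (hypothetical helper)."""
    trace = json.loads(raw)
    # Assumption: all fields from the example are always present.
    missing = [name for name in TRACE_FIELDS if name not in trace]
    if missing:
        raise ValueError(f"trace is missing fields: {', '.join(missing)}")
    return (
        f"{trace['timestamp']} {trace['org_id']} "
        f"{trace['http_method']} {trace['http_path']} "
        f"{trace['exception_type']}: {trace['exception_message']}"
    )
```

A summary like this can be handy when scanning many downloaded traces before opening the full traceback of any one of them.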

S3 Storage Layout

Traces are scoped by organization for tenant isolation:

s3://<traces-bucket>/
  exceptions/
    <org_id>/
      2026/01/15/
        103000123456_abc-123-def.json
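The layout above implies that the file name combines the UTC time of day (hours, minutes, seconds, microseconds) with the request ID. The sketch below reconstructs a key from those parts. It is inferred from the single example shown here, and `trace_key` and `org_day_prefix` are hypothetical helpers rather than Dagy functions.

```python
from datetime import datetime, timezone


def org_day_prefix(org_id: str, ts: datetime) -> str:
    """Prefix holding all traces for one organization on one UTC day."""
    ts = ts.astimezone(timezone.utc)  # expects a timezone-aware datetime
    return f"exceptions/{org_id}/{ts:%Y/%m/%d}/"


def trace_key(org_id: str, request_id: str, ts: datetime) -> str:
    """Object key for a trace, assuming the HHMMSS + microseconds file name."""
    ts = ts.astimezone(timezone.utc)
    return f"{org_day_prefix(org_id, ts)}{ts:%H%M%S%f}_{request_id}.json"
```

Because the organization ID comes before the date in the key, listing by `org_day_prefix` only ever touches one tenant's objects.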

Each S3 object carries the metadata keys exception-type, http-method, http-path, request-id, timestamp, and org-id, so traces can be identified and filtered without downloading the full trace.

Configuration

Exception capture requires the DAGY_TRACES_BUCKET environment variable to be set. If not configured, exceptions are logged but not persisted to S3.
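The fallback described above can be pictured roughly as follows. This is a minimal sketch, not Dagy's implementation: the function names, the logger name, and the decision to treat an empty or whitespace-only value as unset are all assumptions.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger("dagy.traces")  # hypothetical logger name


def resolve_traces_bucket(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the configured bucket name, or None when it is not set."""
    env = os.environ if env is None else env
    # Assumption: a blank value counts as "not configured".
    bucket = env.get("DAGY_TRACES_BUCKET", "").strip()
    return bucket or None


def persistence_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "s3" when traces can be persisted, otherwise "log-only"."""
    if resolve_traces_bucket(env) is None:
        logger.warning("DAGY_TRACES_BUCKET is not set; exceptions are logged only")
        return "log-only"
    return "s3"
```

Passing the environment as a mapping keeps the check easy to exercise in tests without modifying the real process environment.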

Investigating Exceptions

Super Admin Exception Management

Super admins can browse and manage exception traces across all organizations via the admin API:

# List recent exceptions
curl "https://api.dagy.io/admin/exceptions?limit=50" \
  -H "Authorization: Bearer <token>"

# View trace content
curl "https://api.dagy.io/admin/exceptions/content?key=exceptions/org_123/2026/01/15/trace.json" \
  -H "Authorization: Bearer <token>"

# Download trace file (presigned URL, 15-minute expiry)
curl "https://api.dagy.io/admin/exceptions/download?key=exceptions/org_123/2026/01/15/trace.json" \
  -H "Authorization: Bearer <token>"

# Delete traces
curl -X DELETE "https://api.dagy.io/admin/exceptions" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"keys": ["exceptions/org_123/2026/01/15/trace.json"]}'

See the Super Admin API Reference for full endpoint documentation.

Note: Deleting traces requires the s3:DeleteObject IAM permission on the traces bucket.

Audit Trail

All mutation operations are audit-logged with before/after state snapshots, actor identity, and IP address. This provides a complete timeline for incident investigation:

  • Resource changes: Flow deployments, schedule modifications, secret updates
  • Access events: Token creation, role changes, permission denials
  • Admin actions: Customer suspension, impersonation, session revocation

Query the audit trail:

curl "https://api.dagy.io/audit-logs?resource_type=run&limit=100" \
  -H "Authorization: Bearer <token>"
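For scripted investigations, the same query can be prepared from Python with only the standard library. The sketch below builds the request without sending it. The helper name `audit_logs_request` is hypothetical, and only the `resource_type` and `limit` parameters shown in the curl example are used.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.dagy.io"


def audit_logs_request(token: str, resource_type: str, limit: int = 100) -> Request:
    """Build (but do not send) a GET request for the audit log endpoint."""
    query = urlencode({"resource_type": resource_type, "limit": limit})
    return Request(
        f"{API_BASE}/audit-logs?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )
```

The returned object can be passed to `urllib.request.urlopen` to perform the call. The response schema is not described on this page, so parsing it is left out here.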

Escalation Workflow

  1. Detection: Alert rules trigger notifications via configured channels (Slack, email, webhook, PagerDuty)
  2. Triage: Check the run status and task-level errors via GET /runs/{run_id} and GET /runs/{run_id}/logs
  3. Diagnosis: Review exception traces for the failing org/flow via GET /admin/exceptions
  4. Resolution: Fix the root cause and redeploy, or cancel the stuck run via POST /runs/{run_id}/cancel
  5. Post-mortem: Review the audit log for the timeline of changes leading to the incident

Run Failure Investigation

API Runs

# Get run details and status
curl "https://api.dagy.io/runs/<run_id>" \
  -H "Authorization: Bearer <token>"

# Get run logs
curl "https://api.dagy.io/runs/<run_id>/logs" \
  -H "Authorization: Bearer <token>"

# Cancel a stuck run
curl -X POST "https://api.dagy.io/runs/<run_id>/cancel" \
  -H "Authorization: Bearer <token>"

Local Runs

For run_local() failures, check:

  • Run log: ~/.dagy/runs/<run_id>/run.log
  • Task logs: ~/.dagy/runs/<run_id>/task_runs/<task_id>/task.log
  • Task metadata: ~/.dagy/runs/<run_id>/task_runs/<task_id>/metadata.json
  • Artifacts: ~/.dagy/artifacts/<run_id>/<task_id>/
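The locations listed above can be collected with a small helper when scripting a local investigation. This is a sketch under the assumption that the directory layout is exactly as listed; `local_run_paths` is a hypothetical name, not a Dagy function.

```python
from pathlib import Path
from typing import Dict, Optional


def local_run_paths(run_id: str, task_id: str, home: Optional[Path] = None) -> Dict[str, Path]:
    """Map each diagnostic file or directory for a local run to its path."""
    root = (home if home is not None else Path.home()) / ".dagy"
    run_dir = root / "runs" / run_id
    task_dir = run_dir / "task_runs" / task_id
    return {
        "run_log": run_dir / "run.log",
        "task_log": task_dir / "task.log",
        "task_metadata": task_dir / "metadata.json",
        "artifacts": root / "artifacts" / run_id / task_id,
    }
```

For example, `[name for name, path in local_run_paths(run_id, task_id).items() if not path.exists()]` gives a quick view of which diagnostics were never written for a failed task.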

Common failure patterns:

Error                        Cause                               Resolution
TaskTimeoutError             Task exceeded timeout_seconds       Increase timeout or optimize task
TaskFailedError              Task raised an unhandled exception  Check task logs for stack trace
DependencyCycleError         Circular dependencies in DAG        Review task depends_on graph
ArtifactSerializationError   Task output not JSON-serializable   Return serializable data or use LocalArtifact