Incident Response
This page covers how Dagy captures exceptions, provides diagnostic data, and supports incident investigation.
Exception Trace Capture
When an unhandled exception occurs during API request processing, Dagy automatically captures a full exception trace and writes it to S3.
Trace Format
Each trace is a JSON file stored in S3 with the following structure:
{
  "timestamp": "2026-01-15T10:30:00.123456Z",
  "request_id": "abc-123-def",
  "function_name": "dagy-api-prod",
  "exception_type": "ValueError",
  "exception_message": "Invalid flow spec",
  "traceback": "Traceback (most recent call last):\n ...",
  "http_method": "POST",
  "http_path": "/runs",
  "org_id": "org_123",
  "event": { ... }
}
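As a rough illustration of working with this format, the sketch below parses a trace and condenses it to one line for triage. The `summarize_trace` helper and the sample payload are hypothetical; only the field names come from the structure above, and the `event` field is omitted from the sample because its contents are not specified here.

```python
import json


def summarize_trace(raw: str) -> str:
    """Condense a trace JSON document into a single triage line."""
    trace = json.loads(raw)
    return (
        f"{trace['timestamp']} {trace['http_method']} {trace['http_path']} "
        f"[{trace['org_id']}] "
        f"{trace['exception_type']}: {trace['exception_message']}"
    )


# Sample payload built from the documented fields (hypothetical values).
sample = json.dumps(
    {
        "timestamp": "2026-01-15T10:30:00.123456Z",
        "request_id": "abc-123-def",
        "function_name": "dagy-api-prod",
        "exception_type": "ValueError",
        "exception_message": "Invalid flow spec",
        "traceback": "Traceback (most recent call last):\n ...",
        "http_method": "POST",
        "http_path": "/runs",
        "org_id": "org_123",
    }
)

print(summarize_trace(sample))
```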
S3 Storage Layout
Traces are scoped by organization for tenant isolation:
s3://<traces-bucket>/
  exceptions/
    <org_id>/
      2026/01/15/
        103000123456_abc-123-def.json
S3 object metadata includes: exception-type, http-method, http-path, request-id, timestamp, and org-id for efficient listing without downloading the full trace.
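Judging from the example, the object key appears to combine the organization ID, the UTC date, and a time-of-day stamp with microseconds followed by the request ID. The sketch below reconstructs a key under that assumption. `trace_key` is a hypothetical helper, not part of Dagy, and the naming rule is inferred from the single example shown here rather than from a specification.

```python
from datetime import datetime, timezone


def trace_key(org_id: str, request_id: str, timestamp: str) -> str:
    """Build the assumed S3 key for a trace from its identifying fields."""
    # Parse the ISO 8601 timestamp; the trailing "Z" marks UTC.
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    date_part = ts.strftime("%Y/%m/%d")   # e.g. 2026/01/15
    time_part = ts.strftime("%H%M%S%f")   # HHMMSS plus six microsecond digits
    return f"exceptions/{org_id}/{date_part}/{time_part}_{request_id}.json"


key = trace_key("org_123", "abc-123-def", "2026-01-15T10:30:00.123456Z")
print(key)
```

Because keys begin with `exceptions/<org_id>/<year>/<month>/<day>/`, listing by that prefix narrows results to one organization and one day.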
Configuration
Exception capture requires the DAGY_TRACES_BUCKET environment variable to be set. If not configured, exceptions are logged but not persisted to S3.
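A minimal sketch of that fallback decision is shown below. `resolve_traces_bucket` is a hypothetical helper, and treating an empty or whitespace-only value as unset is an assumption, not documented Dagy behavior.

```python
import os
from typing import Mapping, Optional


def resolve_traces_bucket(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the configured traces bucket, or None when capture should only log."""
    # Assumption: a blank value is treated the same as an unset variable.
    bucket = env.get("DAGY_TRACES_BUCKET", "").strip()
    return bucket or None


bucket = resolve_traces_bucket({"DAGY_TRACES_BUCKET": "my-traces-bucket"})
if bucket is None:
    print("DAGY_TRACES_BUCKET not set: exceptions are logged only")
else:
    print(f"Exception traces will be written to s3://{bucket}/exceptions/")
```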
Investigating Exceptions
Super Admin Exception Management
Super admins can browse and manage exception traces across all organizations via the admin API:
# List recent exceptions
curl "https://api.dagy.io/admin/exceptions?limit=50" \
  -H "Authorization: Bearer <token>"
# View trace content
curl "https://api.dagy.io/admin/exceptions/content?key=exceptions/org_123/2026/01/15/trace.json" \
  -H "Authorization: Bearer <token>"
# Download trace file (presigned URL, 15-minute expiry)
curl "https://api.dagy.io/admin/exceptions/download?key=exceptions/org_123/2026/01/15/trace.json" \
  -H "Authorization: Bearer <token>"
# Delete traces
curl -X DELETE "https://api.dagy.io/admin/exceptions" \
  -H "Authorization: Bearer <token>" \
  -d '{"keys": ["exceptions/org_123/2026/01/15/trace.json"]}'
See the Super Admin API Reference for full endpoint documentation.
Note: Deleting traces requires the s3:DeleteObject IAM permission on the traces bucket.
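For scripted cleanup, the delete call can be assembled with the Python standard library. The sketch below only builds the request and does not send it. `build_delete_request` is a hypothetical helper, and the `Content-Type: application/json` header is an assumption that the curl example above does not confirm.

```python
import json
import urllib.request

API_BASE = "https://api.dagy.io"


def build_delete_request(keys, token):
    """Build (but do not send) a DELETE request for the given trace keys."""
    body = json.dumps({"keys": list(keys)}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/admin/exceptions",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects a JSON content type for this body.
            "Content-Type": "application/json",
        },
        method="DELETE",
    )


req = build_delete_request(
    ["exceptions/org_123/2026/01/15/trace.json"], "<token>"
)
print(req.get_method(), req.full_url)
# To send it, pass the request to urllib.request.urlopen(req).
```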
Audit Trail
All mutation operations are audit-logged with before/after state snapshots, actor identity, and IP address. This provides a complete timeline for incident investigation:
- Resource changes: Flow deployments, schedule modifications, secret updates
- Access events: Token creation, role changes, permission denials
- Admin actions: Customer suspension, impersonation, session revocation
Query the audit trail:
curl "https://api.dagy.io/audit-logs?resource_type=run&limit=100" \
  -H "Authorization: Bearer <token>"
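When queries are generated programmatically, encoding the parameters avoids quoting mistakes. The sketch below rebuilds the URL from the example above. `audit_logs_url` is a hypothetical helper, and only the `resource_type` and `limit` parameters shown on this page are assumed to exist.

```python
from urllib.parse import urlencode

API_BASE = "https://api.dagy.io"


def audit_logs_url(resource_type: str, limit: int = 100) -> str:
    """Return the audit-log query URL for one resource type."""
    query = urlencode({"resource_type": resource_type, "limit": limit})
    return f"{API_BASE}/audit-logs?{query}"


url = audit_logs_url("run")
print(url)
```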
Escalation Workflow
- Detection: Alert rules trigger notifications via configured channels (Slack, email, webhook, PagerDuty)
- Triage: Check the run status and task-level errors via GET /runs/{run_id} and GET /runs/{run_id}/logs
- Diagnosis: Review exception traces for the failing org/flow via GET /admin/exceptions
- Resolution: Fix the root cause and redeploy, or cancel the stuck run via POST /runs/{run_id}/cancel
- Post-mortem: Review the audit log for the timeline of changes leading to the incident
Run Failure Investigation
API Runs
# Get run details and status
curl "https://api.dagy.io/runs/<run_id>" \
  -H "Authorization: Bearer <token>"
# Get run logs
curl "https://api.dagy.io/runs/<run_id>/logs" \
  -H "Authorization: Bearer <token>"
# Cancel a stuck run
curl -X POST "https://api.dagy.io/runs/<run_id>/cancel" \
  -H "Authorization: Bearer <token>"
Local Runs
For run_local() failures, check:
- Run log: ~/.dagy/runs/<run_id>/run.log
- Task logs: ~/.dagy/runs/<run_id>/task_runs/<task_id>/task.log
- Task metadata: ~/.dagy/runs/<run_id>/task_runs/<task_id>/metadata.json
- Artifacts: ~/.dagy/artifacts/<run_id>/<task_id>/
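The locations listed above can be resolved in a script with a small helper such as the one sketched here. `local_run_paths` is hypothetical and not part of Dagy; it only mirrors the directory layout described on this page, and the run and task IDs in the example are made up.

```python
from pathlib import Path
from typing import Dict, Optional


def local_run_paths(
    run_id: str, task_id: str, home: Optional[Path] = None
) -> Dict[str, Path]:
    """Return the diagnostic file locations for one task of a local run."""
    root = (home or Path.home()) / ".dagy"
    task_dir = root / "runs" / run_id / "task_runs" / task_id
    return {
        "run_log": root / "runs" / run_id / "run.log",
        "task_log": task_dir / "task.log",
        "task_metadata": task_dir / "metadata.json",
        "artifacts": root / "artifacts" / run_id / task_id,
    }


paths = local_run_paths("run_42", "extract", home=Path("/home/alice"))
for name, path in paths.items():
    # Report which diagnostic files are actually present on this machine.
    print(f"{name}: {path} (exists: {path.exists()})")
```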
Common failure patterns:
| Error | Cause | Resolution |
|---|---|---|
| TaskTimeoutError | Task exceeded timeout_seconds | Increase timeout or optimize task |
| TaskFailedError | Task raised an unhandled exception | Check task logs for stack trace |
| DependencyCycleError | Circular dependencies in DAG | Review task depends_on graph |
| ArtifactSerializationError | Task output not JSON-serializable | Return serializable data or use LocalArtifact |