Monitoring

Dagy provides health checks, notification channels, and alert rules for monitoring your pipelines in production.

Health Checks

The API exposes a health check endpoint that reports the status of each infrastructure component:

curl https://api.dagy.io/health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "components": [
    {"name": "dynamodb", "status": "healthy", "message": "Connected"},
    {"name": "s3", "status": "healthy", "message": "Bucket configured"},
    {"name": "sqs", "status": "healthy", "message": "Queue configured"}
  ],
  "timestamp": "2026-01-15T10:30:00Z"
}

Component Status Values

Status      Meaning
healthy     Component is operational
degraded    Component is accessible but misconfigured (e.g., no table or bucket configured)
unhealthy   Component is unreachable or erroring

The overall status is healthy only if every component is healthy. If any component is unhealthy, the overall status is unhealthy; otherwise, any degraded component makes the overall status degraded.
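
The aggregation rule can be sketched in a few lines of Python. This is a minimal illustration, not Dagy's implementation: the helper name overall_status is hypothetical, and the sketch assumes that unhealthy outranks degraded when both are present.

```python
# Hypothetical helper: derives the overall status from component statuses.
# Assumption: "unhealthy" outranks "degraded" when both are present.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def overall_status(components):
    """Return the worst status among the components of a /health response."""
    worst = "healthy"
    for component in components:
        if SEVERITY[component["status"]] > SEVERITY[worst]:
            worst = component["status"]
    return worst


components = [
    {"name": "dynamodb", "status": "healthy"},
    {"name": "s3", "status": "degraded"},
    {"name": "sqs", "status": "healthy"},
]
print(overall_status(components))  # degraded
```

A monitoring script could apply the same ranking to decide whether a response warrants a page (unhealthy) or only a warning (degraded).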

Checked Components

Component   What It Checks
DynamoDB    Flows table is configured and accessible
S3          Artifact bucket is configured
SQS         Events queue URL is configured

Notification Channels

Dagy supports four notification channel types for delivering alerts:

Channel Type   Description
slack          Posts messages to a Slack webhook URL
email          Sends email notifications
webhook        Sends HTTP POST to a custom URL
pagerduty      Creates PagerDuty incidents

Managing Channels

Create a channel:

curl -X POST https://api.dagy.io/notifications/channels \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "channel_type": "slack",
    "name": "Pipeline Alerts",
    "config": {"webhook_url": "https://hooks.slack.com/services/..."}
  }'
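
The same request can be built from Python with the standard library. The sketch below is illustrative only: the helper name build_create_channel_request is hypothetical, the JSON content type is an assumption about what the API expects, and the request is constructed but not sent.

```python
import json
import urllib.request

API_BASE = "https://api.dagy.io"


def build_create_channel_request(token, name, webhook_url):
    """Build (but do not send) the POST request that creates a Slack channel."""
    payload = {
        "channel_type": "slack",
        "name": name,
        "config": {"webhook_url": webhook_url},
    }
    return urllib.request.Request(
        f"{API_BASE}/notifications/channels",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_create_channel_request(
    token="<token>",
    name="Pipeline Alerts",
    webhook_url="https://hooks.slack.com/services/...",
)
# To send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before it reaches the API.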

List channels:

curl https://api.dagy.io/notifications/channels \
  -H "Authorization: Bearer <token>"

Channels are scoped to the organization. See the Notifications API Reference for full endpoint documentation.

Alert Rules

Alert rules define when to send notifications and through which channels.

Trigger Types

Trigger         Description
on_failure      Fire when a flow run fails
on_success      Fire when a flow run succeeds
on_sla_breach   Fire when a run exceeds the sla_seconds threshold
on_retry        Fire when a task retries

Creating Alert Rules

curl -X POST https://api.dagy.io/notifications/rules \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ETL Failure Alert",
    "trigger": "on_failure",
    "flow_name": "daily_etl",
    "channel_ids": ["ch_abc123"],
    "sla_seconds": null
  }'

SLA Monitoring

Set trigger to "on_sla_breach" and provide an sla_seconds value to alert when a run exceeds its expected duration. The rule below fires when daily_etl runs longer than 1800 seconds (30 minutes):

curl -X POST https://api.dagy.io/notifications/rules \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "ETL SLA Breach",
    "trigger": "on_sla_breach",
    "flow_name": "daily_etl",
    "channel_ids": ["ch_abc123"],
    "sla_seconds": 1800
  }'
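
When rules are created from a script, it can help to check the trigger and SLA combination before calling the API. The following is a rough sketch: build_rule_payload is a hypothetical helper, and the client-side rule that on_sla_breach needs a positive sla_seconds is an assumption, not documented server behavior.

```python
# Hypothetical helper that builds the JSON body for POST /notifications/rules.
TRIGGERS = {"on_failure", "on_success", "on_sla_breach", "on_retry"}


def build_rule_payload(name, trigger, flow_name, channel_ids, sla_seconds=None):
    """Return a rule payload, checking the trigger/SLA combination first."""
    if trigger not in TRIGGERS:
        raise ValueError(f"unknown trigger: {trigger!r}")
    # Assumption: an SLA rule is only meaningful with a positive threshold.
    if trigger == "on_sla_breach" and (sla_seconds is None or sla_seconds <= 0):
        raise ValueError("on_sla_breach requires a positive sla_seconds")
    return {
        "name": name,
        "trigger": trigger,
        "flow_name": flow_name,
        "channel_ids": list(channel_ids),
        "sla_seconds": sla_seconds,
    }


payload = build_rule_payload(
    "ETL SLA Breach", "on_sla_breach", "daily_etl", ["ch_abc123"], sla_seconds=1800
)
```

The resulting dictionary matches the body of the curl example and can be serialized with json.dumps before sending.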

Audit Logging

All mutation operations are recorded in the audit log with before/after snapshots:

{
  "org_id": "org_123",
  "event_time": "2026-01-15T10:30:00Z#abc123",
  "resource_type": "flow",
  "resource_id": "daily_etl",
  "action": "deploy",
  "actor_email": "user@example.com",
  "before_json": null,
  "after_json": {"version": "3", "deployment": "prod"},
  "ip_address": "203.0.113.1"
}
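
In the sample entry, event_time carries a "#"-separated suffix after the ISO 8601 timestamp. A consumer that needs a real datetime could split it as sketched below. The helper name split_event_time is hypothetical, and treating everything after the first "#" as an opaque suffix is an assumption based only on the sample above.

```python
from datetime import datetime


def split_event_time(event_time):
    """Split an audit event_time into (timestamp, suffix)."""
    # Assumption: everything after the first "#" is an opaque suffix.
    timestamp_text, _, suffix = event_time.partition("#")
    # Replace the trailing "Z" so fromisoformat also works before Python 3.11.
    timestamp = datetime.fromisoformat(timestamp_text.replace("Z", "+00:00"))
    return timestamp, suffix


timestamp, suffix = split_event_time("2026-01-15T10:30:00Z#abc123")
print(timestamp.isoformat(), suffix)  # 2026-01-15T10:30:00+00:00 abc123
```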

Query the audit trail:

curl "https://api.dagy.io/audit-logs?resource_type=flow&limit=50" \
  -H "Authorization: Bearer <token>"

Querying the audit log requires the admin.audit permission. See the Audit Logs API Reference for full details.
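
For scripted queries, the URL can be assembled with the standard library so that filter values are encoded correctly. This is a small sketch: audit_logs_url is a hypothetical helper, and only the resource_type and limit parameters shown in the curl example are known to exist.

```python
from urllib.parse import urlencode

API_BASE = "https://api.dagy.io"


def audit_logs_url(**filters):
    """Build the audit-log query URL, skipping filters that are None."""
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return f"{API_BASE}/audit-logs?{query}" if query else f"{API_BASE}/audit-logs"


print(audit_logs_url(resource_type="flow", limit=50))
# https://api.dagy.io/audit-logs?resource_type=flow&limit=50
```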

Local Run Monitoring

When using run_local(), Dagy records run events and metadata locally:

Location                                     Content
~/.dagy/runs/<run_id>/run.log                Timestamped run log
~/.dagy/runs/<run_id>/metadata.json          Run and task metadata
~/.dagy/runs/<run_id>/task_runs/<task_id>/   Per-task logs and metadata
~/.dagy/dagy.duckdb                          Historical run data

For each task attempt, the local event recorder tracks start and end times, duration, status, and captured stdout/stderr.
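
Because these are ordinary files, local runs can be inspected with a short script. The sketch below relies only on the directory layout listed above. The helper names are hypothetical, and since the schema of metadata.json is not described here, the file is returned as a plain dictionary.

```python
import json
from pathlib import Path


def list_run_ids(base_dir=None):
    """Return the IDs of all runs recorded under <base>/runs, sorted by name."""
    base = Path(base_dir) if base_dir else Path.home() / ".dagy"
    runs_dir = base / "runs"
    if not runs_dir.is_dir():
        return []
    return sorted(p.name for p in runs_dir.iterdir() if p.is_dir())


def load_run_metadata(run_id, base_dir=None):
    """Load metadata.json for one local run as a dict."""
    base = Path(base_dir) if base_dir else Path.home() / ".dagy"
    with open(base / "runs" / run_id / "metadata.json", encoding="utf-8") as handle:
        return json.load(handle)


# Print the IDs of locally recorded runs (prints nothing if there are none).
for run_id in list_run_ids():
    print(run_id)
```

Both helpers accept an optional base_dir so they can be pointed at a copy of the directory, for example when examining runs collected from another machine.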