Operations
Scheduler Troubleshooting
This runbook covers:
- missed scheduler ticks
- schedule backlog growth
- catchup behavior verification
Use this for environments where:
- schedules are stored in dagy-<environment>-schedules
- EventBridge invokes dagy-<environment>-scheduler every minute
- flow execution is queued through the environment-scoped events queue
Quick Triage
Confirm these first:
- EventBridge rule dagy-<environment>-scheduler-tick is ENABLED
- Lambda dagy-<environment>-scheduler has recent invocations
- SQS queue for the active environment is receiving messages
- DAGY_SCHEDULES is set for the scheduler Lambda
AWS CLI checks:
aws events describe-rule --name "dagy-<environment>-scheduler-tick"
aws lambda get-function-configuration --function-name "dagy-<environment>-scheduler"
aws sqs get-queue-attributes --queue-url "$DAGY_EVENTS_QUEUE_URL" --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
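The commands above return full JSON documents. A minimal sketch that narrows each one to the field being triaged, using the CLI's --query and --output options, could look like the following. The function name triage_scheduler and its arguments are illustrative and not part of dagy.

```shell
# triage_scheduler <environment> <events-queue-url>
# Hypothetical helper: prints only the fields checked in Quick Triage.
triage_scheduler() {
  env_name="$1"
  queue_url="$2"

  # Rule state: expect ENABLED.
  echo "rule_state=$(aws events describe-rule \
    --name "dagy-${env_name}-scheduler-tick" \
    --query 'State' --output text)"

  # DAGY_SCHEDULES on the scheduler Lambda: expect a non-empty value.
  echo "dagy_schedules=$(aws lambda get-function-configuration \
    --function-name "dagy-${env_name}-scheduler" \
    --query 'Environment.Variables.DAGY_SCHEDULES' --output text)"

  # Visible messages on the events queue.
  echo "queue_visible=$(aws sqs get-queue-attributes \
    --queue-url "$queue_url" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' --output text)"
}
```

Usage: triage_scheduler <environment> "$DAGY_EVENTS_QUEUE_URL". Recent Lambda invocations still need to be confirmed separately, for example in the CloudWatch logs.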
Scenario 1: Missed Ticks
Symptoms:
- next_run_epoch is in the past but no new runs are created
- no recent scheduler Lambda logs/invocations
Checks:
- EventBridge rule state and target mapping
- Lambda permission for events.amazonaws.com
- scheduler Lambda errors in CloudWatch logs
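The target mapping and the invoke permission can both be read back from the CLI. The following is a sketch that assumes the resource names used in this runbook; the two function names are illustrative.

```shell
# rule_targets <environment>: ARNs the tick rule currently invokes.
rule_targets() {
  aws events list-targets-by-rule \
    --rule "dagy-$1-scheduler-tick" \
    --query 'Targets[].Arn' --output text
}

# scheduler_policy <environment>: resource policy of the scheduler Lambda.
# Look for a statement whose principal is events.amazonaws.com.
scheduler_policy() {
  aws lambda get-policy \
    --function-name "dagy-$1-scheduler" \
    --query 'Policy' --output text
}
```

rule_targets should print the ARN of the scheduler Lambda. If scheduler_policy fails because no resource policy exists, the EventBridge permission is most likely missing.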
Remediation:
- Re-enable the rule if disabled.
- Confirm target ARN points to dagy-<environment>-scheduler.
- Re-deploy the CDK stack if permission/target drift exists.
- Trigger one manual scheduler event to verify:
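# AWS CLI v2: add --cli-binary-format raw-in-base64-out so the raw JSON payload is accepted.
aws lambda invoke \
  --function-name "dagy-<environment>-scheduler" \
  --payload '{"dagy_event_type":"scheduler_tick","max_due":100}' \
  /tmp/scheduler-tick-response.json
# Print the scheduler's response payload.
cat /tmp/scheduler-tick-response.json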
Expected result payload:
- status=processed
- triggered greater than 0 when due schedules exist
Scenario 2: Backlog Growth
Symptoms:
- queue depth keeps rising
- due schedules are not draining
- run creation lags behind expected cadence
Checks:
- scheduler tick response counters: checked, triggered, failures
- SQS metrics for visible/in-flight messages
- API Lambda SQS consumer concurrency and error rate
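To tell a rising backlog from a momentary spike, it can help to sample the queue a few times rather than read it once. The sketch below polls visible and in-flight counts at a fixed interval; the function name, sample count, and delay are illustrative defaults, not values taken from dagy.

```shell
# sample_queue_depth <queue-url> [samples] [delay-seconds]
# Prints one line per sample: UTC time, visible count, in-flight count.
sample_queue_depth() {
  queue_url="$1"
  samples="${2:-5}"
  delay="${3:-60}"
  i=0
  while [ "$i" -lt "$samples" ]; do
    counts=$(aws sqs get-queue-attributes \
      --queue-url "$queue_url" \
      --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
      --query 'Attributes.[ApproximateNumberOfMessages, ApproximateNumberOfMessagesNotVisible]' \
      --output text)
    echo "$(date -u +%H:%M:%S) ${counts}"
    i=$((i + 1))
    # Skip the pause after the final sample.
    if [ "$i" -lt "$samples" ]; then
      sleep "$delay"
    fi
  done
}
```

Usage: sample_queue_depth "$DAGY_EVENTS_QUEUE_URL" 5 60. A visible count that climbs across samples while the in-flight count stays flat suggests the consumer, not the scheduler, is the bottleneck.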
Remediation:
- Increase scheduler batch size in tick payload (max_due) temporarily.
- Increase Lambda concurrency/capacity for API/SQS consumer.
- Verify failed schedules have last_error populated and fix root cause.
- Temporarily disable noisy schedules using POST /schedules with enabled=false.
Scenario 3: Catchup Behavior Issues
Catchup policy semantics:
- none: after a trigger, next run is computed from current time; backlog is skipped.
- all: after a trigger, next run advances from previous due timestamp; backlog is replayed.
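A small worked example can make the difference concrete. The sketch below models an interval schedule only and assumes that "computed from current time" means current time plus the interval; the function next_run is hypothetical and is not the scheduler's actual implementation.

```shell
# next_run <policy> <previous_due_epoch> <now_epoch> <interval_seconds>
# Hypothetical model of next_run_epoch after a trigger.
next_run() {
  case "$1" in
    none) echo $(( $3 + $4 )) ;;   # from current time: missed slots are skipped
    all)  echo $(( $2 + $4 )) ;;   # from previous due time: missed slots are replayed
    *)    echo "unknown catchup_policy: $1" >&2; return 1 ;;
  esac
}

# A schedule due at t=1000 with a 60 s interval, first triggered at t=1300.
next_run none 1000 1300 60   # prints 1360
next_run all  1000 1300 60   # prints 1060
```

Under all, the new value (1060) is still in the past, so the schedule is due again on the next tick and keeps firing until it catches up. Under none, the next run lands one interval after the trigger.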
Checks:
- inspect schedule payload: catchup_policy, mode, next_run_epoch
- compare expected next due time with actual value in schedule record
Remediation:
- Set catchup_policy=none to stop backlog replay.
- Set catchup_policy=all only when replay is desired and capacity is adequate.
- For interval schedules, ensure interval_seconds is realistic for workload size.
- For cron schedules, validate timezone and cron expression.
Temporary Safety Actions
Use these to stabilize production quickly:
- Disable a schedule:
  - Upsert same schedule_id with enabled=false
- Reduce blast radius:
  - lower max_due for scheduler tick payload
- Manual controlled trigger:
  - POST /schedules/{schedule_id}/trigger for one schedule at a time
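The manual controlled trigger can be scripted so that schedules fire one at a time with a pause in between. This is a sketch only: DAGY_API_URL is a hypothetical variable for the API base URL, the function name is illustrative, and any authentication headers your deployment requires are omitted.

```shell
# trigger_one_by_one <pause-seconds> <schedule_id>...
# Assumes DAGY_API_URL (hypothetical) holds the API base URL.
trigger_one_by_one() {
  pause="$1"
  shift
  for schedule_id in "$@"; do
    echo "triggering ${schedule_id}"
    # --fail makes curl return non-zero on HTTP errors, so the loop stops early.
    curl --silent --show-error --fail -X POST \
      "${DAGY_API_URL}/schedules/${schedule_id}/trigger" || return 1
    echo
    # Give the queue time to drain before the next trigger.
    sleep "$pause"
  done
}
```

Usage: trigger_one_by_one 30 schedule-a schedule-b. Stopping at the first failed request keeps a misbehaving schedule from being followed by further load.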
Recovery Validation
After remediation, confirm:
- scheduler Lambda runs every minute without error spikes
- due schedules decrease over time
- run creation rate aligns with schedule definitions
- no repeated last_error updates for the same schedule