Operations

Scheduler Troubleshooting

This runbook covers:

missed scheduler ticks
schedule backlog growth
catchup behavior verification

Naming. Resource names are env-suffixed (NameFactory in infrastructure/dagy_stack.py). Substitute your environment (develop, staging, production) for <env>.

Use this for environments where:

schedules are stored in dagy-schedules-<env>
EventBridge rule dagy-scheduler-tick-<env> (rate(1 minute)) invokes Lambda dagy-scheduler-<env> via its live-<env> alias
flow execution is queued through dagy-events-queue-<env>

Quick Triage

Confirm these first:

EventBridge rule dagy-scheduler-tick-<env> is ENABLED
Lambda dagy-scheduler-<env> has recent invocations
SQS queue dagy-events-queue-<env> is receiving messages
environment variable DAGY_SCHEDULES is set on the scheduler Lambda

AWS CLI checks:

aws events describe-rule --name "dagy-scheduler-tick-<env>"
aws lambda get-function-configuration --function-name "dagy-scheduler-<env>"
aws sqs get-queue-attributes --queue-url "$DAGY_EVENTS_QUEUE_URL" --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
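
If you save the JSON output of the three commands above, the triage checks can be applied mechanically. The following is a minimal sketch, not part of the dagy codebase: the function name triage_findings is hypothetical, and it assumes the standard AWS CLI response shapes (State at the top level of describe-rule, Environment.Variables in get-function-configuration, string-valued counts under Attributes in get-queue-attributes).

```python
from typing import Any, Dict, List


def triage_findings(
    rule: Dict[str, Any],
    function_config: Dict[str, Any],
    queue_attributes: Dict[str, Any],
    required_env_var: str = "DAGY_SCHEDULES",
) -> List[str]:
    """Return human-readable findings from the three CLI outputs (hypothetical helper)."""
    findings: List[str] = []

    # describe-rule reports the rule state in the top-level "State" field.
    state = rule.get("State")
    if state != "ENABLED":
        findings.append(f"EventBridge rule state is {state!r}, expected 'ENABLED'")

    # get-function-configuration nests variables under Environment.Variables;
    # the Environment key is absent when no variables are configured.
    variables = function_config.get("Environment", {}).get("Variables", {})
    if not variables.get(required_env_var):
        findings.append(f"{required_env_var} is not set on the scheduler Lambda")

    # get-queue-attributes returns counts as strings under "Attributes".
    attrs = queue_attributes.get("Attributes", {})
    visible = int(attrs.get("ApproximateNumberOfMessages", "0"))
    in_flight = int(attrs.get("ApproximateNumberOfMessagesNotVisible", "0"))
    if visible == 0 and in_flight == 0:
        # Not necessarily a fault: an empty queue is normal when consumers keep up.
        findings.append("queue shows no visible or in-flight messages")

    return findings
```

An empty list means all three checks passed. An empty queue alone is only suspicious when schedules are known to be due.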

Scenario 1: Missed Ticks

Symptoms:

next_run_epoch is in the past but no new runs are created
no recent scheduler Lambda logs/invocations

Checks:

EventBridge rule state and target mapping
Lambda permission for events.amazonaws.com
scheduler Lambda errors in CloudWatch logs

Remediation:

Re-enable the rule if disabled.
Confirm target ARN points to the dagy-scheduler-<env> alias (live-<env>).
Re-deploy the CDK stack if permission/target drift exists.
Trigger one manual scheduler event to verify:

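# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload; omit it on v1.
# --qualifier targets the same live-<env> alias that the EventBridge rule invokes.
aws lambda invoke \
  --function-name "dagy-scheduler-<env>" \
  --qualifier "live-<env>" \
  --cli-binary-format raw-in-base64-out \
  --payload '{"dagy_event_type":"scheduler_tick","max_due":100}' \
  /tmp/scheduler-tick-response.json
cat /tmp/scheduler-tick-response.json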

Expected result payload:

status is processed
triggered is greater than 0 when due schedules exist
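
The expectations above can be checked against the saved response file. This sketch assumes the response is a flat JSON object carrying status plus the checked, triggered, and failures counters named in this runbook. The actual shape may differ (for example, if the handler wraps its result), and check_tick_response is a hypothetical helper.

```python
import json
from typing import Any, Dict, List


def check_tick_response(raw: str, due_expected: bool) -> List[str]:
    """Return problems found in a scheduler tick response body (assumed flat JSON)."""
    payload: Dict[str, Any] = json.loads(raw)
    problems: List[str] = []

    status = payload.get("status")
    if status != "processed":
        problems.append(f"status is {status!r}, expected 'processed'")

    # triggered should be positive only when schedules were actually due.
    triggered = int(payload.get("triggered", 0))
    if due_expected and triggered == 0:
        problems.append("due schedules exist but triggered is 0")

    # Any failure count points at schedules whose last_error should be inspected.
    failures = int(payload.get("failures", 0))
    if failures > 0:
        problems.append(f"{failures} schedule(s) failed; inspect last_error")

    return problems
```

For example, pass the contents of /tmp/scheduler-tick-response.json as raw and set due_expected=True when at least one schedule has next_run_epoch in the past.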

Scenario 2: Backlog Growth

Symptoms:

queue depth keeps rising
due schedules are not draining
run creation lags behind expected cadence

Checks:

scheduler tick response counters: checked, triggered, failures
SQS metrics for visible/in-flight messages
API Lambda SQS consumer concurrency and error rate

Remediation:

Temporarily increase the scheduler batch size (max_due) in the tick payload.
Increase Lambda concurrency/capacity for the API/SQS consumer.
Check last_error on failed schedules and fix the root cause.
Temporarily disable noisy schedules using POST /schedules with enabled=false.
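
When deciding how far to raise max_due, a rough drain estimate helps. The sketch below assumes that max_due caps the number of schedules triggered per tick and that the rule fires once per minute, so one tick equals one minute. ticks_to_drain and its parameters are hypothetical names, and the result is an upper-level estimate that ignores consumer-side limits.

```python
from typing import Optional


def ticks_to_drain(backlog: int, max_due: int, new_due_per_tick: int = 0) -> Optional[int]:
    """Estimate ticks needed to clear a backlog of due schedules.

    Returns None when the backlog cannot shrink (arrivals >= max_due).
    """
    if backlog <= 0:
        return 0
    # Net progress per tick: schedules triggered minus schedules newly due.
    net = max_due - new_due_per_tick
    if net <= 0:
        return None
    # Integer ceiling division avoids floating-point rounding.
    return -(-backlog // net)
```

With a backlog of 1200 due schedules and 20 becoming due each minute, max_due=100 gives 15 ticks, while max_due=250 gives 6. A None result means raising max_due (or disabling schedules) is required before the backlog can drain at all.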

Scenario 3: Catchup Behavior Issues

Catchup policy semantics:

none: after a trigger, next run is computed from current time; backlog is skipped.
all: after a trigger, next run advances from previous due timestamp; backlog is replayed.
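
To make the two policies concrete, here is a minimal sketch for an interval schedule. It assumes the next run is exactly one interval_seconds after the reference timestamp. The scheduler's actual arithmetic (and its handling of cron schedules) may differ, and the function signature is hypothetical.

```python
def next_run_epoch(
    policy: str,
    previous_due_epoch: int,
    now_epoch: int,
    interval_seconds: int,
) -> int:
    """Compute the next due time after a trigger under a catchup policy (sketch)."""
    if policy == "none":
        # Anchor on the current time: any missed runs are skipped.
        return now_epoch + interval_seconds
    if policy == "all":
        # Anchor on the previous due time: missed runs stay due and are replayed.
        return previous_due_epoch + interval_seconds
    raise ValueError(f"unknown catchup_policy: {policy!r}")
```

Take a 60-second schedule that was due at epoch 1000 but triggered at 1300. Under none the next run is 1360. Under all it is 1060, which is already in the past, so the schedule is due again on a later tick and keeps replaying until next_run_epoch passes the current time.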

Checks:

inspect schedule payload: catchup_policy, mode, next_run_epoch
compare expected next due time with actual value in schedule record

Remediation:

Set catchup_policy=none to stop backlog replay.
Set catchup_policy=all only when replay is desired and capacity is adequate.
For interval schedules, ensure interval_seconds is realistic for workload size.
For cron schedules, validate timezone and cron expression.

Temporary Safety Actions

Use these to stabilize production quickly:

Disable a schedule: upsert the same schedule_id with enabled=false.
Reduce blast radius: lower max_due in the scheduler tick payload.
Manual controlled trigger: POST /schedules/{schedule_id}/trigger, one schedule at a time.

Recovery Validation

After remediation, confirm:

scheduler Lambda runs every minute without error spikes
due schedules decrease over time
run creation rate aligns with schedule definitions
no repeated last_error updates for the same schedule