Scheduler Troubleshooting

This runbook covers:

  • missed scheduler ticks
  • schedule backlog growth
  • catchup behavior verification

Use this for environments where:

  • schedules are stored in dagy-<environment>-schedules
  • EventBridge invokes dagy-<environment>-scheduler every minute
  • flow execution is queued through the environment-scoped events queue

Quick Triage

Confirm these first:

  • EventBridge rule dagy-<environment>-scheduler-tick is ENABLED
  • Lambda dagy-<environment>-scheduler has recent invocations
  • SQS queue for the active environment is receiving messages
  • DAGY_SCHEDULES is set for the scheduler Lambda

AWS CLI checks:

aws events describe-rule --name "dagy-<environment>-scheduler-tick"
aws lambda get-function-configuration --function-name "dagy-<environment>-scheduler"
aws sqs get-queue-attributes \
  --queue-url "$DAGY_EVENTS_QUEUE_URL" \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible
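
The checks above return full JSON documents. A minimal sketch that narrows each one to the field that matters with --query is shown here. The function name quick_triage is illustrative, and the sketch assumes that DAGY_SCHEDULES is a Lambda environment variable, which this runbook does not state explicitly.

```shell
# Sketch only: quick_triage is a hypothetical helper, not part of dagy.
quick_triage() {
  local env="${1:?usage: quick_triage <environment>}"

  # Rule state: prints ENABLED or DISABLED.
  echo "rule_state=$(aws events describe-rule \
    --name "dagy-${env}-scheduler-tick" \
    --query 'State' --output text)"

  # Assumes DAGY_SCHEDULES is a Lambda environment variable; prints None if unset.
  echo "dagy_schedules=$(aws lambda get-function-configuration \
    --function-name "dagy-${env}-scheduler" \
    --query 'Environment.Variables.DAGY_SCHEDULES' --output text)"

  # Visible and in-flight message counts, tab-separated.
  echo "queue_depth=$(aws sqs get-queue-attributes \
    --queue-url "$DAGY_EVENTS_QUEUE_URL" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query 'Attributes.[ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible]' \
    --output text)"
}
```

Call it with the environment name as the only argument, for example quick_triage staging, with DAGY_EVENTS_QUEUE_URL already exported.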

Scenario 1: Missed Ticks

Symptoms:

  • next_run_epoch is in the past but no new runs are created
  • no recent scheduler Lambda logs/invocations

Checks:

  • EventBridge rule state and target mapping
  • Lambda permission for events.amazonaws.com
  • scheduler Lambda errors in CloudWatch logs
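
Each of these checks maps to a read-only AWS CLI call. The sketch below is one way to run them together. It assumes the scheduler writes to the default /aws/lambda/<function-name> log group and that error lines contain the word ERROR; adjust both if your deployment differs.

```shell
# Sketch only: check_tick_wiring is a hypothetical helper name.
check_tick_wiring() {
  local env="${1:?usage: check_tick_wiring <environment>}"

  # Target ARNs attached to the tick rule; expect the scheduler Lambda ARN.
  aws events list-targets-by-rule \
    --rule "dagy-${env}-scheduler-tick" \
    --query 'Targets[].Arn' --output text

  # Resource policy; look for a statement whose principal is events.amazonaws.com.
  # A ResourceNotFoundException here means the function has no policy at all.
  aws lambda get-policy \
    --function-name "dagy-${env}-scheduler" \
    --query 'Policy' --output text

  # Error lines from the last hour (start time is in epoch milliseconds).
  # Assumes the default log group name and an ERROR marker in log lines.
  aws logs filter-log-events \
    --log-group-name "/aws/lambda/dagy-${env}-scheduler" \
    --filter-pattern 'ERROR' \
    --start-time "$(( ($(date +%s) - 3600) * 1000 ))" \
    --max-items 20 \
    --query 'events[].message' --output text
}
```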

Remediation:

  1. Re-enable the rule if disabled.
  2. Confirm target ARN points to dagy-<environment>-scheduler.
  3. Re-deploy the CDK stack if permission/target drift exists.
  4. Trigger one manual scheduler event to verify:
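# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload; omit it on v1.
aws lambda invoke \
  --function-name "dagy-<environment>-scheduler" \
  --cli-binary-format raw-in-base64-out \
  --payload '{"dagy_event_type":"scheduler_tick","max_due":100}' \
  /tmp/scheduler-tick-response.json
cat /tmp/scheduler-tick-response.json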

Expected fields in the response payload:

  • status=processed
  • triggered is greater than 0 when due schedules exist
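
The response file can be checked without reading it by eye. This is a minimal sketch that assumes the response is a JSON object with top-level status and triggered fields; the exact response shape is not documented here, so treat the patterns as a starting point.

```shell
# Sketch only: assumes top-level "status" and "triggered" fields in the response JSON.
check_tick_response() {
  local file="${1:-/tmp/scheduler-tick-response.json}"
  local triggered

  if ! grep -Eq '"status"[[:space:]]*:[[:space:]]*"processed"' "$file"; then
    echo "FAIL: status is not processed"
    return 1
  fi

  # Extract the integer after "triggered"; empty if the field is missing.
  triggered="$(sed -n 's/.*"triggered"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$file" | head -n 1)"
  echo "OK: status=processed triggered=${triggered:-unknown}"
}
```

A triggered value of 0 is not a failure on its own; it only indicates a problem when schedules are known to be due.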

Scenario 2: Backlog Growth

Symptoms:

  • queue depth keeps rising
  • due schedules are not draining
  • run creation lags behind expected cadence

Checks:

  • scheduler tick response counters: checked, triggered, failures
  • SQS metrics for visible/in-flight messages
  • API Lambda SQS consumer concurrency and error rate
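
To tell a rising backlog from a draining one, compare two queue depth samples taken a few minutes apart. The helper names below are illustrative; the sketch reuses the DAGY_EVENTS_QUEUE_URL variable from Quick Triage and treats visible plus in-flight messages as the total depth.

```shell
# Sketch only: queue_depth and depth_trend are hypothetical helper names.
queue_depth() {
  # Sums the visible and in-flight counters, returned as tab-separated text.
  aws sqs get-queue-attributes \
    --queue-url "$DAGY_EVENTS_QUEUE_URL" \
    --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
    --query 'Attributes.[ApproximateNumberOfMessages,ApproximateNumberOfMessagesNotVisible]' \
    --output text | awk '{ print $1 + $2 }'
}

depth_trend() {
  # Compares two samples: first ($1), then last ($2).
  if [ "$2" -gt "$1" ]; then
    echo rising
  elif [ "$2" -lt "$1" ]; then
    echo draining
  else
    echo flat
  fi
}
```

Typical use: first="$(queue_depth)"; sleep 300; depth_trend "$first" "$(queue_depth)". The SQS counters are approximate, so repeat the comparison before acting on a single result.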

Remediation:

  1. Temporarily increase the scheduler batch size (max_due) in the tick payload.
  2. Increase Lambda concurrency/capacity for the API Lambda that consumes the SQS queue.
  3. Verify that failed schedules have last_error populated, then fix the root cause.
  4. Temporarily disable noisy schedules using POST /schedules with enabled=false.
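
For step 1, a manual tick with a larger max_due can be sent the same way as the manual tick in Scenario 1. The sketch below builds the payload from a validated number. The helper names are illustrative, and whether the scheduler enforces an upper bound on max_due is not covered by this runbook.

```shell
# Sketch only: build_tick_payload and run_tick are hypothetical helper names.
build_tick_payload() {
  local max_due="${1:?usage: build_tick_payload <max_due>}"
  # Reject anything that is not a positive integer without leading zeros.
  case "$max_due" in
    ''|*[!0-9]*|0*) echo "max_due must be a positive integer" >&2; return 1 ;;
  esac
  printf '{"dagy_event_type":"scheduler_tick","max_due":%s}\n' "$max_due"
}

run_tick() {
  local env="${1:?usage: run_tick <environment> <max_due>}"
  local payload
  payload="$(build_tick_payload "$2")" || return 1
  # --cli-binary-format is an AWS CLI v2 option; drop it on v1.
  aws lambda invoke \
    --function-name "dagy-${env}-scheduler" \
    --cli-binary-format raw-in-base64-out \
    --payload "$payload" \
    /tmp/scheduler-tick-response.json
}
```

For example, run_tick staging 250 sends one tick that may process up to 250 due schedules. Return to the normal value once the backlog has drained.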

Scenario 3: Catchup Behavior Issues

Catchup policy semantics:

  • none: after a trigger, next run is computed from current time; backlog is skipped.
  • all: after a trigger, next run advances from previous due timestamp; backlog is replayed.
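
The difference between the two policies is easiest to see with numbers. The sketch below models an interval schedule only, and it assumes that "computed from current time" means the current time plus interval_seconds; the scheduler's actual calculation may differ.

```shell
# Sketch only: illustrates the two policies for an interval schedule.
# Assumes "none" means now + interval; this is not taken from the scheduler source.
next_run_epoch() {
  local policy="$1" prev_due="$2" interval="$3" now="$4"
  case "$policy" in
    none) echo $(( now + interval )) ;;       # backlog is skipped
    all)  echo $(( prev_due + interval )) ;;  # may still be in the past, so it fires again
    *)    echo "unknown catchup_policy: $policy" >&2; return 1 ;;
  esac
}
```

Take a schedule that was due at epoch 1000 with a 60-second interval, triggered late at epoch 1300. With none, next_run_epoch none 1000 60 1300 gives 1360, so the four missed runs are dropped. With all, next_run_epoch all 1000 60 1300 gives 1060, which is already in the past, so the schedule keeps triggering on later ticks until next_run_epoch passes the current time.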

Checks:

  • inspect schedule payload: catchup_policy, mode, next_run_epoch
  • compare expected next due time with actual value in schedule record

Remediation:

  1. Set catchup_policy=none to stop backlog replay.
  2. Set catchup_policy=all only when replay is desired and capacity is adequate.
  3. For interval schedules, ensure interval_seconds is realistic for workload size.
  4. For cron schedules, validate timezone and cron expression.

Temporary Safety Actions

Use these to stabilize production quickly:

  • Disable a schedule:
    • Upsert same schedule_id with enabled=false
  • Reduce blast radius:
    • lower max_due for scheduler tick payload
  • Manual controlled trigger:
    • POST /schedules/{schedule_id}/trigger for one schedule at a time
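
The two API actions can be sketched with curl. Several details here are assumptions: DAGY_API_URL is a hypothetical variable for the API base URL, authentication headers are omitted, and the request body for the upsert is a guess at the minimal shape.

```shell
# Sketch only: DAGY_API_URL is a hypothetical base URL; add whatever auth
# headers your deployment requires.
disable_schedule() {
  local schedule_id="${1:?usage: disable_schedule <schedule_id>}"
  # Assumed body shape. If the upsert replaces the whole record, send the
  # full schedule definition with enabled set to false instead.
  curl --fail --silent --show-error \
    -X POST "${DAGY_API_URL}/schedules" \
    -H 'Content-Type: application/json' \
    -d "{\"schedule_id\":\"${schedule_id}\",\"enabled\":false}"
}

trigger_schedule() {
  local schedule_id="${1:?usage: trigger_schedule <schedule_id>}"
  # One schedule per call keeps the manual trigger controlled.
  curl --fail --silent --show-error \
    -X POST "${DAGY_API_URL}/schedules/${schedule_id}/trigger"
}
```

Both helpers assume schedule IDs contain no quotes or other characters that need JSON or URL escaping.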

Recovery Validation

After remediation, confirm:

  • scheduler Lambda runs every minute without error spikes
  • due schedules decrease over time
  • run creation rate aligns with schedule definitions
  • no repeated last_error updates for the same schedule
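
The first check can be confirmed from CloudWatch's built-in Lambda metrics. The sketch below sums one metric for the scheduler function over a time window; the helper name is illustrative, and the window is passed in as ISO 8601 timestamps.

```shell
# Sketch only: scheduler_metric_sum is a hypothetical helper name.
scheduler_metric_sum() {
  local env="${1:?usage: scheduler_metric_sum <environment> <metric> <start> <end>}"
  local metric="${2:?metric name required, for example Invocations or Errors}"
  # One-minute periods; sums the per-period Sum values across the window.
  aws cloudwatch get-metric-statistics \
    --namespace AWS/Lambda \
    --metric-name "$metric" \
    --dimensions "Name=FunctionName,Value=dagy-${env}-scheduler" \
    --start-time "$3" --end-time "$4" \
    --period 60 --statistics Sum \
    --query 'sum(Datapoints[].Sum)' --output text
}
```

Over a 30-minute window, Invocations should total roughly 30 for a one-minute tick, and Errors should total 0. With GNU date the start of the window can be computed as date -u -d '-30 minutes' +%Y-%m-%dT%H:%M:%SZ; BSD and macOS date use -v-30M instead of -d.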