Troubleshooting Guide

Common issues and their resolutions when working with the Dagy platform.


Flow Executor Issues

"Flow Executor backend was requested but is not available"

Error message:

Flow Executor backend was requested (execution_mode='micro') but is not available.
Ensure the DAGY_FLOW_EXECUTOR_FUNCTION environment variable is set.

Cause: The run's execution_mode is set to nano or micro, which routes to the flow-executor backend. However, the Flow Executor Lambda is not registered because DAGY_FLOW_EXECUTOR_FUNCTION is not set on the API Lambda.

Resolution:

  1. Deploy the Flow Executor Lambda using the CDK stack (it is provisioned automatically as {app}-flow-executor-{environment}).
  2. Verify the DAGY_FLOW_EXECUTOR_FUNCTION environment variable is set on both the API Lambda and the Flow Executor Lambda itself (the CDK stack handles this via common_env).
  3. After deploying, the BackendRegistry will register the flow-executor backend on startup and routing will succeed.
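
As a rough preflight check, the condition described above can be reproduced locally. This is a minimal sketch, not Dagy's actual BackendRegistry code: the function name check_flow_executor and the FLOW_EXECUTOR_MODES set are hypothetical, and the only facts taken from this guide are the variable name and that nano and micro route to the flow-executor backend.

```python
from typing import Mapping

ENV_VAR = "DAGY_FLOW_EXECUTOR_FUNCTION"
# Per this guide, these modes route to the flow-executor backend.
FLOW_EXECUTOR_MODES = frozenset({"nano", "micro"})


def check_flow_executor(execution_mode: str, env: Mapping[str, str]) -> str:
    """Return the configured function name, or raise if the backend is unavailable.

    Hypothetical helper for illustration; it only mirrors the error condition
    described in this section.
    """
    if execution_mode not in FLOW_EXECUTOR_MODES:
        raise ValueError(
            f"execution_mode={execution_mode!r} is not covered by this check"
        )
    function_name = env.get(ENV_VAR, "").strip()
    if not function_name:
        raise RuntimeError(
            f"Flow Executor backend was requested (execution_mode={execution_mode!r}) "
            f"but is not available. Ensure the {ENV_VAR} environment variable is set."
        )
    return function_name
```

Calling check_flow_executor("micro", os.environ) inside the API Lambda's environment would show whether the variable is visible to that process.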

Import errors after deployment (missing dependency packages)

Symptoms: The flow starts but fails with ModuleNotFoundError for packages that should be available (e.g., pandas, numpy).

Cause: Dependency packages are configured on the deployment, but either the packages do not exist or their S3 URIs cannot be resolved.

Resolution:

  1. Verify the deployment has the correct dep_package_slugs set:
    dagy deployments show my-deployment
    
  2. Ensure the dependency packages exist and are in ACTIVE status:
    dagy dep-packages list
    
  3. Check that the Flow Executor Lambda's IAM role has s3:GetObject permission on the dagy/dep-packages/* prefix in the artifacts bucket.
  4. Check CloudWatch logs for the Flow Executor Lambda for download errors.
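
For step 3, it can help to know what a matching IAM statement looks like when comparing it against the role's attached policies. The sketch below builds one in plain Python. The bucket name example-artifacts-bucket and the helper s3_prefix_statement are assumptions for illustration; the real statement is generated by your CDK stack and may be shaped differently.

```python
import json
from typing import Dict, List


def s3_prefix_statement(bucket: str, prefix: str, actions: List[str]) -> Dict[str, object]:
    """Build an IAM Allow statement for the given actions on one S3 key prefix."""
    return {
        "Effect": "Allow",
        "Action": list(actions),
        # Object-level actions such as s3:GetObject apply to object ARNs.
        "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [
        # Hypothetical bucket name; substitute your artifacts bucket.
        s3_prefix_statement("example-artifacts-bucket", "dagy/dep-packages/", ["s3:GetObject"]),
    ],
}
print(json.dumps(policy, indent=2))
```

The same helper could describe the dagy/task-state/* prefix by passing both s3:GetObject and s3:PutObject.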

Per-task invocation can't find upstream output

Symptoms: In the micro runtime tier, a downstream task fails with a KeyError, or receives None, when it tries to access output from an upstream task.

Cause: Per-task invocation mode persists each task's output to S3 and loads it for downstream consumers. If S3 permissions are misconfigured or the output is too large, the handoff fails.

Resolution:

  1. Verify the Flow Executor Lambda's IAM role has both s3:GetObject and s3:PutObject on the dagy/task-state/* prefix.
  2. Check CloudWatch logs for S3 upload/download errors between task invocations.
  3. Ensure upstream tasks return serializable output (JSON-safe dicts, lists, strings, numbers).
  4. For large outputs, verify the Lambda's memory is sufficient. Task state serialization happens in-memory.
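
To check steps 3 and 4 before deploying, a task's return value can be tested locally. This is a minimal sketch that assumes task state is serialized as JSON, which the guide implies but does not spell out. The helper name check_task_output is hypothetical.

```python
import json
from typing import Any, Optional, Tuple


def check_task_output(output: Any) -> Tuple[int, Optional[str]]:
    """Return (size_in_bytes, error); error is None when output is JSON-safe.

    Assumes JSON serialization; Dagy's actual encoder may differ.
    """
    try:
        # allow_nan=False rejects NaN and Infinity, which are not valid JSON.
        encoded = json.dumps(output, allow_nan=False).encode("utf-8")
    except (TypeError, ValueError) as exc:
        return 0, str(exc)
    return len(encoded), None


size, error = check_task_output({"rows": [1, 2, 3]})
print(size, error)

# A set is not JSON-serializable, so this reports an error instead of a size.
size, error = check_task_output({"ids": {1, 2}})
print(size, error)
```

The reported size gives a rough sense of how much memory the serialized payload needs, since serialization happens in memory.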

Flow times out (15-minute Lambda limit)

Symptoms: The run status shows FAILED with a timeout error or the Lambda times out before all tasks complete.

Cause: Lambda functions have a maximum execution time of 900 seconds (15 minutes). If your flow's total execution exceeds this, the invocation is terminated.

Resolution:

  1. Use per-task invocation: Set execution_mode to micro so that each task runs in its own invocation, with its own 15-minute limit, instead of all tasks sharing one:
    dagy deployments settings my-deployment --execution-mode micro
    
  2. Use a larger runtime tier: For flows that need more than 15 minutes per task, consider the larger tiers (small, medium, large, xlarge), which support longer durations:
    dagy deployments settings my-deployment --execution-mode large
    
  3. Optimize tasks: Break large tasks into smaller units to fit within the invocation timeout.
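
The choice between these options can be sketched as a simple rule of thumb. The function below is a hypothetical illustration, not part of the dagy CLI. It assumes tasks run sequentially, that you have rough per-task duration estimates, and it ignores startup and S3 handoff overhead.

```python
from typing import Sequence

LAMBDA_LIMIT_SECONDS = 900  # 15-minute Lambda maximum cited in this section


def suggest_execution_mode(task_seconds: Sequence[float]) -> str:
    """Suggest a mode from estimated task durations (sequential tasks assumed)."""
    if not task_seconds or sum(task_seconds) <= LAMBDA_LIMIT_SECONDS:
        # The whole flow fits in one invocation.
        return "nano"
    if max(task_seconds) <= LAMBDA_LIMIT_SECONDS:
        # Each task fits in its own invocation.
        return "micro"
    # At least one task exceeds the limit: small, medium, large, or xlarge.
    return "larger-tier"


print(suggest_execution_mode([100, 200]))
print(suggest_execution_mode([600, 600]))
print(suggest_execution_mode([1200]))
```

In practice you would leave headroom below the 900-second limit rather than planning right up to it.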

Routing Issues

Flow runs on the wrong backend

Symptoms: A flow configured with execution_mode set to nano or micro runs on the plain lambda backend instead of flow-executor, so it is missing workspace bootstrap and dependency packages.

Cause: This was a known issue (C1 in the gap analysis) that has been fixed. With the fix, the system now fails explicitly instead of silently falling through. If you still see silent fallthrough, ensure you are running the latest version.

Resolution:

  1. Verify DAGY_FLOW_EXECUTOR_FUNCTION is set (see "Flow Executor backend was requested but is not available" above).
  2. Check the run record's executor field to confirm which backend was used:
    dagy runs show <run_id>
    
  3. If the executor shows lambda instead of flow-executor, check that execution_mode is set on the deployment or run.

Scheduled/triggered runs ignore execution_mode

Symptoms: Runs triggered by the scheduler or via flow_trigger events don't use the expected execution mode.

Cause: This was a known issue (C2 in the gap analysis) where flow_trigger events did not forward execution_mode. This has been fixed. The execution_mode is now included in the RunRequest and the run_execute event payload.

Resolution: Ensure you are running the latest version. If the issue persists, check:

  1. The deployment record has execution_mode set.
  2. The schedule's trigger event includes execution_mode in the payload.
  3. CloudWatch logs for the event handler show execution_mode being passed through.

Retried runs lose execution_mode

Symptoms: A run that is retried ends up on a different backend than the original run.

Cause: This was a known issue (C3 in the gap analysis) where run_retry events did not forward execution_mode. This has been fixed. The retry handler now looks up the run record and includes execution_mode in the retry event.

Resolution: Ensure you are running the latest version. The fix reads execution_mode from the original run record when constructing the retry event.
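
The behavior described by the fix can be pictured roughly as follows. This is an illustrative sketch only: the event shape, the run_id field, and the build_retry_event name are assumptions, not Dagy's actual retry handler. The point it shows is that execution_mode is copied from the original run record into the retry event.

```python
from typing import Any, Dict


def build_retry_event(run_record: Dict[str, Any]) -> Dict[str, Any]:
    """Build a retry event that carries over the original run's execution_mode.

    Field names other than execution_mode are hypothetical.
    """
    event: Dict[str, Any] = {"type": "run_retry", "run_id": run_record["run_id"]}
    mode = run_record.get("execution_mode")
    if mode is not None:
        # Forward the mode so the retry routes to the same backend.
        event["execution_mode"] = mode
    return event


print(build_retry_event({"run_id": "run-123", "execution_mode": "micro"}))
```

If a retried run still lands on a different backend, comparing the original run record with the retry event in CloudWatch logs should show where the value is dropped.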


Deployment Issues

"Invalid execution_mode" validation error

Symptoms: Deploy or settings update fails with Invalid execution_mode: <value>. Must be one of {'nano', 'micro', 'small', 'medium', 'large', 'xlarge'}.

Cause: The execution_mode field only accepts valid runtime tier values.

Resolution: Use one of the valid values:

  • nano (default): entire flow runs in a single invocation
  • micro: each task runs in a separate invocation
  • small, medium, large, xlarge: larger runtime tiers with extended execution times
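
If you generate deployment settings from scripts, the same check can be applied before calling the CLI or API. This is a minimal sketch of the validation described above. The set of values comes from the error message, while the function name and exact wording of the raised message are assumptions.

```python
VALID_EXECUTION_MODES = frozenset(
    {"nano", "micro", "small", "medium", "large", "xlarge"}
)


def validate_execution_mode(value: str) -> str:
    """Return value unchanged if it is a valid runtime tier, else raise ValueError."""
    if value not in VALID_EXECUTION_MODES:
        raise ValueError(
            f"Invalid execution_mode: {value}. "
            f"Must be one of {sorted(VALID_EXECUTION_MODES)}"
        )
    return value


print(validate_execution_mode("micro"))
```

Note that the check is case-sensitive in this sketch, so a value such as "Micro" would be rejected.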

CLI Issues

"Failed to trigger run" error

Symptoms: dagy run my-deployment fails with an API error.

Resolution:

  1. Verify the API URL is configured: dagy config show
  2. Ensure your token is valid: dagy login
  3. Check the deployment exists and is in ACTIVE status
  4. If using --execution-mode, ensure the value is one of nano, micro, small, medium, large, or xlarge