ECS Fargate Operations Runbook
The Dagy ECS infrastructure is defined in `infrastructure/dagy_stack.py` using AWS CDK. Before deploying, configure your stack settings in the StackConfig dataclass.
Deployment Notes
CDK Deployment
To deploy the stack:
cd infrastructure
pip install -r requirements.txt
cdk synth # Synthesize CloudFormation template
cdk deploy --require-approval=never # Deploy to your AWS account
The CDK deployment provisions the following key resources:
- ECS Fargate cluster with Fargate and Fargate Spot capacity providers
- DynamoDB tables (DAGY_RUNS, DAGY_TASK_RUNS, DAGY_FLOWS, DAGY_DEPLOYMENTS, DAGY_DEP_PACKAGES)
- S3 buckets for artifacts and exception traces
- Lambda functions for dag_launcher_handler and reconciler_handler
- EventBridge rules for scheduled reconciliation
- CloudWatch log groups
- IAM roles and policies for proper privilege separation
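The stack outputs include environment variable values needed by Lambda and ECS. Write them to a JSON file with --outputs-file, then convert them to shell exports (this assumes each output key is a valid shell variable name):
cdk deploy --require-approval=never --outputs-file deployment-outputs.json
jq -r '.[] | to_entries[] | "export \(.key)=\(.value | @sh)"' deployment-outputs.json > deployment.env
source deployment.env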
Image Publishing
The dagy-worker Docker image must be built and pushed to ECR before ECS tasks can run.
Build the image:
docker build -f Dockerfile.worker -t dagy-worker:latest .
docker tag dagy-worker:latest dagy-worker:$(git rev-parse --short HEAD) # Tag with commit hash
Authenticate Docker to ECR:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com
Push the image:
docker push $(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com/dagy-worker:latest
docker push $(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com/dagy-worker:$(git rev-parse --short HEAD)
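Update the Lambda environment variable DAGY_ECS_WORKER_IMAGE to point to the new image. Note that --environment replaces the function's entire variable set, so include every existing variable in Variables={...}, not only the one being changed: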
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=us-east-1
IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/dagy-worker:latest"
aws lambda update-function-configuration \
--function-name dag-launcher \
--environment "Variables={DAGY_ECS_WORKER_IMAGE=${IMAGE_URI}}"
For safe rollout, test the new image in staging before updating production. Consider tagging images with release versions (dagy-worker:v1.2.3) and testing each version before promoting it to latest.
Configuration Management
Critical configuration should be managed via AWS Systems Manager Parameter Store or Secrets Manager to avoid hardcoding in Lambda environment variables.
For parameters accessible at Lambda startup:
import boto3
ssm = boto3.client("ssm")
cluster_arn = ssm.get_parameter(
Name="/dagy/ecs/cluster-arn",
WithDecryption=False
)["Parameter"]["Value"]
For secrets (API keys, database passwords):
secrets_client = boto3.client("secretsmanager")
db_password = secrets_client.get_secret_value(
SecretId="prod/database/password"
)["SecretString"]
Lambda caps environment variables at 4 KB in total, so keep their number small. Use Parameter Store for frequently changing values and Secrets Manager for sensitive data.
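When a function needs several Parameter Store values at startup, fetching them in one batch keeps cold starts short. The sketch below is one possible helper, not part of Dagy: load_parameters is a hypothetical name, and the ssm argument is assumed to be a boto3 SSM client (for example boto3.client("ssm")). GetParameters accepts at most 10 names per call, so the names are batched.

```python
from typing import Any, Dict, Iterable, List


def load_parameters(ssm: Any, names: Iterable[str]) -> Dict[str, str]:
    """Fetch Parameter Store values in batches of 10 (the GetParameters limit)."""
    wanted: List[str] = list(names)
    values: Dict[str, str] = {}
    missing: List[str] = []
    for start in range(0, len(wanted), 10):
        batch = wanted[start:start + 10]
        response = ssm.get_parameters(Names=batch, WithDecryption=True)
        for param in response["Parameters"]:
            values[param["Name"]] = param["Value"]
        # Names that do not exist come back here instead of raising.
        missing.extend(response.get("InvalidParameters", []))
    if missing:
        raise KeyError(f"Parameters not found: {sorted(missing)}")
    return values
```

Calling it once at module import time, outside the Lambda handler, lets warm invocations reuse the values.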
Monitoring ECS Tasks
CloudWatch Metrics
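ECS publishes CPUUtilization and MemoryUtilization to CloudWatch under the AWS/ECS namespace. Task-count metrics are published under ECS/ContainerInsights, and only when Container Insights is enabled on the cluster. Key metrics for Dagy workloads:
- RunningTaskCount: Number of running tasks (Container Insights)
- TaskFailure: Count of tasks that exited with a non-zero code. ECS does not publish a metric by this name, so confirm that your deployment emits it before alarming on it
- DesiredTaskCount, PendingTaskCount, RunningTaskCount: Task launch progress (Container Insights)
- CPUUtilization, MemoryUtilization: Resource usage, aggregated per service or cluster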
Query metrics programmatically:
import datetime
import boto3
cloudwatch = boto3.client("cloudwatch")
now = datetime.datetime.now(datetime.timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ServiceName", "Value": "dagy-worker"},
        {"Name": "ClusterName", "Value": "dagy"},
    ],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,  # 5-minute granularity
    Statistics=["Average", "Maximum"],
)
# Datapoints are returned in no particular order, so sort by timestamp.
for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    print(f"{datapoint['Timestamp']}: CPU {datapoint['Average']:.1f}%")
CloudWatch Logs and Insights
All ECS worker logs are written to /ecs/dagy-worker in JSON format, enabling structured querying.
Common CloudWatch Insights queries:
Count failed runs per flow (set the query time range to the last 24 hours):
fields @timestamp, run_id, flow_name, error
| filter status = "FAILED"
| stats count() by flow_name
Show the distribution of run durations:
fields @timestamp, run_id, duration_seconds
| filter ispresent(duration_seconds)
| stats avg(duration_seconds), max(duration_seconds), pct(duration_seconds, 95) by flow_name
Identify tasks that ran out of memory:
fields @timestamp, run_id, task_name, message
| filter message like /OOM|out of memory|MemoryError/
Find the slowest task runs:
fields @timestamp, task_run_id, task_name, duration_seconds
| filter event_type = "task_end"
| sort duration_seconds desc
| limit 20
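Create a CloudWatch alarm for a high task failure count. The threshold is an absolute count per 5-minute period, not a percentage:
aws cloudwatch put-metric-alarm \
--alarm-name dagy-high-failure-rate \
--alarm-description "Alert when more than 10 ECS tasks fail in each of two consecutive 5-minute periods" \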
--metric-name TaskFailure \
--namespace AWS/ECS \
--statistic Sum \
--period 300 \
--evaluation-periods 2 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--alarm-actions "arn:aws:sns:us-east-1:123456789:dagy-alerts"
Tracing and Correlation IDs
Every run has a correlation_id that ties together multiple logs and metrics. To trace a specific execution:
RUN_ID="550e8400-e29b-41d4-a716-446655440000"
aws logs get-log-events \
--log-group-name /ecs/dagy-worker \
--log-stream-name "${RUN_ID}" | jq '.events[].message | fromjson | {timestamp, event_type, message}'
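For batch launches, the parent correlation_id spans all child runs. start-query runs asynchronously and returns a queryId; fetch the results with aws logs get-query-results --query-id <query-id>: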
PARENT_CORRELATION="batch-12345"
aws logs start-query \
--log-group-name /ecs/dagy-worker \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string "fields @timestamp, run_id, event_type, message | filter correlation_id like /${PARENT_CORRELATION}/" \
--output json
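Because Logs Insights queries are asynchronous, scripted tracing needs a polling loop. The following is a minimal sketch under stated assumptions: run_insights_query is a hypothetical helper, and logs is assumed to be a boto3 CloudWatch Logs client (boto3.client("logs")). It starts a query, polls until it reaches a terminal status, and flattens each result row into a dictionary.

```python
import time
from typing import Any, Dict, List


def run_insights_query(logs: Any, log_group: str, query: str,
                       start_time: int, end_time: int,
                       poll_seconds: float = 1.0,
                       timeout_seconds: float = 60.0) -> List[Dict[str, str]]:
    """Run a Logs Insights query and return rows as {field: value} dicts."""
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_time,  # epoch seconds
        endTime=end_time,
        queryString=query,
    )["queryId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = logs.get_query_results(queryId=query_id)
        status = result["status"]
        if status == "Complete":
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in result["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Query {query_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Query {query_id} did not finish in {timeout_seconds}s")
        time.sleep(poll_seconds)
```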
ECS Status Reconciler
The status reconciler is a Lambda function that periodically checks for runs that are stuck in RUNNING status and reconciles them against the actual ECS task state.
How It Works
The reconciler function reconcile_ecs_runs() in src/dagy_api/ecs/reconciler.py performs these steps:
- Query DynamoDB for all runs with status=RUNNING and executor=ecs
- Filter to runs older than stale_threshold_seconds (default 300 seconds / 5 minutes) to avoid racing recently-launched tasks
- For each run, extract the external_id (task ARN) and call describe_tasks()
- If the ECS task is in a terminal state (STOPPED, etc.), check the exit code and update the run status accordingly
- Return a summary of how many runs were checked and updated
The reconciler is idempotent; running it multiple times on the same data produces the same result.
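The status decision in the steps above can be sketched as a pure function. This is an illustration only, not the code in src/dagy_api/ecs/reconciler.py: resolve_run_status is a hypothetical name, and it assumes the input is one entry of the "tasks" list returned by the ECS describe_tasks() call. Because the result depends only on the task description, applying it repeatedly to the same data gives the same answer.

```python
from typing import Any, Dict, Optional


def resolve_run_status(task: Dict[str, Any]) -> Optional[str]:
    """Map one describe_tasks() task entry to a final run status.

    Returns None while the task has not stopped, so the run is left untouched.
    """
    if task.get("lastStatus") != "STOPPED":
        return None
    exit_codes = [c.get("exitCode") for c in task.get("containers", [])]
    # A container that never started has no exitCode; treat that as a failure.
    if exit_codes and all(code == 0 for code in exit_codes):
        return "SUCCEEDED"
    return "FAILED"
```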
Scheduling the Reconciler
Use EventBridge to invoke the reconciler on a schedule:
aws events put-rule \
--name dagy-reconciler \
--schedule-expression "rate(5 minutes)" \
--state ENABLED
aws events put-targets \
--rule dagy-reconciler \
--targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789:function:ecs-reconciler","RoleArn"="arn:aws:iam::123456789:role/EventBridgeInvokeRole"
Configure the input event to customize behavior:
aws events put-targets \
--rule dagy-reconciler \
--targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789:function:ecs-reconciler","Input"='{"max_runs":100,"stale_threshold_seconds":300}'
Parameters:
- max_runs: Maximum number of runs to check per invocation (default 50). Higher values check more, but take longer.
- stale_threshold_seconds: Age threshold in seconds (default 300). Only runs older than this are reconciled.
A typical configuration checks every 5 minutes with a 5-minute staleness threshold, catching task failures within 10 minutes.
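Embedding a JSON payload in the CLI shorthand syntax is easy to misquote, so building the target in code can be more robust. The sketch below uses hypothetical helper names; the resulting dictionary is shaped for the Targets list of the EventBridge put_targets call. The second function spells out the 10-minute figure: a failed run must first age past the staleness threshold and then wait up to one schedule interval for the next invocation.

```python
import json
from typing import Any, Dict


def build_reconciler_target(lambda_arn: str, max_runs: int = 50,
                            stale_threshold_seconds: int = 300) -> Dict[str, Any]:
    """Build one EventBridge target whose Input is the reconciler payload."""
    if max_runs <= 0 or stale_threshold_seconds < 0:
        raise ValueError("max_runs must be positive and the threshold non-negative")
    payload = {"max_runs": max_runs, "stale_threshold_seconds": stale_threshold_seconds}
    # Input must be a JSON string, not a nested object.
    return {"Id": "1", "Arn": lambda_arn, "Input": json.dumps(payload)}


def worst_case_detection_seconds(schedule_seconds: int,
                                 stale_threshold_seconds: int) -> int:
    """Upper bound on the delay between a task failing and the run being updated."""
    return schedule_seconds + stale_threshold_seconds
```

With a boto3 EventBridge client, the target would be passed as events.put_targets(Rule="dagy-reconciler", Targets=[build_reconciler_target(arn, 100, 300)]).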
Monitoring the Reconciler
Query the reconciler Lambda logs to monitor its health:
aws logs filter-log-events \
--log-group-name /aws/lambda/ecs-reconciler \
--start-time $(date -d '1 hour ago' +%s)000 \
--filter-pattern '"Reconciliation complete"' \
--query 'events[].message' | jq -r '.[] | fromjson'
Set up an alarm for reconciler failures:
aws cloudwatch put-metric-alarm \
--alarm-name dagy-reconciler-failure \
--alarm-description "Alert when reconciler Lambda fails" \
--metric-name Errors \
--namespace AWS/Lambda \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--dimensions "Name"="FunctionName","Value"="ecs-reconciler" \
--alarm-actions "arn:aws:sns:us-east-1:123456789:dagy-alerts"
Troubleshooting Common Failures
Task Launch Failures
Symptom: DAG launch succeeds (returns run_id), but the run immediately transitions to FAILED status with error "ECS launch failed".
Causes and remedies:
- Insufficient capacity: The cluster has no available Fargate capacity.
  - Check cluster capacity: aws ecs describe-clusters --clusters dagy
  - Check the capacityProviders and registeredCapacityProviders fields
  - Scale up the cluster or adjust the launch request to use smaller resources
- Invalid CPU/memory combination: The requested combination is not valid for Fargate.
  - Verify the cpu and memory are in the WORKLOAD_PROFILES list or match valid combinations
  - Recall that 256 CPU only supports 512/1024/2048 MB memory; larger CPU values support higher memory
- Security group or subnet misconfiguration: The cluster subnets or security groups are misconfigured.
  - Verify subnets exist and are active: aws ec2 describe-subnets --subnet-ids <subnet-id>
  - Verify security groups exist and allow outbound HTTPS: aws ec2 describe-security-groups --group-ids <sg-id>
  - Check that security groups allow outbound to S3, DynamoDB, CloudWatch Logs
- IAM role ARNs are incorrect: The task execution role or task role ARN is invalid or doesn't exist.
  - Verify the role exists: aws iam get-role --role-name ecs-exec-role
  - Verify the role has permissions for ECS task execution
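The CPU/memory check above can be automated before a launch request is sent. The table below reflects the Fargate task sizes for Linux as I understand them (CPU units mapped to allowed memory in MiB); treat it as a sketch and confirm against the current Fargate documentation. The names FARGATE_MEMORY_BY_CPU and is_valid_fargate_size are hypothetical and unrelated to Dagy's WORKLOAD_PROFILES.

```python
from typing import Dict, List


def _mib_range(start_gib: int, stop_gib: int, step_gib: int = 1) -> List[int]:
    """Inclusive range of memory sizes in MiB, given bounds in GiB."""
    return [gib * 1024 for gib in range(start_gib, stop_gib + 1, step_gib)]


# CPU units -> allowed memory values in MiB.
FARGATE_MEMORY_BY_CPU: Dict[int, List[int]] = {
    256: [512, 1024, 2048],
    512: _mib_range(1, 4),
    1024: _mib_range(2, 8),
    2048: _mib_range(4, 16),
    4096: _mib_range(8, 30),
    8192: _mib_range(16, 60, 4),     # 4 GiB increments
    16384: _mib_range(32, 120, 8),   # 8 GiB increments
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """Return True if Fargate accepts this cpu/memory pair."""
    return memory_mib in FARGATE_MEMORY_BY_CPU.get(cpu, [])
```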
Out of Memory (OOM) Kills
Symptom: Task status transitions to FAILED with error "Container was killed" or "Task stopped due to memory exhaustion".
Causes and remedies:
- Insufficient memory for the workload: The flow loads too much data into memory.
  - Increase memory: launcher.launch(..., workload_profile="xlarge")
  - Rewrite the flow to use streaming or chunking instead of loading all data at once
  - Check which task consumes the most memory by examining logs for task-specific memory usage
- Memory leak in custom code: A task holds references to growing data structures.
  - Profile the code with tools like memory_profiler to identify allocations
  - Use generators instead of building entire lists in memory
  - Explicitly delete large objects when no longer needed
- Dependency packages are too large: Unpacking all dependencies exceeds available memory.
  - Reduce the size of dependency packages by excluding test files, documentation, etc.
  - Use a separate, slimmer dependency package for production
  - Consider pre-installing common dependencies in the Docker image instead of as runtime packages
Mitigation: Increase ephemeral_storage and memory simultaneously:
launcher.launch(
flow_name="memory_intensive",
org_id="acme-corp",
memory="8192",
ephemeral_storage_gib=100,
)
Monitor memory usage in logs. The ECS container runtime emits memory statistics that can be extracted via CloudWatch Insights.
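The streaming and generator advice above might look like the following in a flow task. This is a generic sketch, not Dagy code: iter_chunks and total_amount are hypothetical names, and the "amount" column is an assumed example field. Only one chunk of rows is held in memory at a time.

```python
import csv
from typing import Dict, Iterable, Iterator, List


def iter_chunks(rows: Iterable[Dict[str, str]],
                chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows without materializing the input."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk: List[Dict[str, str]] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial chunk


def total_amount(lines: Iterable[str], chunk_size: int = 1000) -> float:
    """Sum the 'amount' column of CSV text, one chunk at a time."""
    total = 0.0
    for chunk in iter_chunks(csv.DictReader(lines), chunk_size):
        total += sum(float(row["amount"]) for row in chunk)
    return total
```

An open file object can be passed as lines, since csv.DictReader reads it lazily line by line.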
Artifact Download Errors
Symptom: Task logs show "Downloading flow artifact" followed by an error like "NoSuchKey" or "AccessDenied".
Causes and remedies:
- Artifact S3 URI is invalid or points to a non-existent object:
  - Verify the artifact exists: aws s3 ls s3://artifact-bucket/path/to/artifact.zip
  - Verify the flow record in DynamoDB has the correct artifact_s3_uri
  - Re-upload the artifact if it was deleted
- Task role lacks S3 permissions:
  - Verify the ECS task role has s3:GetObject permission on the artifact bucket
  - Check the IAM policy: aws iam get-role-policy --role-name ecs-task-role --policy-name dagy-s3-access
  - Ensure the policy includes the artifact bucket ARN
- S3 bucket policy blocks access:
  - Check the bucket policy: aws s3api get-bucket-policy --bucket artifact-bucket
  - Ensure the policy allows the ECS task role to get objects
  - If using bucket encryption, ensure the KMS key allows the task role to decrypt
- Network connectivity issues:
  - Verify the ECS subnet can reach S3 via VPC endpoint or NAT Gateway
  - Check security group egress rules allow HTTPS to S3
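A malformed artifact_s3_uri can be caught before any S3 call is made. The helper below is a hypothetical sketch that splits an s3:// URI into bucket and key using only the standard library; it does not handle keys containing "?" or "#".

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' into (bucket, key), rejecting anything else."""
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"Not a valid s3://bucket/key URI: {uri!r}")
    return parsed.netloc, key
```

The pair can then be passed to the S3 head_object call to check existence under the task role. Note that S3 reports a missing key as 403 rather than 404 when the caller lacks s3:ListBucket on the bucket, so "AccessDenied" does not always mean the object exists.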
Dependency Package Installation Failures
Symptom: Task logs show "Installing dependency package" followed by an error like "pip install failed" or "Extract failed".
Causes and remedies:
- Dependency package is corrupted or missing:
  - Verify the package exists in S3: aws s3 ls s3://artifact-bucket/dep-packages/
  - Try to extract the package locally: unzip package.zip or tar -tzf package.tar.gz
  - Re-create and re-upload the package
- requirements.txt references packages not available in PyPI:
  - Check the requirements.txt file in the package
  - Ensure all packages are installable with pip (private repositories need credentials)
  - Pre-build and package wheels for faster installation
- Native extension compilation fails:
  - Some packages have C extensions that require build tools
  - The Dockerfile.worker includes gcc and build essentials, but some packages may need additional system libraries
  - Either pre-build the wheels or add system dependencies to the Dockerfile
- Insufficient disk space:
  - pip caches downloaded packages; with large dependency packages, this can fill the 21 GiB ephemeral storage
  - Override ephemeral_storage_gib to 50+ GiB for large packages
  - Or, pre-install common packages in the Docker image to reduce package size
Dependency Resolution and Import Errors
Symptom: Task logs show "Could not resolve callable for task" or "ImportError: No module named 'custom_module'".
Causes and remedies:
- Dependency package not included in deployment:
  - Verify the deployment_name is correct and the deployment includes the needed dep_package_slugs
  - Verify the dep_package_slug is registered and has a valid package_s3_uri
- Dependency package structure is incorrect:
  - Dependency packages should be ZIP files with the top-level directory as the package name
  - Ensure __init__.py exists in package directories
  - Test extraction locally: unzip package.zip and check the structure
- Custom module import path is wrong:
  - Verify the import_path in the FlowSpec matches the actual module structure
  - If using a custom module from a dependency package, the package must be in sys.path
- Flow source file is missing or unpacked incorrectly:
  - The artifact should contain the flow's .py files
  - Verify artifact integrity: unzip -t artifact.zip
  - Re-package the flow if files are missing
Debug: Enable verbose logging to trace module resolution:
launcher.launch(
flow_name="problematic_flow",
org_id="acme-corp",
extra_env_vars={"DAGY_LOG_LEVEL": "DEBUG"},
)
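The package structure check can also be scripted before upload. The sketch below uses a hypothetical helper name and only the standard library; it lists directories in a ZIP archive that contain Python files but no __init__.py. It assumes regular packages; namespace packages legitimately omit __init__.py and would be reported as well.

```python
import io
import zipfile
from typing import List


def find_packages_missing_init(zip_bytes: bytes) -> List[str]:
    """Return archive directories that hold .py files but no __init__.py."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        files = [name for name in archive.namelist() if not name.endswith("/")]
    # Directory part of every nested .py file.
    py_dirs = {name.rsplit("/", 1)[0] for name in files
               if name.endswith(".py") and "/" in name}
    init_dirs = {name.rsplit("/", 1)[0] for name in files
                 if name.endswith("/__init__.py")}
    return sorted(py_dirs - init_dirs)
```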
Task Execution Timeouts
Symptom: Task runs successfully for a while, then transitions to FAILED with "timeout" or "exceeded timeout".
Causes and remedies:
- Task timeout is too short for the workload:
  - Check the TaskSpec timeout_seconds in the flow definition
  - Increase it for long-running tasks: update the flow and re-package the artifact
  - Note that ECS has no hard timeout; tasks can run indefinitely
- Specific task has no retries and fails sporadically:
  - The task may be timing out due to external dependency slowness (database, API calls)
  - Configure retries in the TaskSpec: @task(retries=3, retry_delay_seconds=5)
  - Add exponential backoff or jitter to avoid thundering herd
- Upstream dependency failure blocks progress:
  - A task upstream timed out, and downstream tasks are skipped
  - Trace the log to find which task timed out first
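The backoff-with-jitter suggestion can be implemented inside a task that calls a slow external dependency. The function below is a generic sketch of the "full jitter" scheme and is not part of the @task decorator; backoff_delay and its parameters are hypothetical. Each retry sleeps a random time between zero and an exponentially growing, capped ceiling, which spreads retries from many concurrent runs.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base_seconds: float = 5.0,
                  cap_seconds: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    generator = rng if rng is not None else random.Random()
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return generator.uniform(0.0, ceiling)
```

A retry loop would call time.sleep(backoff_delay(attempt)) after each failed attempt, with attempt starting at 0.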
Stuck Runs in RUNNING Status
Symptom: A run remains in RUNNING status for hours, with no recent log entries. The ECS task is actually STOPPED.
Causes: The worker crashed without updating the run status (OOM kill, spot interruption, infrastructure failure). The reconciler should have caught and fixed this, but may not have run recently.
Remedies:
- Manual reconciliation: Invoke the reconciler Lambda directly (AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload):
  aws lambda invoke \
  --function-name ecs-reconciler \
  --cli-binary-format raw-in-base64-out \
  --payload '{"max_runs":100,"stale_threshold_seconds":0}' \
  /tmp/response.json
- Check reconciler is running: Verify the EventBridge rule is enabled and the Lambda has execution permissions:
  aws events describe-rule --name dagy-reconciler
  aws logs tail /aws/lambda/ecs-reconciler --follow
- Manually update the run: If urgent, directly update DynamoDB:
  import datetime
  from dagy_api.persistence.models import RunModel
  run = RunModel.get(run_id)
  run.update(actions=[
      RunModel.status.set("FAILED"),
      RunModel.completed_at.set(datetime.datetime.utcnow().isoformat()),
      RunModel.error_message.set("Reconciled: Task was STOPPED"),
  ])
Executor Mismatch Failures
Symptom: Run transitions to FAILED with error "Requested backend 'ecs' is not available. Ensure DAGY_ECS_CLUSTER_ARN is set."
Cause: A user selected "ecs" as the executor in the UI or API, but the ECS backend is not configured in the environment. The system now fails explicitly instead of silently falling back to Lambda.
Remedies:
- Verify ECS environment variables: Ensure DAGY_ECS_CLUSTER_ARN, DAGY_ECS_EXECUTION_ROLE_ARN, DAGY_ECS_TASK_ROLE_ARN, DAGY_ECS_SUBNETS, and DAGY_ECS_SECURITY_GROUPS are all set in the Lambda environment.
- Check CDK deployment: Run cdk diff to confirm all ECS resources were deployed.
- API-level validation: The API now validates executor availability at request time and returns HTTP 400 if the requested executor is not registered. Check the API response for details.
Graceful Shutdown (SIGTERM)
When ECS stops a task (via stop_task(), scaling events, or deployments), the container receives a SIGTERM signal before SIGKILL. The worker handles this gracefully:
- Sets a shutdown flag immediately on SIGTERM
- Waits up to 15 seconds for currently inflight tasks to complete
- Marks all remaining pending tasks as CANCELLED in DynamoDB
- Raises RuntimeError to trigger the FAILED status path
- The run status becomes FAILED with error "Worker shutdown requested via SIGTERM"
Important: The ECS default stop timeout is 30 seconds, so the worker's 15-second drain window fits within it. If you need a longer drain time, increase stopTimeout in the container definition; Fargate allows at most 120 seconds.
Best practice: Design tasks to be idempotent so that a SIGTERM-interrupted run can be safely retried.
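The flag-then-drain pattern described above can be sketched as follows. This is an illustration, not the worker's actual code: ShutdownFlag and drain are hypothetical names, and in-flight tasks are modeled as threads. The handler only sets a flag, and the main loop is expected to check it, stop scheduling new work, and call drain with the 15-second window.

```python
import signal
import threading
import time
from typing import List


class ShutdownFlag:
    """Records that SIGTERM arrived; the handler itself does no real work."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def install(self) -> None:
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self._event.set()

    @property
    def requested(self) -> bool:
        return self._event.is_set()


def drain(inflight: List[threading.Thread],
          drain_seconds: float = 15.0) -> List[threading.Thread]:
    """Wait up to drain_seconds in total; return the threads still running."""
    deadline = time.monotonic() + drain_seconds
    for worker in inflight:
        worker.join(timeout=max(0.0, deadline - time.monotonic()))
    return [worker for worker in inflight if worker.is_alive()]
```

Whatever drain returns corresponds to the work that did not finish in time and would be marked CANCELLED before the run fails.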
Retry Executor Preservation
When a run is retried, the system now preserves the original executor choice. If a run originally executed on ECS and is retried, the retry will also route to ECS (not fall back to Lambda).
If the executor is no longer available at retry time, the run will fail with an explicit error message rather than silently running on a different backend.
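The retry routing rule can be expressed in a few lines. The sketch below is hypothetical and does not reflect the actual Dagy implementation: executor_for_retry and ExecutorUnavailableError are illustrative names, and the set of available executors is assumed to come from whatever registry the API consults.

```python
from typing import Iterable


class ExecutorUnavailableError(RuntimeError):
    """Raised instead of silently routing a retry to a different backend."""


def executor_for_retry(original_executor: str, available: Iterable[str]) -> str:
    """Return the executor of the original run, or fail if it is not registered."""
    if original_executor not in set(available):
        raise ExecutorUnavailableError(
            f"Requested backend '{original_executor}' is not available."
        )
    return original_executor
```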
Scaling Considerations
Concurrent Task Limits
ECS clusters have a concurrent task limit based on capacity. The default configuration uses Fargate and Fargate Spot:
- Fargate capacity is unlimited within AWS account quotas (default 1000 concurrent tasks)
- Fargate Spot capacity is also unlimited but subject to interruption
To increase limits, request a quota increase via AWS Support.
Check current running task count:
aws ecs list-tasks --cluster dagy | jq '.taskArns | length'
Cluster Scaling Strategy
For predictable workloads, use Fargate. For batch workloads that tolerate interruption, use Fargate Spot to save costs.
Configure capacity provider preferences:
aws ecs create-cluster \
--cluster-name dagy \
--capacity-providers FARGATE FARGATE_SPOT \
--default-capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=70 capacityProvider=FARGATE,weight=30
This allocates 70% of tasks to Spot and 30% to on-demand Fargate.
For mission-critical DAGs that cannot tolerate interruption, override the default:
launcher.launch(
flow_name="critical_flow",
org_id="acme-corp",
extra_env_vars={"ECS_CAPACITY_PROVIDER": "FARGATE"}, # Custom, requires EcsBackend modification
)
Cost Optimization
- Right-size resource profiles: Use the smallest profile that fits the workload.
  - Profile small (256 CPU / 512 MB): ~$0.005/hour
  - Profile medium (1024 CPU / 2048 MB): ~$0.020/hour
  - Profile large (2048 CPU / 4096 MB): ~$0.040/hour
- Use Fargate Spot for batch workloads: Saves up to 70% on compute.
  - Suitable for workflows that can be retried
  - Not suitable for real-time or SLA-critical flows
- Pre-build Docker image with common dependencies: Reduces package download and installation time.
  - Speeds up task startup
  - Reduces ephemeral storage and memory usage
- Use VPC endpoints instead of NAT Gateway: Reduces data transfer costs for S3 and DynamoDB access.
  - Gateway endpoint for S3 and DynamoDB (no charge)
  - Interface endpoint for other services (small charge)
- Monitor per-flow costs: Track which flows consume the most resources.
  from dagy_api.persistence.models import RunModel
  runs = RunModel.query("flow_name")  # Query a specific flow
  total_cost = sum(run.estimated_cost_usd for run in runs if run.status == "SUCCEEDED")
Rollback Procedures
Rolling Back Container Image
If a new dagy-worker image is broken, revert to the previous image:
# Tag the previous image as latest
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=us-east-1
REPO="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/dagy-worker"
docker pull ${REPO}:v1.0.0 # Previous working version
docker tag ${REPO}:v1.0.0 ${REPO}:latest
docker push ${REPO}:latest
# Update Lambda environment variable (--environment replaces all variables, so list every existing one)
aws lambda update-function-configuration \
--function-name dag-launcher \
--environment "Variables={DAGY_ECS_WORKER_IMAGE=${REPO}:latest}"
New task launches will use the reverted image. Tasks that are already running keep the image they were launched with until they exit.
Rolling Back Infrastructure Changes
If a CDK deployment introduces a breaking change, revert using git and redeploy:
cd infrastructure
git checkout HEAD~1 # Revert to previous commit
cdk deploy --require-approval=never
CDK will diff the template and apply only the necessary changes to revert. Some resources (e.g., DynamoDB tables) cannot be easily rolled back; data is preserved.
To protect data resources from accidental deletion during such a rollback, use stack policies.
Rolling Back Configuration
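If Lambda environment variables are misconfigured, restore the previous values. Because --environment replaces the whole variable set, supply every variable the function needs: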
aws lambda update-function-configuration \
--function-name dag-launcher \
--environment "Variables={DAGY_ECS_CLUSTER_ARN=arn:aws:ecs:us-east-1:123456789:cluster/dagy-old}"
Store critical configuration in Parameter Store with version history for easy rollback:
aws ssm put-parameter \
--name /dagy/ecs/cluster-arn \
--value "arn:aws:ecs:us-east-1:123456789:cluster/dagy" \
--overwrite
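Parameter Store keeps earlier versions of each parameter, so a rollback amounts to re-publishing an old value as the newest version. The sketch below is one way to do that, under stated assumptions: rollback_parameter is a hypothetical helper, and ssm is assumed to be a boto3 SSM client. It pages through get_parameter_history, finds the requested version, and writes its value back with put_parameter.

```python
from typing import Any, Dict, List


def rollback_parameter(ssm: Any, name: str, version: int) -> str:
    """Re-publish the value of an earlier version; returns the restored value."""
    history: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"Name": name, "WithDecryption": True}
    while True:
        page = ssm.get_parameter_history(**kwargs)
        history.extend(page["Parameters"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    for entry in history:
        if entry["Version"] == version:
            # Writing the old value creates a new version; history is kept.
            ssm.put_parameter(Name=name, Value=entry["Value"],
                              Type=entry["Type"], Overwrite=True)
            return entry["Value"]
    raise LookupError(f"{name} has no version {version}")
```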
Disaster Recovery
Data Loss Prevention
DynamoDB tables are critical for run tracking. Enable point-in-time recovery:
aws dynamodb update-continuous-backups \
--table-name DAGY_RUNS \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
aws dynamodb update-continuous-backups \
--table-name DAGY_FLOWS \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
This allows recovery of deleted or corrupted records within the last 35 days.
Artifacts in S3 should be versioned:
aws s3api put-bucket-versioning \
--bucket dagy-artifacts \
--versioning-configuration Status=Enabled
Set a lifecycle policy to delete old versions after 90 days to control costs:
aws s3api put-bucket-lifecycle-configuration \
--bucket dagy-artifacts \
--lifecycle-configuration '{
"Rules": [
{
"ID": "delete-old-versions",
"Status": "Enabled",
"Filter": {"Prefix": ""},
"NoncurrentVersionExpiration": {"NoncurrentDays": 90}
}
]
}'
Backup and Restore
Export DynamoDB table data regularly:
aws dynamodb export-table-to-point-in-time \
--table-arn arn:aws:dynamodb:us-east-1:123456789:table/DAGY_RUNS \
--s3-bucket dagy-backups \
--s3-prefix runs-backup/
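This requires point-in-time recovery to be enabled on the table. The export is written in DynamoDB JSON format by default (Amazon Ion is available with --export-format ION) and can be queried with Athena or imported into a new table if needed.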
Incident Response
In case of widespread service failure:
- Check AWS status page for regional outages
- Verify IAM roles and policies haven't been modified
- Check DynamoDB table capacity for throttling (burst capacity may be exhausted)
- Verify S3 buckets are accessible and have not been accidentally deleted
- Check CloudWatch alarms for recent errors
- Review Lambda and ECS logs for exceptions
- If necessary, trigger a manual reconciliation to catch any stale states
- Post-incident, review logs and increase monitoring for the identified failure mode
This runbook covers the most common operational tasks. For additional support, consult the architecture document, user guide, or AWS documentation for ECS, Lambda, and DynamoDB.