Operations

ECS Fargate Operations Runbook

Deployment Notes

CDK Deployment

The Dagy ECS infrastructure is defined in infrastructure/dagy_stack.py using AWS CDK. Before deploying, configure your stack settings in the StackConfig dataclass.

To deploy the stack:

cd infrastructure
pip install -r requirements.txt
cdk synth  # Synthesize CloudFormation template
cdk deploy --require-approval=never  # Deploy to your AWS account

The CDK deployment provisions the following key resources:

ECS Fargate cluster with Fargate and Fargate Spot capacity providers
DynamoDB tables (DAGY_RUNS, DAGY_TASK_RUNS, DAGY_FLOWS, DAGY_DEPLOYMENTS, DAGY_DEP_PACKAGES)
S3 buckets for artifacts and exception traces
Lambda functions for dag_launcher_handler and reconciler_handler
EventBridge rules for scheduled reconciliation
CloudWatch log groups
IAM roles and policies for proper privilege separation

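The stack outputs include environment variable values needed by Lambda and ECS. CDK prints outputs as StackName.OutputName = value, which a shell cannot source directly, so write them to a JSON file and convert that file instead. The exported names are the CDK output keys, which may differ from the DAGY_* variable names:

cdk deploy --require-approval=never --outputs-file cdk-outputs.json
jq -r 'to_entries[].value | to_entries[] | "export \(.key)=\(.value | @sh)"' cdk-outputs.json > deployment.env
source deployment.env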

Image Publishing

The dagy-worker Docker image must be built and pushed to ECR before ECS tasks can run.

Build the image:

docker build -f Dockerfile.worker -t dagy-worker:latest .
docker tag dagy-worker:latest dagy-worker:$(git rev-parse --short HEAD)  # Tag with commit hash

Authenticate Docker to ECR:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com

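Tag the image for the ECR repository and push it (docker push only works on a tag that includes the registry host):

REPO="$(aws sts get-caller-identity --query Account --output text).dkr.ecr.us-east-1.amazonaws.com/dagy-worker"
docker tag dagy-worker:latest "${REPO}:latest" && docker push "${REPO}:latest"
docker tag dagy-worker:latest "${REPO}:$(git rev-parse --short HEAD)" && docker push "${REPO}:$(git rev-parse --short HEAD)"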

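Update the Lambda environment variable DAGY_ECS_WORKER_IMAGE to point to the new image. Note that --environment replaces the function's entire variable map, so a call that lists only one variable drops all the others; include the existing variables in the same Variables={...} list: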

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=us-east-1
IMAGE_URI="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/dagy-worker:latest"

aws lambda update-function-configuration \
  --function-name dag-launcher \
  --environment "Variables={DAGY_ECS_WORKER_IMAGE=${IMAGE_URI}}"
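
Because update-function-configuration overwrites the whole variable map, a read-merge-write step is safer than passing a single variable. The following is a minimal sketch, assuming a boto3 Lambda client is passed in; the helper name merge_lambda_environment is illustrative and not part of Dagy:

```python
def merge_lambda_environment(lambda_client, function_name, updates):
    """Apply `updates` on top of the function's current environment variables."""
    current = lambda_client.get_function_configuration(FunctionName=function_name)
    # A function with no variables has no "Environment" key at all.
    variables = dict(current.get("Environment", {}).get("Variables", {}))
    variables.update(updates)
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   merge_lambda_environment(
#       boto3.client("lambda"), "dag-launcher",
#       {"DAGY_ECS_WORKER_IMAGE": image_uri},
#   )
```

Passing the client in keeps the helper easy to exercise with a stub before pointing it at a real function.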

For safe rollout, test the new image in staging before updating production. Consider using git tags for release versions (dagy-worker:v1.2.3) and test each version before promoting to latest.

Configuration Management

Critical configuration should be managed via AWS Systems Manager Parameter Store or Secrets Manager to avoid hardcoding in Lambda environment variables.

For parameters accessible at Lambda startup:

import boto3
ssm = boto3.client("ssm")

cluster_arn = ssm.get_parameter(
    Name="/dagy/ecs/cluster-arn",
    WithDecryption=False
)["Parameter"]["Value"]

For secrets (API keys, database passwords):

secrets_client = boto3.client("secretsmanager")

db_password = secrets_client.get_secret_value(
    SecretId="prod/database/password"
)["SecretString"]

Keep the number of Lambda environment variables reasonable. Use Parameter Store for frequently changing values and Secrets Manager for sensitive data.

Monitoring ECS Tasks

CloudWatch Metrics

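ECS publishes utilization metrics to CloudWatch under the AWS/ECS namespace. Task-count metrics additionally require Container Insights, which publishes under ECS/ContainerInsights. Key metrics for Dagy workloads:

CPUUtilization, MemoryUtilization (AWS/ECS): Resource usage, aggregated per service
RunningTaskCount, PendingTaskCount, DesiredTaskCount (ECS/ContainerInsights): Task launch progress
TaskFailure: Count of tasks that exited with a non-zero code. ECS does not publish this metric itself; it has to be emitted as a custom metric, for example from a log metric filter or from ECS task state change events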

Query metrics programmatically:

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch")

# Use a timezone-aware timestamp; datetime.utcnow() is deprecated since Python 3.12
end_time = datetime.datetime.now(datetime.timezone.utc)

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "ServiceName", "Value": "dagy-worker"},
        {"Name": "ClusterName", "Value": "dagy"}
    ],
    StartTime=end_time - datetime.timedelta(hours=1),
    EndTime=end_time,
    Period=300,  # 5-minute granularity
    Statistics=["Average", "Maximum"]
)

# Datapoints are returned in no guaranteed order, so sort them before printing
for datapoint in sorted(response["Datapoints"], key=lambda d: d["Timestamp"]):
    print(f"{datapoint['Timestamp']}: CPU {datapoint['Average']:.1f}%")

CloudWatch Logs and Insights

All ECS worker logs are written to /ecs/dagy-worker in JSON format, enabling structured querying.

Common CloudWatch Insights queries:

Count failed runs per flow (set the query time range to the last 24 hours when running it):

fields @timestamp, run_id, flow_name, error
| filter status = "FAILED"
| stats count() by flow_name

Show the distribution of run durations:

fields @timestamp, run_id, duration_seconds
| filter ispresent(duration_seconds)
| stats avg(duration_seconds), max(duration_seconds), pct(duration_seconds, 95) by flow_name

Identify tasks that ran out of memory:

fields @timestamp, run_id, task_name, message
| filter message like /OOM|out of memory|MemoryError/

Find the slowest task runs:

fields @timestamp, task_run_id, task_name, duration_seconds
| filter event_type = "task_end"
| sort duration_seconds desc
| limit 20

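Create a CloudWatch alarm for a high task failure count. The alarm below fires when more than 10 failures are recorded in each of two consecutive 5-minute periods; it compares a count, not a percentage. TaskFailure is not a built-in ECS metric, so point --namespace at wherever your custom metric is published:

aws cloudwatch put-metric-alarm \
  --alarm-name dagy-high-failure-rate \
  --alarm-description "Alert when more than 10 ECS tasks fail within 5 minutes" \
  --metric-name TaskFailure \
  --namespace AWS/ECS \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:dagy-alerts"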

Tracing and Correlation IDs

Every run has a correlation_id that ties together multiple logs and metrics. To trace a specific execution:

RUN_ID="550e8400-e29b-41d4-a716-446655440000"

aws logs get-log-events \
  --log-group-name /ecs/dagy-worker \
  --log-stream-name "${RUN_ID}" | jq '.events[].message | fromjson | {timestamp, event_type, message}'

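For batch launches, the parent correlation_id spans all child runs. start-query only returns a query ID, so fetch the results with get-query-results once the query has completed (date -d requires GNU date):

PARENT_CORRELATION="batch-12345"

QUERY_ID=$(aws logs start-query \
  --log-group-name /ecs/dagy-worker \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string "fields @timestamp, run_id, event_type, message | filter correlation_id like /${PARENT_CORRELATION}/" \
  --query queryId --output text)

sleep 5  # Give the query time to finish; repeat get-query-results while status is Running
aws logs get-query-results --query-id "${QUERY_ID}" --output json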
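
The same kind of query can be run from Python, where polling for completion is easier to express. This sketch assumes a boto3 CloudWatch Logs client is passed in; run_insights_query is a hypothetical helper name, and the polling interval and limit are arbitrary:

```python
import time


def run_insights_query(logs_client, log_group, query, start_epoch, end_epoch,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and return its rows as a list of dicts."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start_epoch),
        endTime=int(end_epoch),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {query_id} did not complete in time")


# Example call (requires boto3 and AWS credentials):
#   import boto3, time
#   rows = run_insights_query(
#       boto3.client("logs"), "/ecs/dagy-worker",
#       "fields @timestamp, run_id | filter correlation_id like /batch-12345/",
#       time.time() - 3600, time.time(),
#   )
```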

ECS Status Reconciler

The status reconciler is a Lambda function that periodically checks for ECS tasks that are stuck in RUNNING status and reconciles them against the actual ECS task state.

How It Works

The reconciler function reconcile_ecs_runs() in src/dagy_api/ecs/reconciler.py performs these steps:

Query DynamoDB for all runs with status=RUNNING and executor=ecs
Filter to runs older than stale_threshold_seconds (default 300 seconds / 5 minutes) to avoid racing recently-launched tasks
For each run, extract the external_id (task ARN) and call describe_tasks()
If the ECS task has reached its terminal state (lastStatus STOPPED), check the container exit code and update the run status accordingly
Return a summary of how many runs were checked and updated

The reconciler is idempotent; running it multiple times on the same data produces the same result.
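
The steps above can be sketched roughly as follows. This is not the code in src/dagy_api/ecs/reconciler.py: the run dictionary keys (run_id, external_id, started_at_epoch), the function names, and the choice to mark runs whose task ECS no longer reports as FAILED are all assumptions made for illustration:

```python
def status_from_ecs_task(task):
    """Map one describe_tasks() task entry to a run status, or None if still active."""
    if task.get("lastStatus") != "STOPPED":
        return None
    exit_codes = [c.get("exitCode") for c in task.get("containers", [])]
    # A missing exit code (container never started) counts as a failure.
    if exit_codes and all(code == 0 for code in exit_codes):
        return "SUCCEEDED"
    return "FAILED"


def reconcile_sketch(runs, ecs_client, cluster, now_epoch,
                     stale_threshold_seconds=300, max_runs=50):
    """Return {run_id: new_status} for stale RUNNING runs whose task has stopped."""
    stale = [r for r in runs
             if now_epoch - r["started_at_epoch"] >= stale_threshold_seconds]
    stale = stale[:max_runs]
    updates = {}
    for start in range(0, len(stale), 100):  # describe_tasks accepts at most 100 ARNs
        batch = stale[start:start + 100]
        response = ecs_client.describe_tasks(
            cluster=cluster, tasks=[r["external_id"] for r in batch])
        by_arn = {t["taskArn"]: t for t in response.get("tasks", [])}
        for run in batch:
            task = by_arn.get(run["external_id"])
            # Assumption: a task ECS no longer reports is treated as failed.
            status = "FAILED" if task is None else status_from_ecs_task(task)
            if status is not None:
                updates[run["run_id"]] = status
    return updates
```

Because the function only derives a status from the current ECS state, calling it twice on the same input yields the same updates, which matches the idempotency property described above.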

Scheduling the Reconciler

Use EventBridge to invoke the reconciler on a schedule:

aws events put-rule \
  --name dagy-reconciler \
  --schedule-expression "rate(5 minutes)" \
  --state ENABLED

aws events put-targets \
  --rule dagy-reconciler \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789:function:ecs-reconciler","RoleArn"="arn:aws:iam::123456789:role/EventBridgeInvokeRole"

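Configure the input event to customize behavior. Re-using the same target Id replaces the existing target rather than adding a second one, and the JSON form of --targets avoids quoting problems with the embedded Input document:

aws events put-targets \
  --rule dagy-reconciler \
  --targets '[{"Id":"1","Arn":"arn:aws:lambda:us-east-1:123456789:function:ecs-reconciler","Input":"{\"max_runs\":100,\"stale_threshold_seconds\":300}"}]'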

Parameters:

max_runs: Maximum number of runs to check per invocation (default 50). Higher values check more, but take longer.
stale_threshold_seconds: Age threshold in seconds (default 300). Only runs older than this are reconciled.

A typical configuration checks every 5 minutes with a 5-minute staleness threshold, catching task failures within 10 minutes.

Monitoring the Reconciler

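Query the reconciler Lambda logs to monitor its health. get-log-events needs a specific log stream, so use filter-log-events to search the whole log group:

aws logs filter-log-events \
  --log-group-name /aws/lambda/ecs-reconciler \
  --start-time $(date -d '1 hour ago' +%s)000 \
  --filter-pattern '"Reconciliation complete"' \
  --query 'events[].message' | jq -r '.[] | select(contains("Reconciliation complete")) | fromjson'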

Set up an alarm for reconciler failures:

aws cloudwatch put-metric-alarm \
  --alarm-name dagy-reconciler-failure \
  --alarm-description "Alert when reconciler Lambda fails" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions "Name"="FunctionName","Value"="ecs-reconciler" \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:dagy-alerts"

Troubleshooting Common Failures

Task Launch Failures

Symptom: DAG launch succeeds (returns run_id), but the run immediately transitions to FAILED status with error "ECS launch failed".

Causes and remedies:

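Insufficient capacity: Fargate (most often Fargate Spot) has no capacity available for the request, or an account quota has been reached.
- Check the cluster configuration: aws ecs describe-clusters --clusters dagy
- Check the capacityProviders and defaultCapacityProviderStrategy fields
- Fargate has no instances to scale up; retry later, fall back from FARGATE_SPOT to FARGATE, or adjust the launch request to use smaller resources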
Invalid CPU/memory combination: The requested combination is not valid for Fargate.
- Verify the cpu and memory are in the WORKLOAD_PROFILES list or match valid combinations
- Recall that 256 CPU only supports 512/1024/2048 MB memory; larger CPU values support higher memory
Security group or subnet misconfiguration: The cluster subnets or security groups are misconfigured.
- Verify subnets exist and are active: aws ec2 describe-subnets --subnet-ids <subnet-id>
- Verify security groups exist and allow outbound HTTPS: aws ec2 describe-security-groups --group-ids <sg-id>
- Check that security groups allow outbound to S3, DynamoDB, CloudWatch Logs
IAM role ARNs are incorrect: The task execution role or task role ARN is invalid or doesn't exist.
- Verify the role exists: aws iam get-role --role-name ecs-exec-role
- Verify the role has permissions for ECS task execution
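
A pre-launch check for the CPU/memory rule could look like the sketch below. The table reflects the Fargate task-size combinations as documented by AWS at the time of writing; the function name is hypothetical, and WORKLOAD_PROFILES itself is not reproduced here:

```python
# CPU units -> allowed memory values in MiB for Fargate task sizes.
FARGATE_MEMORY_MIB = {
    256: (512, 1024, 2048),
    512: tuple(range(1024, 4096 + 1, 1024)),
    1024: tuple(range(2048, 8192 + 1, 1024)),
    2048: tuple(range(4096, 16384 + 1, 1024)),
    4096: tuple(range(8192, 30720 + 1, 1024)),
    8192: tuple(range(16384, 61440 + 1, 4096)),
    16384: tuple(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu, memory_mib):
    """Return True if (cpu, memory_mib) is a combination Fargate accepts."""
    return int(memory_mib) in FARGATE_MEMORY_MIB.get(int(cpu), ())


if __name__ == "__main__":
    for cpu, memory in [(256, 512), (256, 4096), (1024, 8192)]:
        print(cpu, memory, is_valid_fargate_size(cpu, memory))
```

Accepting strings as well as integers is deliberate, since launch requests such as memory="8192" pass sizes as strings.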

Out of Memory (OOM) Kills

Symptom: Task status transitions to FAILED with error "Container was killed" or "Task stopped due to memory exhaustion".

Causes and remedies:

Insufficient memory for the workload: The flow loads too much data into memory.
- Increase memory: launcher.launch(..., workload_profile="xlarge")
- Rewrite the flow to use streaming or chunking instead of loading all data at once
- Check which task consumes the most memory by examining logs for task-specific memory usage
Memory leak in custom code: A task holds references to growing data structures.
- Profile the code with tools like memory_profiler to identify allocations
- Use generators instead of building entire lists in memory
- Explicitly delete large objects when no longer needed
Dependency packages are too large: Unpacking all dependencies exceeds available memory.
- Reduce the size of dependency packages by excluding test files, documentation, etc.
- Use a separate, slimmer dependency package for production
- Consider pre-installing common dependencies in the Docker image instead of as runtime packages

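Mitigation: Increase memory, and raise ephemeral_storage_gib alongside it when the flow also spills large intermediate data to disk (extra ephemeral storage does not relieve memory pressure by itself):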

launcher.launch(
    flow_name="memory_intensive",
    org_id="acme-corp",
    memory="8192",
    ephemeral_storage_gib=100,
)

Monitor memory usage in logs. The ECS container runtime emits memory statistics that can be extracted via CloudWatch Insights.

Artifact Download Errors

Symptom: Task logs show "Downloading flow artifact" followed by an error like "NoSuchKey" or "AccessDenied".

Causes and remedies:

Artifact S3 URI is invalid or points to non-existent object:
- Verify the artifact exists: aws s3 ls s3://artifact-bucket/path/to/artifact.zip
- Verify the flow record in DynamoDB has the correct artifact_s3_uri
- Re-upload the artifact if it was deleted
Task role lacks S3 permissions:
- Verify the ECS task role has s3:GetObject permission on the artifact bucket
- Check the IAM policy: aws iam get-role-policy --role-name ecs-task-role --policy-name dagy-s3-access
- Ensure the policy includes the artifact bucket ARN
S3 bucket policy blocks access:
- Check the bucket policy: aws s3api get-bucket-policy --bucket artifact-bucket
- Ensure the policy allows the ECS task role to get objects
- If using bucket encryption, ensure the KMS key allows the task role to decrypt
Network connectivity issues:
- Verify the ECS subnet is public and routes to the internet gateway
- Verify the task was assigned a public IP
- Check security group egress rules allow HTTPS to S3

Dependency Package Installation Failures

Symptom: Task logs show "Installing dependency package" followed by an error like "pip install failed" or "Extract failed".

Causes and remedies:

Dependency package is corrupted or missing:
- Verify the package exists in S3: aws s3 ls s3://artifact-bucket/dep-packages/
- Try to extract the package locally: unzip package.zip or tar -tzf package.tar.gz
- Re-create and re-upload the package
requirements.txt references packages not available in PyPI:
- Check the requirements.txt file in the package
- Ensure all packages are installable with pip (private repositories need credentials)
- Pre-build and package wheels for faster installation
Native extension compilation fails:
- Some packages have C extensions that require build tools
- The Dockerfile.worker includes gcc and build essentials, but some packages may need additional system libraries
- Either pre-build the wheels or add system dependencies to the Dockerfile
Insufficient disk space:
- pip caches downloaded packages; with large dependency packages, this can fill the 21 GiB ephemeral storage
- Override ephemeral_storage_gib to 50+ GiB for large packages
- Or, pre-install common packages in the Docker image to reduce package size

Dependency Resolution and Import Errors

Symptom: Task logs show "Could not resolve callable for task" or "ImportError: No module named 'custom_module'".

Causes and remedies:

Dependency package not included in deployment:
- Verify the deployment_name is correct and the deployment includes the needed dep_package_slugs
- Verify the dep_package_slug is registered and has a valid package_s3_uri
Dependency package structure is incorrect:
- Dependency packages should be ZIP files with the top-level directory as the package name
- Ensure __init__.py exists in package directories
- Test extraction locally: unzip package.zip and check the structure
Custom module import path is wrong:
- Verify the import_path in the FlowSpec matches the actual module structure
- If using a custom module from a dependency package, the package must be in sys.path
Flow source file is missing or unpacked incorrectly:
- The artifact should contain the flow's .py files
- Verify artifact integrity: unzip -t artifact.zip
- Re-package the flow if files are missing

Debug: Enable verbose logging to trace module resolution:

launcher.launch(
    flow_name="problematic_flow",
    org_id="acme-corp",
    extra_env_vars={"DAGY_LOG_LEVEL": "DEBUG"},
)

Task Execution Timeouts

Symptom: Task runs successfully for a while, then transitions to FAILED with "timeout" or "exceeded timeout".

Causes and remedies:

Task timeout is too short for the workload:
- Check the TaskSpec timeout_seconds in the flow definition
- Increase it for long-running tasks: update the flow and re-package the artifact
- Note that ECS has no hard timeout; tasks can run indefinitely
Specific task has no retries and fails sporadically:
- The task may be timing out due to external dependency slowness (database, API calls)
- Configure retries in the TaskSpec: @task(retries=3, retry_delay_seconds=5)
- Add exponential backoff or jitter to avoid thundering herd
Upstream dependency failure blocks progress:
- A task upstream timed out, and downstream tasks are skipped
- Trace the log to find which task timed out first
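
Where a task calls a slow external dependency, the backoff-with-jitter idea can also be applied inside the task body. A minimal sketch, assuming nothing about the @task decorator beyond what is shown above; call_with_retries, backoff_delay, and their parameters are illustrative names:

```python
import random
import time


def backoff_delay(attempt, base_seconds=5.0, cap_seconds=60.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap_seconds, base_seconds * (2 ** attempt))


def call_with_retries(func, retries=3, sleep=time.sleep, **backoff_kwargs):
    """Call func(); on an exception wait with jittered backoff, up to `retries` retries."""
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # Out of retries: surface the last error
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

Randomizing each delay spreads retries from many concurrent runs over time, so a recovering database or API is not hit by all of them at once.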

Stuck Runs in RUNNING Status

Symptom: A run remains in RUNNING status for hours, with no recent log entries. The ECS task is actually STOPPED.

Causes: The worker crashed without updating the run status (OOM kill, spot interruption, infrastructure failure). The reconciler should have caught and fixed this, but may not have run recently.

Remedies:

Manual reconciliation: Invoke the reconciler Lambda directly:

aws lambda invoke \
  --function-name ecs-reconciler \
  --payload '{"max_runs":100,"stale_threshold_seconds":0}' \
  /tmp/response.json

Check reconciler is running: Verify the EventBridge rule is enabled and the Lambda has execution permissions:
```
aws events describe-rule --name dagy-reconciler
aws logs tail /aws/lambda/ecs-reconciler --follow
```

Manually update the run: If urgent, directly update DynamoDB:

import datetime

from dagy_api.persistence.models import RunModel

run_id = "550e8400-e29b-41d4-a716-446655440000"  # ID of the stuck run
run = RunModel.get(run_id)
run.update(actions=[
    RunModel.status.set("FAILED"),
    RunModel.completed_at.set(datetime.datetime.utcnow().isoformat()),
    RunModel.error_message.set("Reconciled: Task was STOPPED"),
])

Executor Mismatch Failures

Symptom: Run transitions to FAILED with error "Requested backend 'ecs' is not available. Ensure DAGY_ECS_CLUSTER_ARN is set."

Cause: A user selected "ecs" as the executor in the UI or API, but the ECS backend is not configured in the environment. The system now fails explicitly instead of silently falling back to Lambda.

Remedies:

Verify ECS environment variables: Ensure DAGY_ECS_CLUSTER_ARN, DAGY_ECS_EXECUTION_ROLE_ARN, DAGY_ECS_TASK_ROLE_ARN, DAGY_ECS_SUBNETS, and DAGY_ECS_SECURITY_GROUPS are all set in the Lambda environment.
Check CDK deployment: Run cdk diff to confirm all ECS resources were deployed.
API-level validation: The API now validates executor availability at request time and returns HTTP 400 if the requested executor is not registered. Check the API response for details.

Graceful Shutdown (SIGTERM)

When ECS stops a task (via stop_task(), scaling events, or deployments), the container receives a SIGTERM signal before SIGKILL. The worker handles this gracefully:

Sets a shutdown flag immediately on SIGTERM
Waits up to 15 seconds for currently inflight tasks to complete
Marks all remaining pending tasks as CANCELLED in DynamoDB
Raises RuntimeError to trigger the FAILED status path
The run status becomes FAILED with error "Worker shutdown requested via SIGTERM"

Important: The ECS default stop timeout is 30 seconds, and the worker's 15-second drain window fits within it. If you need a longer drain time, increase stopTimeout in the task definition (Fargate allows at most 120 seconds).
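
The drain sequence above can be sketched as follows. This is an illustration rather than the worker's actual code: the ShutdownController class, the use of threads for in-flight tasks, and the mark_cancelled callback (standing in for the DynamoDB update) are assumptions:

```python
import signal
import threading
import time


class ShutdownController:
    def __init__(self, drain_seconds=15.0):
        self.drain_seconds = drain_seconds
        self.requested = threading.Event()

    def install(self):
        """Register the SIGTERM handler; must be called from the main thread."""
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        # Only set a flag here; the scheduler loop checks it and calls drain().
        self.requested.set()

    def drain(self, inflight, pending, mark_cancelled):
        """Wait for in-flight threads, cancel pending tasks, then fail the run."""
        deadline = time.monotonic() + self.drain_seconds
        for thread in inflight:
            # Share one deadline across all threads instead of 15 s per thread.
            thread.join(timeout=max(0.0, deadline - time.monotonic()))
        for task_id in pending:
            mark_cancelled(task_id)
        raise RuntimeError("Worker shutdown requested via SIGTERM")
```

Keeping the signal handler to a single flag update avoids doing I/O inside the handler; the slower work happens in drain(), which the main loop invokes once it sees the flag.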

Best practice: Design tasks to be idempotent so that a SIGTERM-interrupted run can be safely retried.

Retry Executor Preservation

When a run is retried, the system now preserves the original executor choice. If a run originally executed on ECS and is retried, the retry will also route to ECS (not fall back to Lambda).

If the executor is no longer available at retry time, the run will fail with an explicit error message rather than silently running on a different backend.

Scaling Considerations

Concurrent Task Limits

ECS clusters have a concurrent task limit based on capacity. The default configuration uses Fargate and Fargate Spot:

Fargate capacity is unlimited within AWS account quotas (default 1000 concurrent tasks)
Fargate Spot capacity is also unlimited but subject to interruption

To increase limits, request a quota increase through Service Quotas or AWS Support.

Check current running task count:

aws ecs list-tasks --cluster dagy | jq '.taskArns | length'

Cluster Scaling Strategy

For predictable workloads, use Fargate. For batch workloads that tolerate interruption, use Fargate Spot to save costs.

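Configure capacity provider preferences. The cluster already exists after the CDK deployment, so update it with put-cluster-capacity-providers rather than create-cluster:

aws ecs put-cluster-capacity-providers \
  --cluster dagy \
  --capacity-providers FARGATE FARGATE_SPOT \
  --default-capacity-provider-strategy capacityProvider=FARGATE_SPOT,weight=70 capacityProvider=FARGATE,weight=30

This routes roughly 70% of tasks to Spot and 30% to on-demand Fargate for launches that do not specify their own strategy.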

For mission-critical DAGs that cannot tolerate interruption, override the default:

launcher.launch(
    flow_name="critical_flow",
    org_id="acme-corp",
    extra_env_vars={"ECS_CAPACITY_PROVIDER": "FARGATE"},  # Custom, requires EcsBackend modification
)

Cost Optimization

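Right-size resource profiles: Use the smallest profile that fits the workload. The figures below are approximate on-demand prices for us-east-1 (Linux/x86); check current AWS pricing before relying on them.
- Profile small (256 CPU / 512 MB): ~$0.012/hour
- Profile medium (1024 CPU / 2048 MB): ~$0.049/hour
- Profile large (2048 CPU / 4096 MB): ~$0.099/hour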
Use Fargate Spot for batch workloads: Saves up to 70% on compute.
- Suitable for workflows that can be retried
- Not suitable for real-time or SLA-critical flows
Pre-build Docker image with common dependencies: Reduces package download and installation time.
- Speeds up task startup
- Reduces ephemeral storage and memory usage
Keep the NAT-free network path: Fargate tasks use public IPs with a no-ingress security group, while Lambdas use AWS service networking.
- Do not add inbound rules unless a worker explicitly serves network traffic.
- Consider VPC endpoints only if workers later move back to private subnets.
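
Hourly Fargate cost scales linearly with vCPU and memory, so a profile's price can be estimated from two rates. The rates in the sketch below are assumptions (approximate us-east-1 Linux/x86 on-demand prices), as is the function name; substitute the current values from the AWS pricing page:

```python
# Assumed rates in USD; verify against current AWS Fargate pricing.
VCPU_HOUR_USD = 0.04048
GB_HOUR_USD = 0.004445


def fargate_hourly_cost(cpu_units, memory_mib,
                        vcpu_hour_usd=VCPU_HOUR_USD, gb_hour_usd=GB_HOUR_USD):
    """Estimate on-demand cost per hour for a task size given in CPU units and MiB."""
    vcpus = cpu_units / 1024       # 1024 CPU units = 1 vCPU
    memory_gb = memory_mib / 1024
    return vcpus * vcpu_hour_usd + memory_gb * gb_hour_usd


if __name__ == "__main__":
    for name, cpu, memory in [("small", 256, 512),
                              ("medium", 1024, 2048),
                              ("large", 2048, 4096)]:
        print(f"{name}: ${fargate_hourly_cost(cpu, memory):.4f}/hour")
```

Multiplying the result by a run's duration in hours gives a per-run estimate that can be compared across profiles.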

Monitor per-flow costs: Track which flows consume the most resources.

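from dagy_api.persistence.models import RunModel

runs = RunModel.query("flow_name")  # Replace "flow_name" with the name of the flow to inspect
# Treat a missing cost estimate as zero so the sum does not fail on None
total_cost = sum(run.estimated_cost_usd or 0 for run in runs if run.status == "SUCCEEDED")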

Rollback Procedures

Rolling Back Container Image

If a new dagy-worker image is broken, revert to the previous image:

# Tag the previous image as latest
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=us-east-1
REPO="${ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/dagy-worker"

docker pull ${REPO}:v1.0.0  # Previous working version
docker tag ${REPO}:v1.0.0 ${REPO}:latest
docker push ${REPO}:latest

# Update Lambda environment variable
aws lambda update-function-configuration \
  --function-name dag-launcher \
  --environment "Variables={DAGY_ECS_WORKER_IMAGE=${REPO}:latest}"

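New task launches will use the reverted image. Tasks that are already running continue with the image they started with. Note that --environment replaces the function's full variable map, so include the other DAGY_* variables in the same call.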

Rolling Back Infrastructure Changes

If a CDK deployment introduces a breaking change, revert using git and redeploy:

cd infrastructure
git checkout HEAD~1  # Revert to previous commit
cdk deploy --require-approval=never

CDK will diff the template and apply only the necessary changes to revert. Some resources (e.g., DynamoDB tables) cannot be easily rolled back; data is preserved.

To protect critical data resources, use stack policies (or a RETAIN removal policy) to prevent accidental deletion.

Rolling Back Configuration

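If Lambda environment variables are misconfigured, reapply the last known-good set. The --environment flag replaces the whole variable map, so list every variable the function needs, not only the one being corrected: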

aws lambda update-function-configuration \
  --function-name dag-launcher \
  --environment "Variables={DAGY_ECS_CLUSTER_ARN=arn:aws:ecs:us-east-1:123456789:cluster/dagy-old}"

Store critical configuration in Parameter Store with version history for easy rollback:

aws ssm put-parameter \
  --name /dagy/ecs/cluster-arn \
  --type String \
  --value "arn:aws:ecs:us-east-1:123456789:cluster/dagy" \
  --overwrite
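
Rolling back then amounts to re-writing an earlier version's value. A minimal sketch, assuming a boto3 SSM client is passed in and the parameter is a plain String (SecureString values would need WithDecryption=True); parameter_versions and rollback_parameter are hypothetical helpers:

```python
def parameter_versions(ssm_client, name):
    """Collect every stored version of a parameter, following pagination."""
    versions, token = [], None
    while True:
        kwargs = {"Name": name}
        if token:
            kwargs["NextToken"] = token
        page = ssm_client.get_parameter_history(**kwargs)
        versions.extend(page["Parameters"])
        token = page.get("NextToken")
        if not token:
            return versions


def rollback_parameter(ssm_client, name, target_version):
    """Write the value of `target_version` back as the newest version."""
    for entry in parameter_versions(ssm_client, name):
        if entry["Version"] == target_version:
            ssm_client.put_parameter(
                Name=name, Value=entry["Value"], Type=entry["Type"], Overwrite=True)
            return entry["Value"]
    raise LookupError(f"{name} has no version {target_version}")


# Example call (requires boto3 and AWS credentials):
#   import boto3
#   rollback_parameter(boto3.client("ssm"), "/dagy/ecs/cluster-arn", 3)
```

The rollback creates a new version rather than deleting newer ones, so the history of what was changed and when stays intact.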

Disaster Recovery

Data Loss Prevention

DynamoDB tables are critical for run tracking. Enable point-in-time recovery:

aws dynamodb update-continuous-backups \
  --table-name DAGY_RUNS \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

aws dynamodb update-continuous-backups \
  --table-name DAGY_FLOWS \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

This allows recovery of deleted or corrupted records within the last 35 days.

Artifacts in S3 should be versioned:

aws s3api put-bucket-versioning \
  --bucket dagy-artifacts \
  --versioning-configuration Status=Enabled

Set a lifecycle policy to delete old versions after 90 days to control costs:

aws s3api put-bucket-lifecycle-configuration \
  --bucket dagy-artifacts \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "delete-old-versions",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "NoncurrentVersionExpiration": {"NoncurrentDays": 90}
      }
    ]
  }'

Backup and Restore

Export DynamoDB table data regularly:

aws dynamodb export-table-to-point-in-time \
  --table-arn arn:aws:dynamodb:us-east-1:123456789:table/DAGY_RUNS \
  --s3-bucket dagy-backups \
  --s3-prefix runs-backup/

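This writes an export in DynamoDB JSON format by default (Amazon Ion is available via --export-format ION). The export requires point-in-time recovery to be enabled on the table, and the result can be queried with Athena or imported into a new table if needed.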

Incident Response

In case of widespread service failure:

Check AWS status page for regional outages
Verify IAM roles and policies haven't been modified
Check DynamoDB table capacity (bursting may be exhausted)
Verify S3 buckets are accessible and have not been accidentally deleted
Check CloudWatch alarms for recent errors
Review Lambda and ECS logs for exceptions
If necessary, trigger a manual reconciliation to catch any stale states
Post-incident, review logs and increase monitoring for the identified failure mode

This runbook covers the most common operational tasks. For additional support, consult the architecture document, user guide, or AWS documentation for ECS, Lambda, and DynamoDB.