
Dagy Self-Hosted Deployment Guide

This guide provides comprehensive instructions for deploying Dagy, a DAG orchestration platform, in a self-hosted AWS environment. It covers infrastructure deployment, configuration, and operational best practices.

Prerequisites
Architecture Overview
Infrastructure Deployment
Environment Variables Reference
Frontend Deployment
Backend Configuration
Security Configuration
Monitoring & Observability
Scaling & Performance
Backup & Disaster Recovery
Upgrading
Troubleshooting

Prerequisites

AWS Account Requirements

Active AWS account with appropriate IAM permissions
Access to AWS CloudFormation, Lambda, S3, SQS, EC2, ECS, and IAM services
EC2 key pair created (for ECS cluster access if needed)

Local Development Environment

Node.js 18+ (for frontend builds)
Python 3.11+ (for backend and CDK)
AWS CLI v2 configured with appropriate credentials

AWS CDK CLI v2+:

npm install -g aws-cdk
cdk --version  # Should be v2.x.x or higher

Docker 20.10+ (for building Lambda container images)
```
docker --version
docker login  # Optional (Docker Hub); CDK authenticates to ECR itself when publishing assets
```
Git for cloning the repository

uv (Python package manager - recommended):

curl -LsSf https://astral.sh/uv/install.sh | sh

External Services

Clerk Account (for frontend authentication)
- Create account at https://clerk.com
- Obtain API keys and webhook signing secret
Stripe Account (optional, for billing features)
- Create account at https://stripe.com
- Obtain publishable and secret keys, plus product price IDs

AWS IAM Permissions

The IAM user deploying Dagy needs these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "apigateway:*",
        "cloudformation:*",
        "dynamodb:*",
        "ec2:*",
        "ecr:*",
        "ecs:*",
        "events:*",
        "iam:*",
        "lambda:*",
        "logs:*",
        "s3:*",
        "sns:*",
        "sqs:*",
        "ssm:*",
        "states:*"
      ],
      "Resource": "*"
    }
  ]
}

Architecture Overview

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                        Frontend (Next.js)                    │
│         Hosted on Vercel, CloudFront+S3, or ECS             │
│                    (Clerk Authentication)                    │
└────────────────────────┬────────────────────────────────────┘
                         │ HTTPS
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              API Gateway (HTTP) + JWT Authorizer             │
└────────────────────────┬────────────────────────────────────┘
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
   ┌─────────┐    ┌─────────────┐   ┌──────────┐
   │ Lambda  │    │ Step        │   │   ECS    │
   │  (API)  │    │ Functions   │   │ Fargate  │
   │ Runner  │    │  (Workflow) │   │ (Tasks)  │
   └────┬────┘    └──────┬──────┘   └────┬─────┘
        │                │               │
        └────────────────┼───────────────┘
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
   ┌─────────┐    ┌──────────────┐  ┌─────────┐
   │Database │    │      S3      │  │   SQS   │
   │ (21 tbl)│    │ (Artifacts)  │  │ (Events)│
   └─────────┘    └──────────────┘  └─────────┘

Components Overview

Component	Purpose	Technology
Frontend	Web UI for DAG management, monitoring, scheduling	Next.js 14+, Clerk Auth
API Layer	REST API for flows, runs, deployments, scheduling	FastAPI + Mangum on Lambda
Lambda Runner	Default execution backend for tasks	Python 3.12 Lambda Runtime
Step Functions	Workflow state machine execution	AWS Step Functions
ECS Fargate	Long-running task execution	ECS on Fargate
Database	Core data persistence (21 tables)	Managed
S3	Flow artifacts, logs, task outputs	S3 Buckets
SQS	Event queue for async task scheduling	SQS Standard Queue

Data Flow Examples

Flow Registration

User submits flow artifact → API Lambda validates →
  Store in S3 → Create FLOWS table entry →
  Return flow ID to user

Run Execution

Trigger run request → API Lambda creates RUN record →
  Enqueue to SQS → Execution backend polls SQS →
  Execute tasks → Update RUN/TASK_RUNS status →
  Store artifacts in S3 → Send completion event

Scheduling

Schedule created in UI → Store in SCHEDULES table →
  EventBridge rule triggers API Lambda →
  Create run via API → Follow normal run execution flow

Infrastructure Deployment

Step 1: Clone Repository and Install Dependencies

# Clone the Dagy repository
git clone https://github.com/dagy/dagy.git
cd dagy

# Install Python dependencies
uv sync --extra api

# Install Node dependencies (for frontend build, if deploying together)
cd web
npm install
cd ..

Step 2: Configure CDK Context Parameters

Create environment-specific configuration files in the infrastructure/ directory. The CDK app expects YAML files named after environments.

Example: `infrastructure/develop.yml`

---
environment: develop
region: us-east-1
app: dagy
owner: data
company: my-company
project_cost: engineering
aws_account: "123456789012"
python_version: "3.12"
dagy:
  # Optional VPC configuration (for private environments)
  # vpc_id: "vpc-12345678"
  # subnet_ids:
  #   - "subnet-12345678"
  #   - "subnet-87654321"
  # security_group_ids:
  #   - "sg-12345678"
  # JWT authentication
  # jwt_required: true
  # jwt_issuer: "https://your-auth-provider.com"
  # jwt_audience: "your-api-audience"
  # jwks_url: "https://your-auth-provider.com/.well-known/jwks.json"
  # CORS configuration
  api_cors_allowed_origins:
    - "http://localhost:3000"
    - "https://yourdomain.com"
  # Leave image URIs unset to use content-addressed CDK Docker assets.
  # Optional immutable pre-published image overrides:
  # lambda_image_uri: "123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-service-worker-develop@sha256:..."
  # ecr_repository_name: "dagy-service-worker-develop"
  # ecs_worker_image_uri: "123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-worker-develop@sha256:..."

Example: `infrastructure/production.yml`

---
environment: production
region: us-east-1
app: dagy
owner: data
company: my-company
project_cost: engineering
aws_account: "987654321098"
python_version: "3.12"
dagy:
  vpc_id: "vpc-prod123456"
  subnet_ids:
    - "subnet-prod111111"
    - "subnet-prod222222"
  security_group_ids:
    - "sg-prod123456"
  jwt_required: true
  jwt_issuer: "https://your-auth-provider.com"
  jwt_audience: "dagy-api"
  jwks_url: "https://your-auth-provider.com/.well-known/jwks.json"
  api_cors_allowed_origins:
    - "https://dagy.yourdomain.com"
    - "https://api.dagy.yourdomain.com"

Step 3: Docker Image Assets

The Lambda and ECS worker images are native CDK Docker image assets. CDK hashes the Dockerfiles and src/ build context, then builds and publishes only assets that are not already present in the target account's bootstrap ECR repository.

# No separate docker build or publish command is required.
# Continue directly to the CDK deployment step below.

Docker must be running when a changed asset needs to be built. The Lambda image is based on public.ecr.aws/lambda/python:3.12 and includes the API runtime dependencies.

Step 4: Bootstrap and Deploy CDK Stack

# Bootstrap CDK (one-time setup per AWS account/region)
cd infrastructure
cdk bootstrap aws://ACCOUNT-ID/REGION \
  --profile your-aws-profile

# Example:
cdk bootstrap aws://123456789012/us-east-1 --profile default

# Deploy the stack
cdk deploy -c environment=develop \
  --require-approval never \
  --profile your-aws-profile

# Or with explicit parameters:
cdk deploy -c environment=develop \
  -c region=us-east-1 \
  -c account=123456789012 \
  -c aws_profile=default \
  --require-approval never

The deployment process will:

Validate environment configuration
Fingerprint the Lambda and ECS worker Docker build contexts
Build and publish only changed image assets
Create/update the CloudFormation stack with immutable image URIs
Output stack outputs with resource names and endpoints
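
The first step, validating the environment configuration, can be pictured with a small check. This is only a sketch: `validate_config` is a hypothetical helper, not part of the Dagy CDK app, and the rules it enforces (which top-level keys are mandatory, JWT fields required when `jwt_required` is true) are assumptions inferred from the example files above.

```python
# Hypothetical pre-deploy check; the key names come from the example YAML files,
# but which of them are mandatory is an assumption.
REQUIRED_TOP_LEVEL = ("environment", "region", "app", "aws_account")
JWT_FIELDS = ("jwt_issuer", "jwt_audience", "jwks_url")


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if not config.get(key):
            problems.append(f"missing top-level key: {key}")

    # Account IDs are quoted in the YAML so they stay strings.
    account = str(config.get("aws_account", ""))
    if account and not (account.isdigit() and len(account) == 12):
        problems.append("aws_account must be a 12-digit string")

    dagy = config.get("dagy") or {}
    if dagy.get("jwt_required"):
        for field in JWT_FIELDS:
            if not dagy.get(field):
                problems.append(f"jwt_required is true but dagy.{field} is unset")
    return problems
```

Running a check like this on the parsed YAML before `cdk deploy` surfaces configuration mistakes without waiting for a CloudFormation failure.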

Step 5: Collect Deployment Outputs

After successful deployment, the CDK outputs will include:

Outputs:
dagy-develop.APIEndpoint = https://xyz123.execute-api.us-east-1.amazonaws.com
dagy-develop.FlowsTableName = dagy-flows-develop
dagy-develop.DeploymentsTableName = dagy-deployments-develop
dagy-develop.RunsTableName = dagy-runs-develop
dagy-develop.TaskRunsTableName = dagy-task-runs-develop
dagy-develop.SchedulesTableName = dagy-schedules-develop
dagy-develop.UsersTableName = dagy-users-develop
dagy-develop.AccessTokensTableName = dagy-access-tokens-develop
dagy-develop.AccessLogsTableName = dagy-access-logs-develop
dagy-develop.OrganizationsTableName = dagy-organizations-develop
dagy-develop.MembershipsTableName = dagy-memberships-develop
dagy-develop.APIKeysTableName = dagy-api-keys-develop
dagy-develop.DAGDraftsTableName = dagy-dag-drafts-develop
dagy-develop.UsageEventsTableName = dagy-usage-events-develop
dagy-develop.UsageAggregatesTableName = dagy-usage-aggregates-develop
dagy-develop.SubscriptionsTableName = dagy-subscriptions-develop
dagy-develop.AuditLogsTableName = dagy-audit-logs-develop
dagy-develop.SecretsTableName = dagy-secrets-develop
dagy-develop.NotificationChannelsTableName = dagy-notification-channels-develop
dagy-develop.AlertRulesTableName = dagy-alert-rules-develop
dagy-develop.EnvironmentsTableName = dagy-environments-develop
dagy-develop.SensorsTableName = dagy-sensors-develop
dagy-develop.ArtifactBucketName = dagy-artifacts-123456789012-us-east-1-develop
dagy-develop.EventsQueueURL = https://sqs.us-east-1.amazonaws.com/123456789012/dagy-events-develop
dagy-develop.LambdaFunctionName = dagy-lambda-develop
dagy-develop.EventBridgeRuleArn = arn:aws:events:us-east-1:123456789012:rule/dagy-schedule-rule-develop

Save these outputs for the next configuration steps.

Environment Variables Reference

The Dagy API Lambda requires the following environment variables to be set. These are automatically configured by the CDK stack based on the resources it creates.

Tables (Core)

Variable	Description	Required
`DAGY_FLOWS`	Flows table name	Yes
`DAGY_DEPLOYMENTS`	Deployments table name	Yes
`DAGY_RUNS`	Flow runs table name	Yes
`DAGY_TASK_RUNS`	Task runs table name	Yes
`DAGY_SCHEDULES`	Schedules table name	Yes

Tables (Authentication & Access)

Variable	Description	Required
`DAGY_USERS`	Users table name	Yes
`DAGY_ACCESS_TOKENS`	Access tokens table name	Yes
`DAGY_ACCESS_LOGS`	Access logs for auditing	Yes

Tables (Organizations & Teams)

Variable	Description	Required
`DAGY_ORGANIZATIONS`	Organizations table name	Yes
`DAGY_MEMBERSHIPS`	Organization memberships	Yes
`DAGY_API_KEYS`	API keys for programmatic access	Yes

Tables (Flow Builder)

Variable	Description	Required
`DAGY_DAG_DRAFTS`	Unsaved DAG drafts	Yes

Tables (Billing & Usage)

Variable	Description	Required
`DAGY_USAGE_EVENTS`	Individual API call events	Yes
`DAGY_USAGE_AGGREGATES`	Aggregated usage metrics	Yes
`DAGY_SUBSCRIPTIONS`	Subscription/plan information	Yes

Tables (Enterprise Features)

Variable	Description	Required
`DAGY_AUDIT_LOGS`	Detailed audit trail	Yes
`DAGY_SECRETS`	Encrypted secrets storage	Yes
`DAGY_NOTIFICATION_CHANNELS`	Alert notification destinations	Yes
`DAGY_ALERT_RULES`	Alert rule definitions	Yes
`DAGY_ENVIRONMENTS`	Deployment environments	Yes
`DAGY_SENSORS`	Sensor/trigger configurations	Yes

Storage & Queue

Variable	Description	Required
`DAGY_ARTIFACT_BUCKET`	S3 bucket for flow artifacts	Yes
`DAGY_EVENTS_QUEUE_URL`	SQS queue URL for task events	Yes
`DAGY_FLOW_EXECUTOR_FUNCTION`	Lambda function name/ARN for the Flow Executor backend	No (required for `in-process`/`task-isolated` execution modes)

Secrets & Encryption

Variable	Description	Required
`DAGY_SECRETS_KEY`	Fernet key for encrypting secrets	No (required if secrets used)

Format: Base64-encoded Fernet key. Generate with:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
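
A Fernet key is the URL-safe base64 encoding of exactly 32 random bytes, so a malformed value can be caught before it reaches the Lambda. The helper below is a standard-library sketch; `is_valid_fernet_key` is a hypothetical name, not a Dagy or `cryptography` function.

```python
import base64
import binascii


def is_valid_fernet_key(key: str) -> bool:
    """Check that key is URL-safe base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.urlsafe_b64decode(key.encode("ascii"))
    except (binascii.Error, ValueError):
        # Bad padding or non-ASCII input.
        return False
    return len(raw) == 32
```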

JWT Authentication

Variable	Description	Required
`DAGY_JWT_REQUIRED`	Enable JWT authentication (`true`/`false`)	No (default: `false`)
`DAGY_JWT_ISSUER`	JWT issuer URL	Yes if JWT required
`DAGY_JWT_AUDIENCE`	JWT audience claim	Yes if JWT required
`DAGY_JWKS_URL`	JWKS endpoint URL	Yes if JWT required
`DAGY_ACCESS_TOKEN_TTL_SECONDS`	Token expiration in seconds	No (default: `86400` = 24 hours)

Stripe (Optional - For Billing)

Variable	Description	Required
`STRIPE_SECRET_KEY`	Stripe API secret key	No (required for billing)
`STRIPE_WEBHOOK_SECRET`	Stripe webhook signing secret	No (required for billing)
`STRIPE_PRICE_PRO`	Stripe price ID for Pro plan	No
`STRIPE_PRICE_ENTERPRISE`	Stripe price ID for Enterprise plan	No

Example Lambda Environment Variables Configuration

Via AWS Lambda console or CDK, set:

DAGY_FLOWS=dagy-flows-develop
DAGY_DEPLOYMENTS=dagy-deployments-develop
DAGY_RUNS=dagy-runs-develop
DAGY_TASK_RUNS=dagy-task-runs-develop
DAGY_SCHEDULES=dagy-schedules-develop
DAGY_USERS=dagy-users-develop
DAGY_ACCESS_TOKENS=dagy-access-tokens-develop
DAGY_ACCESS_LOGS=dagy-access-logs-develop
DAGY_ORGANIZATIONS=dagy-organizations-develop
DAGY_MEMBERSHIPS=dagy-memberships-develop
DAGY_API_KEYS=dagy-api-keys-develop
DAGY_DAG_DRAFTS=dagy-dag-drafts-develop
DAGY_USAGE_EVENTS=dagy-usage-events-develop
DAGY_USAGE_AGGREGATES=dagy-usage-aggregates-develop
DAGY_SUBSCRIPTIONS=dagy-subscriptions-develop
DAGY_AUDIT_LOGS=dagy-audit-logs-develop
DAGY_SECRETS=dagy-secrets-develop
DAGY_NOTIFICATION_CHANNELS=dagy-notification-channels-develop
DAGY_ALERT_RULES=dagy-alert-rules-develop
DAGY_ENVIRONMENTS=dagy-environments-develop
DAGY_SENSORS=dagy-sensors-develop
DAGY_ARTIFACT_BUCKET=dagy-artifacts-123456789012-us-east-1-develop
DAGY_EVENTS_QUEUE_URL=https://sqs.us-east-1.amazonaws.com/123456789012/dagy-events-develop
DAGY_SECRETS_KEY=<base64-fernet-key-here>
DAGY_JWT_REQUIRED=false
DAGY_ACCESS_TOKEN_TTL_SECONDS=86400

Frontend Deployment

Option 1: Deploy on Vercel (Recommended)

Vercel provides the easiest deployment path with automatic builds, edge caching, and HTTPS.

Prerequisites

Vercel account (https://vercel.com)
GitHub repository with Dagy code

Steps

Connect GitHub repository to Vercel
- Go to https://vercel.com/new
- Select your Dagy GitHub repository
- Vercel auto-detects it as a Next.js project

Configure Clerk environment variables

NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_live_xxxxx
CLERK_SECRET_KEY=sk_live_xxxxx
NEXT_PUBLIC_CLERK_SIGN_IN_URL=/sign-in
NEXT_PUBLIC_CLERK_SIGN_UP_URL=/sign-up
NEXT_PUBLIC_CLERK_AFTER_SIGN_IN_URL=/flows
NEXT_PUBLIC_CLERK_AFTER_SIGN_UP_URL=/flows

Configure API endpoint environment variables

NEXT_PUBLIC_API_URL=https://api.dagy.io/app

Deploy
- Click "Deploy"
- Vercel builds and deploys automatically on every push to main
- Get your frontend URL (e.g., https://dagy.vercel.app)

Option 2: CloudFront + S3 Deployment

For organizations preferring AWS-only solutions.

Prerequisites

AWS CloudFront and S3 setup
ACM certificate for domain

Steps

Build Next.js application
```
cd web
npm install
npm run build
```

Create S3 bucket for static exports

aws s3 mb s3://dagy-frontend-production --region us-east-1

# Enable static website hosting
aws s3api put-bucket-website \
  --bucket dagy-frontend-production \
  --website-configuration '{
    "IndexDocument": {"Suffix": "index.html"},
    "ErrorDocument": {"Key": "404.html"}
  }'

Upload built files

# Export the static site from Next.js.
# Next.js 14+ removed the `next export` command; with output: 'export' set in
# next.config.js, the build itself writes the static site to out/.
npm run build

# Sync to S3
aws s3 sync out/ s3://dagy-frontend-production/ --delete

Create CloudFront distribution

# Once a distribution with the S3 bucket as its origin exists,
# invalidate cached paths after each upload
aws cloudfront create-invalidation \
  --distribution-id E123ABC \
  --paths "/*"

Option 3: ECS Fargate Deployment

For full containerization within AWS.

Prerequisites

ECS cluster
ECR repository
ALB/NLB for routing

Steps

Build Docker image

cd web
docker build -t dagy-frontend:latest -f Dockerfile.prod .

Push to ECR

aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker tag dagy-frontend:latest \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-frontend:latest

docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-frontend:latest

Create ECS task definition with:
- Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-frontend:latest
- Port: 3000
- Environment variables for Clerk and API URL
Create ECS service with ALB target group

Clerk Configuration

Create Clerk application
- Go to https://dashboard.clerk.com
- Create new application
- Choose "Web" and "Next.js"
Get API keys
- Navigate to "API Keys"
- Copy "Publishable Key" and "Secret Key"
Configure allowed origins
- Go to "Domains"
- Add your frontend domain(s)
- Add API Gateway domain if using cross-origin auth
Setup webhooks (for user sync to database)
- Go to "Webhooks"
- Create webhook for user.created and user.deleted events
- Point to https://api.yourdomain.com/webhooks/clerk

Frontend Environment Variables

# Clerk Authentication
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_live_xxxxx
CLERK_SECRET_KEY=sk_live_xxxxx

# API Configuration
NEXT_PUBLIC_API_URL=https://api.dagy.io/app
NEXT_PUBLIC_API_VERSION=v1

# Clerk URLs
NEXT_PUBLIC_CLERK_SIGN_IN_URL=/sign-in
NEXT_PUBLIC_CLERK_SIGN_UP_URL=/sign-up
NEXT_PUBLIC_CLERK_AFTER_SIGN_IN_URL=/flows
NEXT_PUBLIC_CLERK_AFTER_SIGN_UP_URL=/flows

# Optional: Analytics, error tracking, etc.
NEXT_PUBLIC_SENTRY_DSN=https://xxxxx@sentry.io/xxxxx

CORS Configuration for API

If frontend and API are on different domains, configure CORS in CDK:

# In infrastructure/develop.yml
dagy:
  api_cors_allowed_origins:
    - "https://dagy.yourdomain.com"
    - "https://dagy.vercel.app"
    - "http://localhost:3000"  # For local development

Backend Configuration

Execution Backends

Dagy supports three execution backends for running tasks. Configure which backends are available in your environment.

1. Lambda Backend (Default)

Simplest option; no additional configuration needed beyond Lambda function permissions.

# In dagy_api/backends/lambda_backend.py
# Automatically invokes task functions as Lambda functions
# Default concurrency: 1000 (account limit)

Pros:

Zero infrastructure management
Automatic scaling
Pay-per-execution pricing

Cons:

15-minute timeout limit per execution
Limited to Lambda execution environment
Cold starts impact latency

Configuration:

Set EXECUTION_BACKEND=lambda (default)
No additional environment variables needed

2. Step Functions Backend

For complex workflow orchestration with state machines.

Pros:

Up to 1-year execution duration (Standard workflows)
Complex branching and retry logic
Visual workflow monitoring in AWS Console

Cons:

More expensive per execution
Requires separate state machine definition
Additional complexity

Setup:

Create IAM role for Step Functions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction",
        "states:StartExecution"
      ],
      "Resource": "*"
    }
  ]
}

Configure in CDK:

# infrastructure/develop.yml
dagy:
  step_functions_role_arn: "arn:aws:iam::123456789012:role/step-functions-role"

3. ECS Fargate Backend

For long-running tasks and custom dependencies. Fargate itself does not offer GPUs; GPU workloads need ECS on EC2 GPU instances.

Pros:

Full container control
GPU support only with the ECS EC2 launch type (Fargate has no GPU option)
Custom runtimes and libraries
15-hour task duration

Cons:

Requires ECS cluster management
Higher baseline costs
More operational overhead

Setup:

Create ECS cluster:

aws ecs create-cluster --cluster-name dagy-tasks

# Create CloudWatch log group
aws logs create-log-group --log-group-name /ecs/dagy-tasks

Create task definition:

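# The execution role (placeholder ARN) lets Fargate pull the image from ECR
# and write to CloudWatch Logs. --container-definitions expects a JSON array.
aws ecs register-task-definition \
  --family dagy-task-worker \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE \
  --cpu 256 \
  --memory 512 \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --container-definitions '[{
    "name": "task-worker",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-worker:latest",
    "essential": true,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/dagy-tasks",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]'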

Configure in CDK:

# infrastructure/develop.yml
dagy:
  ecs_cluster_name: "dagy-tasks"
  ecs_task_definition_arn: "arn:aws:ecs:us-east-1:123456789012:task-definition/dagy-task-worker:1"
  ecs_subnets:
    - "subnet-12345678"
  ecs_security_groups:
    - "sg-12345678"

Rate Limiting Configuration

Dagy includes token bucket rate limiting to prevent abuse.

Default configuration:

120 requests per minute per API key/IP
200 token burst capacity

Customize in src/dagy_api/app.py:

from dagy_api.rate_limit import RateLimitMiddleware

app.add_middleware(
    RateLimitMiddleware,
    requests_per_minute=120,  # Requests per minute
    burst_size=200             # Max burst tokens
)

Rate limit headers in responses:

X-RateLimit-Limit: 120
X-RateLimit-Remaining: 45
Retry-After: 30
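
A token bucket with the defaults above behaves roughly as follows. This is a minimal sketch, not the code in `dagy_api.rate_limit`: the class name `TokenBucket`, its methods, and the injectable `clock` are assumptions made for illustration.

```python
import time


class TokenBucket:
    """One bucket per API key or IP: refills steadily, allows short bursts."""

    def __init__(self, requests_per_minute=120, burst_size=200, clock=time.monotonic):
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst_size)       # maximum stored tokens
        self.tokens = float(burst_size)         # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def retry_after(self) -> float:
        """Seconds until one token is available (a candidate Retry-After value)."""
        deficit = max(0.0, 1.0 - self.tokens)
        return deficit / self.rate
```

With these defaults a client can send 200 requests at once, after which it is held to the sustained rate of 2 requests per second.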

RBAC (Role-Based Access Control)

Dagy implements organization and membership-based RBAC.

Roles:

Owner: Full access to organization
Admin: Manage flows, runs, schedules, users
Developer: Create and run flows
Viewer: Read-only access

Membership management:

Store in DAGY_MEMBERSHIPS table
Contains: user_id, org_id, role
Check role on every API request

Example permission check:

from fastapi import Request

from dagy_api.auth import check_org_permission

async def create_flow(org_id: str, request: Request):
    check_org_permission(
        org_id=org_id,
        user_id=request.state.user_id,
        required_role="developer"
    )
    # ... create flow
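
What a role check of this kind might do internally is sketched below. The ordering viewer < developer < admin < owner is an assumption read off the role descriptions above, and `ROLE_RANK`, `has_role`, `check_membership`, and `PermissionDenied` are hypothetical names, not the `dagy_api.auth` implementation.

```python
# Assumed hierarchy: each role includes the permissions of the roles below it.
ROLE_RANK = {"viewer": 0, "developer": 1, "admin": 2, "owner": 3}


class PermissionDenied(Exception):
    """Raised when a user lacks the required role in an organization."""


def has_role(actual_role: str, required_role: str) -> bool:
    """True if actual_role is at least as privileged as required_role."""
    return ROLE_RANK[actual_role] >= ROLE_RANK[required_role]


def check_membership(memberships: dict, user_id: str, org_id: str, required_role: str) -> str:
    """Look up the user's role in the org and enforce the minimum role.

    `memberships` stands in for the DAGY_MEMBERSHIPS table, keyed by
    (user_id, org_id) with the role as the value.
    """
    role = memberships.get((user_id, org_id))
    if role is None or not has_role(role, required_role):
        raise PermissionDenied(f"{user_id} needs role {required_role!r} in {org_id}")
    return role
```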

Security Configuration

JWT Authentication Setup

Enable JWT authentication to require valid tokens for all API requests.

Prerequisites

JWT issuer (e.g., Auth0, Clerk, Cognito)
JWKS (JSON Web Key Set) endpoint
JWT issuer URL and audience

Enable JWT Authentication

Configure in environment YAML:

# infrastructure/develop.yml
dagy:
  jwt_required: true
  jwt_issuer: "https://your-auth-provider.com"
  jwt_audience: "dagy-api"
  jwks_url: "https://your-auth-provider.com/.well-known/jwks.json"

Set Lambda environment variables:

DAGY_JWT_REQUIRED=true
DAGY_JWT_ISSUER=https://your-auth-provider.com
DAGY_JWT_AUDIENCE=dagy-api
DAGY_JWKS_URL=https://your-auth-provider.com/.well-known/jwks.json

Redeploy CDK stack:

cd infrastructure
cdk deploy -c environment=develop

JWT Validation Flow

1. Client sends request: Authorization: Bearer eyJhbGc...
2. API Gateway HttpJwtAuthorizer fetches the issuer's public keys from the JWKS endpoint
3. The authorizer verifies the token signature, issuer, audience, and expiry
4. Lambda receives request with claims in context
5. API checks scopes/permissions

API Key Management

For programmatic access, implement API keys as an alternative to JWT.

API key format: dagy_[base32-encoded-32-random-bytes]

Example (shortened for readability; 32 random bytes encode to 56 base32 characters including padding):

dagy_JBSWY3DPEBLW64TMMQ======

Storage:

Hash API keys before storing in DAGY_API_KEYS table
Use bcrypt or Argon2
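
Key generation and hashing can be sketched with the standard library. Note the substitution: this example uses SHA-256 rather than the bcrypt or Argon2 named above, so that it runs without extra packages and produces a deterministic digest that can serve as a lookup key. Whether Dagy itself works this way is not stated here; the function names are hypothetical.

```python
import base64
import hashlib
import hmac
import secrets

PREFIX = "dagy_"


def generate_api_key() -> str:
    """Return dagy_ followed by the base32 encoding of 32 random bytes."""
    raw = secrets.token_bytes(32)
    return PREFIX + base64.b32encode(raw).decode("ascii")


def hash_api_key(api_key: str) -> str:
    """Deterministic digest to store instead of the key (SHA-256 is an assumption)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Constant-time comparison of the presented key's digest with the stored one."""
    return hmac.compare_digest(hash_api_key(presented), stored_hash)
```

A salted scheme such as bcrypt cannot be looked up by hash alone; if one is used, the stored record needs a separate non-secret identifier to locate it before verification.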

Validation in middleware:

async def validate_api_key(request: Request):
    auth_header = request.headers.get("authorization", "")
    if auth_header.startswith("Bearer dagy_"):
        api_key = auth_header.split(" ", 1)[1]
        # Hash and look up in database
        # Set request.state.user_id and request.state.org_id

Secrets Encryption

Store sensitive data (API keys, passwords, credentials) encrypted.

Generate Fernet Encryption Key

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Output: a 44-character URL-safe base64 string (treat it as a secret)

Set in Lambda

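# Note: --environment replaces the function's entire variable set, so include
# every existing DAGY_* variable alongside the key (or set the key through CDK).
aws lambda update-function-configuration \
  --function-name dagy-lambda-develop \
  --environment "Variables={DAGY_SECRETS_KEY=<base64-fernet-key-here>}"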

Encrypt Secrets in Code

from cryptography.fernet import Fernet
import os

key = os.environ["DAGY_SECRETS_KEY"].encode()  # Raises KeyError if the key is not configured
cipher = Fernet(key)

def encrypt_secret(value: str) -> str:
    return cipher.encrypt(value.encode()).decode()

def decrypt_secret(encrypted: str) -> str:
    return cipher.decrypt(encrypted.encode()).decode()

# Usage
encrypted = encrypt_secret("my-api-key-123")
decrypted = decrypt_secret(encrypted)

Store in DAGY_SECRETS Table

# Database schema
{
  "secret_id": "sec_12345",           # Partition key
  "org_id": "org_abc",                # Sort key
  "name": "github-token",             # Friendly name
  "encrypted_value": "gAAAAABl...",  # Encrypted value
  "created_at": 1704067200,
  "created_by": "user_123"
}

VPC and Security Groups

For production, deploy in a VPC with restricted network access.

Configure VPC in CDK

# infrastructure/production.yml
dagy:
  vpc_id: "vpc-prod123456"
  subnet_ids:
    - "subnet-prod111111"  # Public subnet 1
    - "subnet-prod222222"  # Public subnet 2
  security_group_ids:
    - "sg-prod-worker"    # No ingress; outbound HTTPS allowed

Security Group Rules

Do not add ingress rules to the worker security group. Workers are pull-based and do not listen for connections. Allow outbound HTTPS (tcp/443) so tasks can reach ECR, S3, DynamoDB, CloudWatch, package indexes, and external integrations.

NAT-free execution networking

Application Lambdas run outside the VPC and use AWS Lambda service networking. Fargate tasks run in the configured public subnets with public IP assignment. The worker security group should have no ingress rules and allow outbound HTTPS. When importing a VPC, each configured subnet must route 0.0.0.0/0 to an internet gateway; a private subnet without NAT will not provide task egress.

IAM Least-Privilege Policies

Create a minimal IAM role for the Lambda function.

Lambda execution role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabaseAccess",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query",
        "dynamodb:Scan",
        "dynamodb:DeleteItem"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/dagy-*"
    },
    {
      "Sid": "S3ArtifactAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::dagy-artifacts-*/*"
    },
    {
      "Sid": "SQSAccess",
      "Effect": "Allow",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage"
      ],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:dagy-events-*"
    },
    {
      "Sid": "LambdaInvoke",
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:dagy-*"
    },
    {
      "Sid": "CloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/dagy-*"
    }
  ]
}

Monitoring & Observability

Health Check Endpoints

Dagy provides health check endpoints for monitoring.

Endpoints:

GET /health
GET /health/detailed

Example health response:

{
  "status": "healthy",
  "timestamp": "2024-03-01T12:00:00Z",
  "version": "0.1.0"
}

Detailed health response:

{
  "status": "healthy",
  "timestamp": "2024-03-01T12:00:00Z",
  "version": "0.1.0",
  "components": {
    "database": {
      "status": "healthy",
      "latency_ms": 42
    },
    "s3": {
      "status": "healthy",
      "latency_ms": 156
    },
    "sqs": {
      "status": "healthy",
      "latency_ms": 31
    }
  }
}
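
One way to derive the top-level `status` from the component entries is sketched below. The rule (healthy only if every component is healthy) and the `"unhealthy"` value are assumptions; the responses above only show the healthy case.

```python
def overall_status(components: dict) -> str:
    """Return "healthy" only when every component reports healthy."""
    statuses = [info.get("status") for info in components.values()]
    if statuses and all(status == "healthy" for status in statuses):
        return "healthy"
    return "unhealthy"  # Assumed value; not shown in the example responses.


def slow_components(components: dict, max_latency_ms: int) -> list[str]:
    """Names of components whose reported latency exceeds the given budget."""
    return sorted(
        name
        for name, info in components.items()
        if info.get("latency_ms", 0) > max_latency_ms
    )
```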

Configure health checks:

# In application load balancer
aws elbv2 create-target-group \
  --name dagy-api \
  --protocol HTTP \
  --port 80 \
  --vpc-id vpc-prod123456 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

CloudWatch Metrics and Alarms

Enable Custom Metrics

import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_metric(metric_name: str, value: float, unit: str = "Count"):
    cloudwatch.put_metric_data(
        Namespace='Dagy',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'Service', 'Value': 'api'}
                ]
            }
        ]
    )

# Example: Track flow execution
publish_metric('FlowExecutionTime', execution_time_ms, 'Milliseconds')
publish_metric('FailedRuns', 1, 'Count')

Create Alarms

# Lambda error rate alarm
aws cloudwatch put-metric-alarm \
  --alarm-name dagy-lambda-errors \
  --alarm-description "Alert if Lambda errors exceed 10 per 5-minute period" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 300 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

# Database throttling alarm (DynamoDB metrics are published per table)
aws cloudwatch put-metric-alarm \
  --alarm-name dagy-database-throttles \
  --alarm-description "Alert if writes to the runs table are throttled" \
  --metric-name WriteThrottleEvents \
  --namespace AWS/DynamoDB \
  --dimensions Name=TableName,Value=dagy-runs-develop \
  --statistic Sum \
  --period 60 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

Audit Logging Configuration

Store all user actions in DAGY_AUDIT_LOGS for compliance and debugging.

Audit log schema:

{
  "audit_id": "aud_abc123",            # Partition key
  "timestamp": 1704067200,             # Sort key
  "org_id": "org_xyz",
  "user_id": "user_123",
  "action": "flow_created",
  "resource_type": "flow",
  "resource_id": "flow_abc",
  "changes": {
    "name": {"old": null, "new": "my-flow"},
    "version": {"old": null, "new": "1.0.0"}
  },
  "ip_address": "203.0.113.42",
  "user_agent": "Mozilla/5.0...",
  "status": "success"  # or "failure"
}

Log all significant actions:

from dagy_api.audit import log_audit_event

async def create_flow(org_id: str, data: FlowData, request: Request):
    # Create flow...

    # Log audit event
    log_audit_event(
        org_id=org_id,
        user_id=request.state.user_id,
        action="flow_created",
        resource_type="flow",
        resource_id=flow.id,
        changes={"name": {"old": None, "new": flow.name}},
        ip_address=request.client.host,
        status="success"
    )

Alert Rules for Pipeline Monitoring

Configure alerts to notify on flow execution failures, SLA breaches, etc.

Alert rule schema:

{
  "rule_id": "rule_123",               # Partition key
  "org_id": "org_xyz",                 # Sort key
  "name": "High failure rate",
  "enabled": True,
  "condition": {
    "metric": "run_failure_rate",
    "threshold": 0.1,  # 10% failure rate
    "window_minutes": 5,
    "operator": "GreaterThan"
  },
  "notification_channels": ["channel_slack_123"],
  "created_at": 1704067200
}
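
How a rule with this shape might be evaluated is sketched below. It assumes the runs inside `window_minutes` have already been collected, that a failed run has status `"failed"`, and that `"LessThan"` is a valid operator alongside the `"GreaterThan"` shown above; `failure_rate` and `rule_fires` are hypothetical names.

```python
import operator

# "GreaterThan" appears in the schema above; "LessThan" is an assumed counterpart.
OPERATORS = {"GreaterThan": operator.gt, "LessThan": operator.lt}


def failure_rate(run_statuses: list[str]) -> float:
    """Fraction of runs in the window whose status is "failed"."""
    if not run_statuses:
        return 0.0
    failed = sum(1 for status in run_statuses if status == "failed")
    return failed / len(run_statuses)


def rule_fires(rule: dict, run_statuses: list[str]) -> bool:
    """Evaluate a run_failure_rate rule against the statuses in its window."""
    if not rule.get("enabled", True):
        return False
    condition = rule["condition"]
    compare = OPERATORS[condition["operator"]]
    return compare(failure_rate(run_statuses), condition["threshold"])
```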

Example: Create alert rule via API

curl -X POST https://api.example.com/alerts/rules \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Flow failure rate",
    "metric": "run_failure_rate",
    "threshold": 0.15,
    "window_minutes": 10,
    "notification_channel_ids": ["channel_123"]
  }'

Managing Deployment Settings

After a flow is deployed, its runtime settings can be updated without redeploying the artifact. This is useful for changing execution strategy, adjusting schedules, or attaching new dependency packages.

Settings UI

The Dagy web UI provides a Flow Settings dialog accessible from the Flows page. Click the dropdown menu on any flow row and select Settings, or use the Flow Settings button in the flow detail panel.

The settings dialog allows you to update:

Runtime tier: Choose from nano through xlarge runtime tiers based on workload size
Default executor: Auto (determined by tier)
Schedule: Set or change the cron expression or interval for automated runs
Dependency packages: Attach or remove dependency package slugs resolved at runtime
Tags: Add, update, or remove key-value metadata tags

Changes that affect runtime behavior (runtime tier, schedule, dependency packages) trigger a confirmation prompt before saving. Existing in-progress runs are not affected; changes apply to new runs only.

Settings API

Use PUT /deployments/{name}/settings to update settings programmatically. Only the fields included in the request body are updated:

curl -X PUT https://api.dagy.io/v1/deployments/daily-etl/settings \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{
    "execution_mode": "micro",
    "schedule": "0 9 * * 1-5",
    "dep_package_slugs": ["pandas-layer"]
  }'

See the API Reference for the full field list and response schema.
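
The partial-update behavior described above can be sketched as a shallow merge. This is an illustration only: `apply_settings_update` is a hypothetical helper, the stored values are invented, and the real endpoint may validate or merge fields differently.

```python
def apply_settings_update(current: dict, patch: dict) -> dict:
    """Return new settings where only the keys present in patch change."""
    updated = dict(current)
    updated.update(patch)
    return updated


# Hypothetical stored settings for a deployment
current = {
    "execution_mode": "nano",
    "schedule": "0 6 * * *",
    "dep_package_slugs": [],
    "tags": {"team": "data"},
}
# Request body that touches two fields only
patch = {"schedule": "0 9 * * 1-5", "dep_package_slugs": ["pandas-layer"]}

result = apply_settings_update(current, patch)
print(result["schedule"])        # 0 9 * * 1-5
print(result["execution_mode"])  # nano (not in the request, so unchanged)
```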

Scaling & Performance

Database Capacity Planning

The database supports two billing modes:

On-Demand (Default)

Recommended for variable workloads
Auto-scales capacity
Pay per request

Enable on-demand:

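# The Dagy stack sets on-demand billing explicitly. If billing_mode is
# omitted, the CDK Table construct falls back to provisioned capacity.
# In dagy_stack.py:
from aws_cdk import aws_dynamodb as dynamodb
from aws_cdk.aws_dynamodb import Attribute, AttributeType, BillingMode

table = dynamodb.Table(
    self, "flows-table",
    partition_key=Attribute(name="flow_id", type=AttributeType.STRING),
    billing_mode=BillingMode.PAY_PER_REQUEST  # On-demand
)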

Provisioned

Better for predictable workloads
Cheaper at higher scale
Requires capacity planning

Example: Switch to provisioned

table = dynamodb.Table(
    self, "flows-table",
    partition_key=Attribute(name="flow_id", type=AttributeType.STRING),
    billing_mode=BillingMode.PROVISIONED,
    read_capacity=100,  # RCUs
    write_capacity=100  # WCUs
)

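# Enable auto-scaling: register the capacity range, then attach a
# target-tracking policy (without a policy, capacity never changes).
# 70% utilization is a starting point; tune it for your workload.
read_scaling = table.auto_scale_read_capacity(
    min_capacity=10,
    max_capacity=1000
)
read_scaling.scale_on_utilization(target_utilization_percent=70)

write_scaling = table.auto_scale_write_capacity(
    min_capacity=10,
    max_capacity=1000
)
write_scaling.scale_on_utilization(target_utilization_percent=70)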

Capacity Calculation

For read capacity:

1 RCU = 1 strongly consistent read/sec or 2 eventually consistent reads/sec, for items up to 4 KB
Example: 1000 strongly consistent flow reads/minute = 17 RCUs minimum (1000 / 60, rounded up)

For write capacity:

1 WCU = 1 write/sec, for items up to 1 KB
Example: 500 flow creates/minute = 9 WCUs minimum (500 / 60, rounded up)

Add 30% buffer for spikes:

Required RCUs = (peak_reads_per_sec / 1.0) * 1.3
Required WCUs = (peak_writes_per_sec / 1.0) * 1.3
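
The arithmetic above can be sketched as a small helper. `required_capacity_units` is a hypothetical name, and the sketch assumes one capacity unit per operation (strongly consistent reads of items up to 4 KB, or writes of items up to 1 KB).

```python
import math


def required_capacity_units(peak_ops_per_minute: float, buffer: float = 0.3) -> int:
    """Capacity units needed for a peak per-minute rate plus a spike buffer."""
    peak_per_sec = peak_ops_per_minute / 60
    # Round up: capacity units are whole numbers
    return math.ceil(peak_per_sec * (1 + buffer))


print(required_capacity_units(1000, buffer=0))  # 17 (minimum RCUs, no buffer)
print(required_capacity_units(1000))            # 22 (RCUs with 30% buffer)
print(required_capacity_units(500))             # 11 (WCUs with 30% buffer)
```

For eventually consistent reads, halve the read rate before calling the helper, since one RCU covers two such reads per second.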

Lambda Concurrency Limits

Control Lambda concurrency to manage costs and prevent throttling.

Account concurrency: Default 1000 concurrent executions per Region

Set function concurrency:

aws lambda put-function-concurrency \
  --function-name dagy-api-lambda \
  --reserved-concurrent-executions 500

Reserved concurrency:

Guarantees capacity for critical functions
Deducts from account total
Useful for production environments
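
Because reserved concurrency is deducted from the account total, it helps to check how much shared capacity remains for every other function. The sketch below assumes the default limit of 1000 and relies on the fact that Lambda keeps at least 100 executions unreserved; the `dagy-worker` entry and the helper name are hypothetical.

```python
ACCOUNT_LIMIT = 1000   # default account concurrency noted above
MIN_UNRESERVED = 100   # Lambda requires at least 100 to stay unreserved


def unreserved_concurrency(reserved: dict[str, int],
                           account_limit: int = ACCOUNT_LIMIT) -> int:
    """Return the shared pool left after subtracting reserved concurrency."""
    remaining = account_limit - sum(reserved.values())
    if remaining < MIN_UNRESERVED:
        raise ValueError(
            f"only {remaining} unreserved; at least {MIN_UNRESERVED} must remain"
        )
    return remaining


reserved = {
    "dagy-api-lambda": 500,  # value from the command above
    "dagy-worker": 300,      # hypothetical second function
}
print(unreserved_concurrency(reserved))  # 200
```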

Cold start optimization:

# Provision capacity to eliminate cold starts
aws lambda put-provisioned-concurrency-config \
  --function-name dagy-api-lambda \
  --provisioned-concurrent-executions 50 \
  --qualifier LIVE

ECS Task Scaling

Auto-scale ECS tasks based on CPU/memory metrics.

# Create auto-scaling target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/dagy-tasks/dagy-worker \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 100

# CPU-based scaling policy
aws application-autoscaling put-scaling-policy \
  --policy-name scale-by-cpu \
  --service-namespace ecs \
  --resource-id service/dagy-tasks/dagy-worker \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

SQS Visibility Timeout Tuning

Configure SQS visibility timeout to match task execution time.

Default: 30 seconds

Set visibility timeout:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/dagy-events \
  --attributes VisibilityTimeout=300  # 5 minutes for longer tasks

Recommended:

Short tasks (< 1 min): 120 seconds
Medium tasks (1-5 min): 300 seconds
Long tasks (5+ min): 900 seconds
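
The recommendations above can be expressed as a lookup, for example when a deployment script chooses the timeout from an expected task duration. This is a sketch only: the helper name is hypothetical, and treating a task of exactly 5 minutes as "long" is an assumption, because the ranges above overlap at that boundary.

```python
def recommended_visibility_timeout(task_seconds: float) -> int:
    """Map an expected task duration to the visibility timeout listed above."""
    if task_seconds < 60:
        return 120   # short tasks
    if task_seconds < 300:
        return 300   # medium tasks
    return 900       # long tasks (5 minutes or more)


for duration in (20, 90, 600):
    print(duration, recommended_visibility_timeout(duration))
```

Tasks that can run longer than 900 seconds would outlive the timeout and need a larger value.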

Performance Optimization Checklist

Enable database auto-scaling or set appropriate capacity
Use VPC endpoints for private access to AWS services
Enable database caching for hot data
Implement query pagination for large result sets
Use batch operations for bulk inserts/updates
Configure appropriate CloudWatch metrics and alarms
Set up CloudFront caching for static assets
Implement database connection pooling
Monitor Lambda cold start times and optimize layer size

Backup & Disaster Recovery

Database Point-in-Time Recovery

Enable automatic backups for all database tables.

# Enable PITR for all tables
for table in dagy-flows-dev dagy-runs-dev dagy-deployments-dev; do
  aws dynamodb update-continuous-backups \
    --table-name $table \
    --point-in-time-recovery-specification \
    PointInTimeRecoveryEnabled=true
done

Restore from backup:

# Restore to a specific time
aws dynamodb restore-table-to-point-in-time \
  --source-table-name dagy-runs-dev \
  --target-table-name dagy-runs-dev-restored \
  --restore-date-time 2024-03-01T12:00:00Z

S3 Versioning for Artifacts

Enable versioning on artifact buckets to protect against accidental deletion.

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket dagy-artifacts-123456789012-us-east-1-development \
  --versioning-configuration Status=Enabled

# Enable lifecycle policy to expire noncurrent versions after 90 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket dagy-artifacts-123456789012-us-east-1-development \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-versions",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "NoncurrentVersionExpiration": {"NoncurrentDays": 90}
      }
    ]
  }'

Cross-Region Replication

For disaster recovery, replicate critical data across regions.

# Create replication role
aws iam create-role \
  --role-name s3-replication-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sts:AssumeRole"
      }
    ]
  }'

# Attach replication policy
aws iam put-role-policy \
  --role-name s3-replication-role \
  --policy-name replication \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:GetReplicationConfiguration", "s3:ListBucket"],
        "Resource": "arn:aws:s3:::dagy-artifacts-*"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:GetObjectVersionForReplication", "s3:GetObjectVersionAcl"],
        "Resource": "arn:aws:s3:::dagy-artifacts-*/*"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:ReplicateObject", "s3:ReplicateDelete"],
        "Resource": "arn:aws:s3:::dagy-artifacts-replica/*"
      }
    ]
  }'

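# Create replica bucket in different region
aws s3api create-bucket \
  --bucket dagy-artifacts-replica \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2

# Replication requires versioning on the destination bucket as well
aws s3api put-bucket-versioning \
  --bucket dagy-artifacts-replica \
  --region us-west-2 \
  --versioning-configuration Status=Enabled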

# Enable replication
aws s3api put-bucket-replication \
  --bucket dagy-artifacts-123456789012-us-east-1-development \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [
      {
        "Status": "Enabled",
        "Priority": 1,
        "DeleteMarkerReplication": {"Status": "Enabled"},
        "Filter": {"Prefix": ""},
        "Destination": {
          "Bucket": "arn:aws:s3:::dagy-artifacts-replica",
          "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
          "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
        }
      }
    ]
  }'

Backup Strategy

Recommended backup frequency:

Database: Continuous (PITR enabled) + daily snapshots
S3: Versioning enabled + cross-region replication
Secrets: Encrypted backups in separate AWS account

Restore procedure:

Restore database tables from PITR to new table
Verify data integrity in non-production
Update Lambda environment variables to point to restored tables
Validate API functionality
Gradually shift traffic to restored environment

Upgrading

CDK Stack Updates

Minor updates (configuration changes, security patches):

cd infrastructure

# Review changes
cdk diff --env production

# Deploy
cdk deploy --env production \
  --require-approval never

Database Migration Considerations

Schema changes:

The database is schemaless, but application code enforces structure
Add backward compatibility for new fields
Use feature flags to enable new functionality gradually

Example: Add new field with default

# Old code
run = {
    "run_id": "run_123",
    "status": "completed"
}

# New code - handle missing field
completion_time = run.get("completion_time")  # None for records written by old code

Migrate existing records:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("dagy-runs-prod")

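# Scan page by page (each scan call returns at most 1 MB of data)
# and backfill records that lack the new field
scan_kwargs = {}
while True:
    response = table.scan(**scan_kwargs)
    for item in response["Items"]:
        if "completion_time" not in item:
            table.update_item(
                Key={"run_id": item["run_id"]},
                UpdateExpression="SET completion_time = :ct",
                ExpressionAttributeValues={":ct": 0}
            )
    # LastEvaluatedKey is absent once the final page has been read
    if "LastEvaluatedKey" not in response:
        break
    scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]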

Zero-Downtime Deployment Strategy

Blue-Green Deployment:

Deploy new Lambda version alongside existing version
Shift 10% of traffic to the new version using Lambda alias weighted routing
Monitor errors and metrics
Gradually increase traffic: 25% → 50% → 100%
Rollback immediately if issues detected

# Create alias for traffic shifting
aws lambda create-alias \
  --function-name dagy-api-lambda \
  --name LIVE \
  --function-version 1

# Update alias to shift traffic: v1 stays the primary version and
# v2 receives the additional weight
aws lambda update-alias \
  --function-name dagy-api-lambda \
  --name LIVE \
  --function-version 1 \
  --routing-config 'AdditionalVersionWeights={"2"=0.1}'  # 10% to v2, 90% to v1
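
The staged rollout listed above (10% → 25% → 50% → 100%) can be sketched as a sequence of alias updates. The helper below only builds the arguments; it assumes they would be passed to boto3's `update_alias` together with `FunctionName` and `Name`. The function name `routing_steps` and the default weights are illustrative.

```python
def routing_steps(stable_version: str, new_version: str,
                  weights=(0.10, 0.25, 0.50, 1.0)):
    """Yield update_alias keyword arguments for each rollout stage."""
    for weight in weights:
        if weight >= 1.0:
            # Final stage: promote the new version and clear the split
            yield {
                "FunctionVersion": new_version,
                "RoutingConfig": {"AdditionalVersionWeights": {}},
            }
        else:
            # The stable version stays primary; the new version gets `weight`
            yield {
                "FunctionVersion": stable_version,
                "RoutingConfig": {"AdditionalVersionWeights": {new_version: weight}},
            }


for step in routing_steps("1", "2"):
    print(step)
```

Between stages, check error rates and latency before applying the next step. To roll back, apply the stable version with an empty `AdditionalVersionWeights` map.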

Canary Deployment:

# API Gateway canary settings (REST APIs only; HTTP APIs do not support canary releases)
aws apigateway create-deployment \
  --rest-api-id abc123 \
  --stage-name prod \
  --canary-settings percentTraffic=10,useStageCache=false

Troubleshooting

Common Deployment Issues

Issue: CDK bootstrap fails

Symptoms: TemplateURL must be a valid S3 URL

Solution:

# Ensure AWS credentials are correct
aws sts get-caller-identity

# Try bootstrap again with explicit parameters
cdk bootstrap aws://123456789012/us-east-1 \
  --profile default \
  --force

Issue: Lambda cannot access the database

Symptoms: User: arn:aws:lambda:... is not authorized to perform: dynamodb:GetItem

Solution:

# Verify Lambda execution role has database permissions
aws iam list-attached-role-policies \
  --role-name dagy-lambda-role

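# Attach database policy if missing (broad managed policy for quick diagnosis;
# replace it with a policy scoped to the Dagy tables afterwards)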
aws iam attach-role-policy \
  --role-name dagy-lambda-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess

Issue: Lambda environment variables not set

Symptoms: KeyError: 'DAGY_FLOWS'

Solution:

# Check current Lambda config
aws lambda get-function-configuration \
  --function-name dagy-api-lambda \
  --query Environment

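# Update environment variables. Quote the value so the shell does not
# brace-expand it. This call replaces the entire variable set, so include
# every variable the function needs.
aws lambda update-function-configuration \
  --function-name dagy-api-lambda \
  --environment "Variables={DAGY_FLOWS=dagy-flows-dev,DAGY_RUNS=dagy-runs-dev}"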

Health Check Debugging

Lambda health endpoint failing

# Test health endpoint directly
curl -X GET \
  https://xyz123.execute-api.us-east-1.amazonaws.com/health

# Check Lambda logs
aws logs tail /aws/lambda/dagy-api-lambda --follow

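# Invoke Lambda directly for debugging
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
  --function-name dagy-api-lambda \
  --cli-binary-format raw-in-base64-out \
  --payload '{"resource": "/health", "httpMethod": "GET"}' \
  response.json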

cat response.json

Database connectivity issues

# Test database connection
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("dagy-flows-dev")

try:
    response = table.get_item(Key={"flow_id": "test"})
    print("Database connectivity: OK")
except Exception as e:
    print(f"Database error: {e}")

Lambda Cold Start Optimization

Measure cold start time

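import time

# Module-level code runs only during a cold start, so place the timer
# at the very top of the handler module
start = time.time()

# Imports, client setup, and the handler definition go here

cold_start_ms = (time.time() - start) * 1000
print(f"Cold start (module init) time: {cold_start_ms:.0f}ms")

# Lambda also reports "Init Duration" in the REPORT log line of each cold invocation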

Cold start is > 500ms? Consider:

Reducing Lambda package size (250 MB unzipped maximum for zip packages; container images can be up to 10 GB)
Using Lambda layers for dependencies
Provisioned concurrency for predictable traffic
Moving to Lambda@Edge for reduced latency

Optimize Lambda package size

# Check current package size
aws lambda get-function \
  --function-name dagy-api-lambda \
  --query 'Configuration.CodeSize'

# Remove unnecessary files from Docker image
# In Dockerfile:
RUN find . -name "*.pyc" -delete
RUN find . -name "__pycache__" -type d -delete

Database Throttling

Symptoms

ProvisionedThroughputExceededException
Lambda timeouts
API 5xx errors

Solutions

# Check consumed capacity
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedWriteCapacityUnits \
  --dimensions Name=TableName,Value=dagy-flows-dev \
  --start-time 2024-03-01T00:00:00Z \
  --end-time 2024-03-01T23:59:59Z \
  --period 300 \
  --statistics Sum

# Increase capacity (for provisioned mode)
aws dynamodb update-table \
  --table-name dagy-flows-dev \
  --provisioned-throughput ReadCapacityUnits=200,WriteCapacityUnits=200

# Or switch to on-demand mode
aws dynamodb update-table \
  --table-name dagy-flows-dev \
  --billing-mode PAY_PER_REQUEST

Optimization strategies

Enable database change streams for CDC
Use batch operations for bulk reads and writes
Implement query pagination
Use TTL for automatic data cleanup
Create secondary indexes for common queries

Rate Limiting Issues

Symptoms

429 Too Many Requests responses
X-RateLimit-Remaining: 0 header

Adjust rate limits

# In src/dagy_api/app.py
app.add_middleware(
    RateLimitMiddleware,
    requests_per_minute=240,  # Increased from 120
    burst_size=400             # Increased from 200
)
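
This guide does not show how `RateLimitMiddleware` interprets the two parameters. A token bucket is one common way to combine a sustained rate with a burst allowance, and the sketch below illustrates that technique under the assumption that `burst_size` is the bucket capacity. The `TokenBucket` class is hypothetical and is not the Dagy implementation.

```python
import time


class TokenBucket:
    """Allow bursts up to burst_size, refilled at requests_per_minute."""

    def __init__(self, requests_per_minute: int, burst_size: int,
                 clock=time.monotonic):
        self.rate_per_sec = requests_per_minute / 60
        self.capacity = burst_size
        self.tokens = float(burst_size)
        self.clock = clock
        self.last_refill = clock()

    def allow(self) -> bool:
        """Consume one token if available; False means respond with 429."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, capped at the capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# A controllable clock keeps the demonstration deterministic
fake_now = [0.0]
bucket = TokenBucket(240, 400, clock=lambda: fake_now[0])

allowed = sum(bucket.allow() for _ in range(450))
print(allowed)  # 400: the burst is exhausted, the remaining 50 are rejected

fake_now[0] += 1.0  # one second later, 240 / 60 = 4 tokens have refilled
print(sum(bucket.allow() for _ in range(10)))  # 4
```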

Test rate limiting

# Send 150 sequential requests and print only the status codes
for i in {1..150}; do
  curl -s -o /dev/null -X GET \
    https://api.example.com/health \
    -H "Authorization: Bearer dagy_test" \
    -w "Status: %{http_code}\n"
done

Frontend Issues

Clerk authentication not working

# Check Clerk keys are correctly set
echo $NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY

# Verify domain is allowed in Clerk dashboard
# Settings → Domains → check yourdomain.com is listed

# Check webhook is configured
# Settings → Webhooks → user.created event points to API

API CORS errors

Access to XMLHttpRequest at 'https://api.example.com/flows'
from origin 'https://frontend.example.com' has been blocked by CORS policy

Solution: Update CDK configuration

# infrastructure/production.yml
dagy:
  api_cors_allowed_origins:
    - "https://frontend.example.com"
    - "https://www.example.com"

Then redeploy:

cdk deploy --env production

Monitoring and Debugging

Enable verbose logging

# In Lambda environment
DAGY_LOG_LEVEL=DEBUG

# In local testing
export DAGY_LOCAL_VERBOSE=true
uv run python -m dagy_api.app

CloudWatch Insights queries

# Find errors in logs
fields @timestamp, @message, @logStream
| filter @message like /error|exception/i
| stats count() as errors by @logStream

# Track API latency
fields @duration
| stats avg(@duration), max(@duration), pct(@duration, 99)

# Find slow database queries
fields @duration, @message
| filter @message like /database|DynamoDB/
| stats pct(@duration, 95), pct(@duration, 99)

Support and Resources

Documentation: https://docs.dagy.io
GitHub Issues: https://github.com/dagy/dagy/issues
Slack Community: https://dagy-community.slack.com
Email Support: support@dagy.io

Next Steps

Complete the deployment steps above
Run health checks on the API: curl https://api.example.com/health
Deploy the frontend and configure authentication
Create your first flow using the SDK
Deploy the flow and execute it via the API
Set up monitoring, alerting, and backup policies
Configure RBAC and security policies for your organization

Version: 1.0.0
Last Updated: March 2024
Author: Dagy Team
