Dagy Self-Hosted Deployment Guide

This guide provides comprehensive instructions for deploying Dagy, a DAG orchestration platform, in a self-hosted AWS environment. It covers infrastructure deployment, configuration, and operational best practices.

Table of Contents

  1. Prerequisites
  2. Architecture Overview
  3. Infrastructure Deployment
  4. Environment Variables Reference
  5. Frontend Deployment
  6. Backend Configuration
  7. Security Configuration
  8. Monitoring & Observability
  9. Scaling & Performance
  10. Backup & Disaster Recovery
  11. Upgrading
  12. Troubleshooting

Prerequisites

AWS Account Requirements

  • Active AWS account with appropriate IAM permissions
  • Access to AWS CloudFormation, Lambda, S3, SQS, EC2, ECS, and IAM services
  • EC2 key pair created (for ECS cluster access if needed)

Local Development Environment

  • Node.js 18+ (for frontend builds)
  • Python 3.11+ (for backend and CDK)
  • AWS CLI v2 configured with appropriate credentials
  • AWS CDK CLI v2+:
    npm install -g aws-cdk
    cdk --version  # Should be v2.x.x or higher
    
  • Docker 20.10+ (for building Lambda container images)
    docker --version
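    # Authenticate Docker to ECR before pushing (a bare "docker login" targets Docker Hub)
    aws ecr get-login-password --region us-east-1 | \
      docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com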
    
  • Git for cloning the repository
  • uv (Python package manager - recommended):
    curl -LsSf https://astral.sh/uv/install.sh | sh
    

External Services

  • Clerk Account (for frontend authentication)
  • Stripe Account (optional, for billing features)
    • Create account at https://stripe.com
    • Obtain publishable and secret keys, plus product price IDs

AWS IAM Permissions

The IAM user deploying Dagy needs these permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "apigateway:*",
        "cloudformation:*",
        "dynamodb:*",
        "ec2:*",
        "ecr:*",
        "ecs:*",
        "events:*",
        "iam:*",
        "lambda:*",
        "logs:*",
        "s3:*",
        "sns:*",
        "sqs:*",
        "states:*"
      ],
      "Resource": "*"
    }
  ]
}

Architecture Overview

Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                        Frontend (Next.js)                    │
│         Hosted on Vercel, CloudFront+S3, or ECS             │
│                    (Clerk Authentication)                    │
└────────────────────────┬────────────────────────────────────┘
                         │ HTTPS
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              API Gateway (HTTP) + JWT Authorizer             │
└────────────────────────┬────────────────────────────────────┘
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
   ┌─────────┐    ┌─────────────┐   ┌──────────┐
   │ Lambda  │    │ Step        │   │   ECS    │
   │  (API)  │    │ Functions   │   │ Fargate  │
   │ Runner  │    │  (Workflow) │   │ (Tasks)  │
   └────┬────┘    └──────┬──────┘   └────┬─────┘
        │                │               │
        └────────────────┼───────────────┘
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
   ┌─────────┐    ┌──────────────┐  ┌─────────┐
   │Database │    │      S3      │  │   SQS   │
   │ (21 tbl)│    │ (Artifacts)  │  │ (Events)│
   └─────────┘    └──────────────┘  └─────────┘

Components Overview

Component | Purpose | Technology
Frontend | Web UI for DAG management, monitoring, scheduling | Next.js 14+, Clerk Auth
API Layer | REST API for flows, runs, deployments, scheduling | FastAPI + Mangum on Lambda
Lambda Runner | Default execution backend for tasks | Python 3.12 Lambda Runtime
Step Functions | Workflow state machine execution | AWS Step Functions
ECS Fargate | Long-running task execution | ECS on Fargate
Database | Core data persistence (21 tables) | Managed
S3 | Flow artifacts, logs, task outputs | S3 Buckets
SQS | Event queue for async task scheduling | SQS Standard Queue

Data Flow Examples

Flow Registration

User submits flow artifact → API Lambda validates →
  Store in S3 → Create FLOWS table entry →
  Return flow ID to user

Run Execution

Trigger run request → API Lambda creates RUN record →
  Enqueue to SQS → Execution backend polls SQS →
  Execute tasks → Update RUN/TASK_RUNS status →
  Store artifacts in S3 → Send completion event

Scheduling

Schedule created in UI → Store in SCHEDULES table →
  EventBridge rule triggers API Lambda →
  Create run via API → Follow normal run execution flow

Infrastructure Deployment

Step 1: Clone Repository and Install Dependencies

# Clone the Dagy repository
git clone https://github.com/equinox-data/dagy.git
cd dagy

# Install Python dependencies
uv sync --extra api

# Install Node dependencies (for frontend build, if deploying together)
cd web
npm install
cd ..

Step 2: Configure CDK Context Parameters

Create environment-specific configuration files in the infrastructure/ directory. The CDK app expects YAML files named after environments.

Example: infrastructure/develop.yml

---
environment: develop
region: us-east-1
app: dagy
owner: data
company: my-company
project_cost: engineering
aws_account: "123456789012"
python_version: "3.12"
dagy:
  ecr_repository_name: "dagy-service-worker-develop"
  ecr_push_principals:
    - "arn:aws:iam::123456789012:role/eqx-buildmaster"
  # Optional VPC configuration (for private environments)
  # vpc_id: "vpc-12345678"
  # subnet_ids:
  #   - "subnet-12345678"
  #   - "subnet-87654321"
  # security_group_ids:
  #   - "sg-12345678"
  # JWT authentication
  # jwt_required: true
  # jwt_issuer: "https://your-auth-provider.com"
  # jwt_audience: "your-api-audience"
  # jwks_url: "https://your-auth-provider.com/.well-known/jwks.json"
  # CORS configuration
  api_cors_allowed_origins:
    - "http://localhost:3000"
    - "https://yourdomain.com"
  # Lambda container image (leave blank to auto-build)
  # lambda_image_uri: "123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-service-worker-develop:latest"
  # lambda_image_tag: "latest"

Example: infrastructure/production.yml

---
environment: production
region: us-east-1
app: dagy
owner: data
company: my-company
project_cost: engineering
aws_account: "987654321098"
python_version: "3.12"
dagy:
  ecr_repository_name: "dagy-service-worker-prod"
  ecr_push_principals:
    - "arn:aws:iam::987654321098:role/ci-cd-role"
  vpc_id: "vpc-prod123456"
  subnet_ids:
    - "subnet-prod111111"
    - "subnet-prod222222"
  security_group_ids:
    - "sg-prod123456"
  jwt_required: true
  jwt_issuer: "https://your-auth-provider.com"
  jwt_audience: "dagy-api"
  jwks_url: "https://your-auth-provider.com/.well-known/jwks.json"
  api_cors_allowed_origins:
    - "https://dagy.yourdomain.com"
    - "https://api.dagy.yourdomain.com"
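
Before running a deployment, it can help to sanity-check the parsed configuration. The following is a minimal sketch, not part of Dagy: `validate_config` and the choice of required keys are assumptions based on the two example files above, and the function expects the YAML already loaded into a dictionary.

```python
import re

# Assumed set of mandatory top-level keys, taken from the example files above.
REQUIRED_TOP_LEVEL = ("environment", "region", "app", "aws_account", "dagy")
JWT_KEYS = ("jwt_issuer", "jwt_audience", "jwks_url")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means no issue was found."""
    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in config]

    # The account ID is quoted in YAML so it stays a 12-character string.
    account = str(config.get("aws_account", ""))
    if not re.fullmatch(r"\d{12}", account):
        problems.append("aws_account must be a 12-digit string")

    dagy = config.get("dagy") or {}
    if dagy.get("jwt_required"):
        problems += [
            f"jwt_required is true but dagy.{key} is not set"
            for key in JWT_KEYS
            if not dagy.get(key)
        ]

    for origin in dagy.get("api_cors_allowed_origins", []):
        if not origin.startswith(("http://", "https://")):
            problems.append(f"CORS origin is not a URL: {origin}")
    return problems
```

Calling `validate_config` on a production configuration that enables `jwt_required` but omits `jwks_url` returns one message naming the missing key, which is easier to act on than a failed stack deployment.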

Step 3: Build Lambda Container Image

The CDK deployment includes an automatic image build step via publish_image.sh. This script builds a Docker image optimized for AWS Lambda.

# Navigate to infrastructure directory
cd infrastructure

# Set environment variables
export DAGY_ENVIRONMENT=develop
export AWS_REGION=us-east-1

# The CDK app will automatically build and push the image to ECR
# Or manually build:
docker build -t dagy-api:latest ..

The Docker image is based on public.ecr.aws/lambda/python:3.12 and includes all necessary dependencies (FastAPI, Mangum, boto3, etc.).

Step 4: Bootstrap and Deploy CDK Stack

# Bootstrap CDK (one-time setup per AWS account/region)
cd infrastructure
cdk bootstrap aws://ACCOUNT-ID/REGION \
  --profile your-aws-profile

# Example:
cdk bootstrap aws://123456789012/us-east-1 --profile default

# Deploy the stack
cdk deploy --env develop \
  --require-approval never \
  --profile your-aws-profile

# Or with explicit parameters:
cdk deploy --env develop \
  -c environment=develop \
  -c region=us-east-1 \
  -c account=123456789012 \
  -c aws_profile=default \
  --require-approval never

The deployment process will:

  1. Validate environment configuration
  2. Build and push Lambda Docker image to ECR
  3. Create/update CloudFormation stack with all resources
  4. Output stack outputs with resource names and endpoints

Step 5: Collect Deployment Outputs

After successful deployment, the CDK outputs will include:

Outputs:
dagy-development.APIEndpoint = https://xyz123.execute-api.us-east-1.amazonaws.com
dagy-development.FlowsTableName = dagy-flows-development
dagy-development.DeploymentsTableName = dagy-deployments-development
dagy-development.RunsTableName = dagy-runs-development
dagy-development.TaskRunsTableName = dagy-task-runs-development
dagy-development.SchedulesTableName = dagy-schedules-development
dagy-development.UsersTableName = dagy-users-development
dagy-development.AccessTokensTableName = dagy-access-tokens-development
dagy-development.AccessLogsTableName = dagy-access-logs-development
dagy-development.OrganizationsTableName = dagy-organizations-development
dagy-development.MembershipsTableName = dagy-memberships-development
dagy-development.APIKeysTableName = dagy-api-keys-development
dagy-development.DAGDraftsTableName = dagy-dag-drafts-development
dagy-development.UsageEventsTableName = dagy-usage-events-development
dagy-development.UsageAggregatesTableName = dagy-usage-aggregates-development
dagy-development.SubscriptionsTableName = dagy-subscriptions-development
dagy-development.AuditLogsTableName = dagy-audit-logs-development
dagy-development.SecretsTableName = dagy-secrets-development
dagy-development.NotificationChannelsTableName = dagy-notification-channels-development
dagy-development.AlertRulesTableName = dagy-alert-rules-development
dagy-development.EnvironmentsTableName = dagy-environments-development
dagy-development.SensorsTableName = dagy-sensors-development
dagy-development.ArtifactBucketName = dagy-artifacts-123456789012-us-east-1-development
dagy-development.EventsQueueURL = https://sqs.us-east-1.amazonaws.com/123456789012/dagy-events-development
dagy-development.LambdaFunctionName = dagy-lambda-development
dagy-development.EventBridgeRuleArn = arn:aws:events:us-east-1:123456789012:rule/dagy-schedule-rule-development

Save these outputs for the next configuration steps.
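
One way to carry these outputs into the next steps is to parse the CDK output text and derive the table environment variables from it. This is a sketch under assumptions: the helper names are hypothetical, and the mapping from `...TableName` outputs to `DAGY_...` variables is inferred from the naming pattern in this guide rather than taken from the Dagy code.

```python
import re

# Matches lines such as "dagy-development.FlowsTableName = dagy-flows-development".
OUTPUT_LINE = re.compile(r"^\S+\.(?P<key>\w+)\s*=\s*(?P<value>\S+)$")
# Splits CamelCase at lower-to-upper boundaries and after acronyms (APIKeys -> API_Keys).
WORD_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def parse_outputs(text: str) -> dict[str, str]:
    """Collect "Key = value" pairs from the text printed by cdk deploy."""
    outputs = {}
    for line in text.splitlines():
        match = OUTPUT_LINE.match(line.strip())
        if match:
            outputs[match["key"]] = match["value"]
    return outputs


def table_env_vars(outputs: dict[str, str]) -> dict[str, str]:
    """Map each ...TableName output to an assumed DAGY_... variable name."""
    env = {}
    for key, value in outputs.items():
        if key.endswith("TableName"):
            stem = key[: -len("TableName")]
            env["DAGY_" + WORD_BOUNDARY.sub("_", stem).upper()] = value
    return env
```

The bucket and queue outputs (`ArtifactBucketName`, `EventsQueueURL`) do not follow the table pattern and would need to be mapped by hand.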


Environment Variables Reference

The Dagy API Lambda requires the following environment variables to be set. These are automatically configured by the CDK stack based on the resources it creates.

Tables (Core)

Variable | Description | Required
DAGY_FLOWS | Flows table name | Yes
DAGY_DEPLOYMENTS | Deployments table name | Yes
DAGY_RUNS | Flow runs table name | Yes
DAGY_TASK_RUNS | Task runs table name | Yes
DAGY_SCHEDULES | Schedules table name | Yes

Tables (Authentication & Access)

Variable | Description | Required
DAGY_USERS | Users table name | Yes
DAGY_ACCESS_TOKENS | Access tokens table name | Yes
DAGY_ACCESS_LOGS | Access logs for auditing | Yes

Tables (Organizations & Teams)

Variable | Description | Required
DAGY_ORGANIZATIONS | Organizations table name | Yes
DAGY_MEMBERSHIPS | Organization memberships | Yes
DAGY_API_KEYS | API keys for programmatic access | Yes

Tables (Flow Builder)

Variable | Description | Required
DAGY_DAG_DRAFTS | Unsaved DAG drafts | Yes

Tables (Billing & Usage)

Variable | Description | Required
DAGY_USAGE_EVENTS | Individual API call events | Yes
DAGY_USAGE_AGGREGATES | Aggregated usage metrics | Yes
DAGY_SUBSCRIPTIONS | Subscription/plan information | Yes

Tables (Enterprise Features)

Variable | Description | Required
DAGY_AUDIT_LOGS | Detailed audit trail | Yes
DAGY_SECRETS | Encrypted secrets storage | Yes
DAGY_NOTIFICATION_CHANNELS | Alert notification destinations | Yes
DAGY_ALERT_RULES | Alert rule definitions | Yes
DAGY_ENVIRONMENTS | Deployment environments | Yes
DAGY_SENSORS | Sensor/trigger configurations | Yes

Storage & Queue

Variable | Description | Required
DAGY_ARTIFACT_BUCKET | S3 bucket for flow artifacts | Yes
DAGY_EVENTS_QUEUE_URL | SQS queue URL for task events | Yes
DAGY_FLOW_EXECUTOR_FUNCTION | Lambda function name/ARN for the Flow Executor backend | No (required for in-process/task-isolated execution modes)

Secrets & Encryption

Variable | Description | Required
DAGY_SECRETS_KEY | Fernet key for encrypting secrets | No (required if secrets used)

Format: URL-safe Base64 encoding of a 32-byte Fernet key (44 characters). Generate with:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

JWT Authentication

Variable | Description | Required
DAGY_JWT_REQUIRED | Enable JWT authentication (true/false) | No (default: false)
DAGY_JWT_ISSUER | JWT issuer URL | Yes if JWT required
DAGY_JWT_AUDIENCE | JWT audience claim | Yes if JWT required
DAGY_JWKS_URL | JWKS endpoint URL | Yes if JWT required
DAGY_ACCESS_TOKEN_TTL_SECONDS | Token expiration in seconds | No (default: 86400 = 24 hours)

Stripe (Optional - For Billing)

Variable | Description | Required
STRIPE_SECRET_KEY | Stripe API secret key | No (required for billing)
STRIPE_WEBHOOK_SECRET | Stripe webhook signing secret | No (required for billing)
STRIPE_PRICE_PRO | Stripe price ID for Pro plan | No
STRIPE_PRICE_ENTERPRISE | Stripe price ID for Enterprise plan | No

Example Lambda Environment Variables Configuration

Via AWS Lambda console or CDK, set:

DAGY_FLOWS=dagy-flows-development
DAGY_DEPLOYMENTS=dagy-deployments-development
DAGY_RUNS=dagy-runs-development
DAGY_TASK_RUNS=dagy-task-runs-development
DAGY_SCHEDULES=dagy-schedules-development
DAGY_USERS=dagy-users-development
DAGY_ACCESS_TOKENS=dagy-access-tokens-development
DAGY_ACCESS_LOGS=dagy-access-logs-development
DAGY_ORGANIZATIONS=dagy-organizations-development
DAGY_MEMBERSHIPS=dagy-memberships-development
DAGY_API_KEYS=dagy-api-keys-development
DAGY_DAG_DRAFTS=dagy-dag-drafts-development
DAGY_USAGE_EVENTS=dagy-usage-events-development
DAGY_USAGE_AGGREGATES=dagy-usage-aggregates-development
DAGY_SUBSCRIPTIONS=dagy-subscriptions-development
DAGY_AUDIT_LOGS=dagy-audit-logs-development
DAGY_SECRETS=dagy-secrets-development
DAGY_NOTIFICATION_CHANNELS=dagy-notification-channels-development
DAGY_ALERT_RULES=dagy-alert-rules-development
DAGY_ENVIRONMENTS=dagy-environments-development
DAGY_SENSORS=dagy-sensors-development
DAGY_ARTIFACT_BUCKET=dagy-artifacts-123456789012-us-east-1-development
DAGY_EVENTS_QUEUE_URL=https://sqs.us-east-1.amazonaws.com/123456789012/dagy-events-development
DAGY_SECRETS_KEY=<base64-fernet-key-here>
DAGY_JWT_REQUIRED=false
DAGY_ACCESS_TOKEN_TTL_SECONDS=86400
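
A startup check can report missing variables in one pass instead of failing on the first lookup. The sketch below is illustrative only: `missing_variables` is a hypothetical helper, and the required list simply mirrors the rows marked "Yes" in the tables above.

```python
import os
from collections.abc import Mapping

# Variables marked "Yes" in the reference tables of this guide.
REQUIRED_VARS = (
    "DAGY_FLOWS", "DAGY_DEPLOYMENTS", "DAGY_RUNS", "DAGY_TASK_RUNS", "DAGY_SCHEDULES",
    "DAGY_USERS", "DAGY_ACCESS_TOKENS", "DAGY_ACCESS_LOGS",
    "DAGY_ORGANIZATIONS", "DAGY_MEMBERSHIPS", "DAGY_API_KEYS",
    "DAGY_DAG_DRAFTS",
    "DAGY_USAGE_EVENTS", "DAGY_USAGE_AGGREGATES", "DAGY_SUBSCRIPTIONS",
    "DAGY_AUDIT_LOGS", "DAGY_SECRETS", "DAGY_NOTIFICATION_CHANNELS",
    "DAGY_ALERT_RULES", "DAGY_ENVIRONMENTS", "DAGY_SENSORS",
    "DAGY_ARTIFACT_BUCKET", "DAGY_EVENTS_QUEUE_URL",
)
# Only required when DAGY_JWT_REQUIRED is "true".
JWT_VARS = ("DAGY_JWT_ISSUER", "DAGY_JWT_AUDIENCE", "DAGY_JWKS_URL")


def missing_variables(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    required = list(REQUIRED_VARS)
    if env.get("DAGY_JWT_REQUIRED", "false").lower() == "true":
        required += JWT_VARS
    return [name for name in required if not env.get(name)]
```

Passing the mapping as a parameter keeps the check easy to test; in the Lambda it would run against `os.environ` by default.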

Frontend Deployment

Option 1: Deploy on Vercel (Recommended)

Vercel provides the easiest deployment path with automatic builds, edge caching, and HTTPS.

Prerequisites

Steps

  1. Connect GitHub repository to Vercel

  2. Configure Clerk environment variables

    NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_live_xxxxx
    CLERK_SECRET_KEY=sk_live_xxxxx
    NEXT_PUBLIC_CLERK_SIGN_IN_URL=/sign-in
    NEXT_PUBLIC_CLERK_SIGN_UP_URL=/sign-up
    NEXT_PUBLIC_CLERK_AFTER_SIGN_IN_URL=/flows
    NEXT_PUBLIC_CLERK_AFTER_SIGN_UP_URL=/flows
    
  3. Configure API endpoint environment variables

    NEXT_PUBLIC_API_URL=https://api.dagy.io/app
    
  4. Deploy

    • Click "Deploy"
    • Vercel builds and deploys automatically on every push to main
    • Get your frontend URL (e.g., https://dagy.vercel.app)

Option 2: CloudFront + S3 Deployment

For organizations preferring AWS-only solutions.

Prerequisites

  • AWS CloudFront and S3 setup
  • ACM certificate for domain

Steps

  1. Build Next.js application

    cd web
    npm install
    npm run build
    
  2. Create S3 bucket for static exports

    aws s3 mb s3://dagy-frontend-production --region us-east-1
    
    # Enable static website hosting
    aws s3api put-bucket-website \
      --bucket dagy-frontend-production \
      --website-configuration '{
        "IndexDocument": {"Suffix": "index.html"},
        "ErrorDocument": {"Key": "404.html"}
      }'
    
  3. Upload built files

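    # Export static site from Next.js
    # (Next.js 14+ removed "next export"; with output: "export" set in
    # next.config.js, "npm run build" already writes the site to out/)
    npm run export  # Requires an "export" script and static export config in next.config.js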
    
    # Sync to S3
    aws s3 sync out/ s3://dagy-frontend-production/ --delete
    
  4. Create a CloudFront distribution with the bucket as its origin, then invalidate the cache after each upload

    # Create invalidation to clear cache
    aws cloudfront create-invalidation \
      --distribution-id E123ABC \
      --paths "/*"
    

Option 3: ECS Fargate Deployment

For full containerization within AWS.

Prerequisites

  • ECS cluster
  • ECR repository
  • ALB/NLB for routing

Steps

  1. Build Docker image

    cd web
    docker build -t dagy-frontend:latest -f Dockerfile.prod .
    
  2. Push to ECR

    aws ecr get-login-password --region us-east-1 | \
      docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
    
    docker tag dagy-frontend:latest \
      123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-frontend:latest
    
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-frontend:latest
    
  3. Create ECS task definition with:

    • Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-frontend:latest
    • Port: 3000
    • Environment variables for Clerk and API URL
  4. Create ECS service with ALB target group

Clerk Configuration

  1. Create Clerk application

  2. Get API keys

    • Navigate to "API Keys"
    • Copy "Publishable Key" and "Secret Key"
  3. Configure allowed origins

    • Go to "Domains"
    • Add your frontend domain(s)
    • Add API Gateway domain if using cross-origin auth
  4. Setup webhooks (for user sync to database)

    • Go to "Webhooks"
    • Create webhook for user.created and user.deleted events
    • Point to https://api.yourdomain.com/webhooks/clerk

Frontend Environment Variables

# Clerk Authentication
NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY=pk_live_xxxxx
CLERK_SECRET_KEY=sk_live_xxxxx

# API Configuration
NEXT_PUBLIC_API_URL=https://api.dagy.io/app
NEXT_PUBLIC_API_VERSION=v1

# Clerk URLs
NEXT_PUBLIC_CLERK_SIGN_IN_URL=/sign-in
NEXT_PUBLIC_CLERK_SIGN_UP_URL=/sign-up
NEXT_PUBLIC_CLERK_AFTER_SIGN_IN_URL=/flows
NEXT_PUBLIC_CLERK_AFTER_SIGN_UP_URL=/flows

# Optional: Analytics, error tracking, etc.
NEXT_PUBLIC_SENTRY_DSN=https://xxxxx@sentry.io/xxxxx

CORS Configuration for API

If frontend and API are on different domains, configure CORS in CDK:

# In infrastructure/develop.yml
dagy:
  api_cors_allowed_origins:
    - "https://dagy.yourdomain.com"
    - "https://dagy.vercel.app"
    - "http://localhost:3000"  # For local development

Backend Configuration

Execution Backends

Dagy supports three execution backends for running tasks. Configure which backends are available in your environment.

1. Lambda Backend (Default)

Simplest option; no additional configuration needed beyond Lambda function permissions.

# In dagy_api/backends/lambda_backend.py
# Automatically invokes task functions as Lambda functions
# Default concurrency: 1000 (account limit)

Pros:

  • Zero infrastructure management
  • Automatic scaling
  • Pay-per-execution pricing

Cons:

  • 15-minute timeout limit per execution
  • Limited to Lambda execution environment
  • Cold starts impact latency

Configuration:

  • Set EXECUTION_BACKEND=lambda (default)
  • No additional environment variables needed

2. Step Functions Backend

For complex workflow orchestration with state machines.

Pros:

  • Executions can run for up to 1 year (Standard workflows)
  • Complex branching and retry logic
  • Visual workflow monitoring in AWS Console

Cons:

  • More expensive per execution
  • Requires separate state machine definition
  • Additional complexity

Setup:

  1. Create IAM role for Step Functions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction",
        "states:StartExecution"
      ],
      "Resource": "*"
    }
  ]
}
  2. Configure in CDK:
# infrastructure/develop.yml
dagy:
  step_functions_role_arn: "arn:aws:iam::123456789012:role/step-functions-role"

3. ECS Fargate Backend

For long-running tasks, custom dependencies, or GPU workloads.

Pros:

  • Full container control
  • GPU support requires the ECS EC2 launch type with GPU instance types (Fargate itself does not offer GPUs)
  • Custom runtimes and libraries
  • 15-hour task duration

Cons:

  • Requires ECS cluster management
  • Higher baseline costs
  • More operational overhead

Setup:

  1. Create ECS cluster:
aws ecs create-cluster --cluster-name dagy-tasks

# Create CloudWatch log group
aws logs create-log-group --log-group-name /ecs/dagy-tasks
  2. Create task definition:
aws ecs register-task-definition \
  --family dagy-task-worker \
  --network-mode awsvpc \
  --requires-compatibilities FARGATE \
  --cpu 256 \
  --memory 512 \
  --container-definitions '[{
    "name": "task-worker",
    "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/dagy-worker:latest",
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/dagy-tasks",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }]'
  3. Configure in CDK:
# infrastructure/develop.yml
dagy:
  ecs_cluster_name: "dagy-tasks"
  ecs_task_definition_arn: "arn:aws:ecs:us-east-1:123456789012:task-definition/dagy-task-worker:1"
  ecs_subnets:
    - "subnet-12345678"
  ecs_security_groups:
    - "sg-12345678"

Rate Limiting Configuration

Dagy includes token bucket rate limiting to prevent abuse.

Default configuration:

  • 120 requests per minute per API key/IP
  • 200 token burst capacity

Customize in src/dagy_api/app.py:

from dagy_api.rate_limit import RateLimitMiddleware

app.add_middleware(
    RateLimitMiddleware,
    requests_per_minute=120,  # Requests per minute
    burst_size=200             # Max burst tokens
)

Rate limit headers in responses:

X-RateLimit-Limit: 120
X-RateLimit-Remaining: 45
Retry-After: 30
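
To make the two parameters concrete, here is a minimal token bucket sketch. It is not the code behind `RateLimitMiddleware`; the class and method names are assumptions, and the caller supplies the clock value so the behavior is easy to follow.

```python
import math


class TokenBucket:
    """Bucket that refills at requests_per_minute and holds at most burst_size tokens."""

    def __init__(self, requests_per_minute: float = 120, burst_size: float = 200, now: float = 0.0):
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst_size)
        self.tokens = self.capacity  # start full, so a burst is allowed immediately
        self.updated_at = now

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def retry_after(self) -> int:
        """Whole seconds until one token is available, suitable for Retry-After."""
        deficit = max(0.0, 1.0 - self.tokens)
        return math.ceil(deficit / self.rate)
```

With the defaults, a client can send 200 requests at once, after which requests are admitted at 2 per second. In a real service the `now` argument would typically come from `time.monotonic()`, with one bucket kept per API key or IP address.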

RBAC (Role-Based Access Control)

Dagy implements organization and membership-based RBAC.

Roles:

  • Owner: Full access to organization
  • Admin: Manage flows, runs, schedules, users
  • Developer: Create and run flows
  • Viewer: Read-only access

Membership management:

  • Store in DAGY_MEMBERSHIPS table
  • Contains: user_id, org_id, role
  • Check role on every API request

Example permission check:

from dagy_api.auth import check_org_permission

async def create_flow(org_id: str, request: Request):
    check_org_permission(
        org_id=org_id,
        user_id=request.state.user_id,
        required_role="developer"
    )
    # ... create flow
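
The check itself can be pictured as a rank comparison. The sketch below assumes the four roles form a strict hierarchy (Owner above Admin above Developer above Viewer), which the role descriptions suggest but do not state; `require_role`, `PermissionDenied`, and the in-memory membership mapping are hypothetical stand-ins for a lookup in the DAGY_MEMBERSHIPS table.

```python
# Assumed ordering: a higher rank includes every permission of the lower ranks.
ROLE_RANK = {"viewer": 0, "developer": 1, "admin": 2, "owner": 3}


class PermissionDenied(Exception):
    """Raised when a user lacks the role required for an action."""


def role_allows(member_role: str, required_role: str) -> bool:
    """True when member_role ranks at or above required_role."""
    return ROLE_RANK[member_role.lower()] >= ROLE_RANK[required_role.lower()]


def require_role(
    memberships: dict[tuple[str, str], str],
    user_id: str,
    org_id: str,
    required_role: str,
) -> None:
    """Look up the (user_id, org_id) membership and raise if the role is too low."""
    role = memberships.get((user_id, org_id))
    if role is None or not role_allows(role, required_role):
        raise PermissionDenied(f"{user_id} needs role {required_role} in {org_id}")
```

A user with no membership record in the organization is rejected the same way as a user whose role is too low, so the check never leaks whether the organization exists.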

Security Configuration

JWT Authentication Setup

Enable JWT authentication to require valid tokens for all API requests.

Prerequisites

  • JWT issuer (e.g., Auth0, Clerk, Cognito)
  • JWKS (JSON Web Key Set) endpoint
  • JWT issuer URL and audience

Enable JWT Authentication

  1. Configure in environment YAML:
# infrastructure/develop.yml
dagy:
  jwt_required: true
  jwt_issuer: "https://your-auth-provider.com"
  jwt_audience: "dagy-api"
  jwks_url: "https://your-auth-provider.com/.well-known/jwks.json"
  2. Set Lambda environment variables:
DAGY_JWT_REQUIRED=true
DAGY_JWT_ISSUER=https://your-auth-provider.com
DAGY_JWT_AUDIENCE=dagy-api
DAGY_JWKS_URL=https://your-auth-provider.com/.well-known/jwks.json
  3. Redeploy CDK stack:
cd infrastructure
cdk deploy --env develop

JWT Validation Flow

1. Client sends request: Authorization: Bearer eyJhbGc...
2. API Gateway HttpJwtAuthorizer validates the token's issuer, audience, and expiry
3. The authorizer verifies the signature using public keys fetched from the JWKS endpoint
4. Lambda receives request with claims in context
5. API checks scopes/permissions
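
Steps 4 and 5 can be sketched as follows. With an API Gateway HTTP API (payload format 2.0) and a JWT authorizer, the validated claims arrive under `requestContext.authorizer.jwt.claims` in the Lambda event. The helper names and the scope strings below are assumptions for illustration, not Dagy's own API.

```python
def claims_from_event(event: dict) -> dict:
    """Return the claims that the JWT authorizer attached to the event, if any."""
    jwt = event.get("requestContext", {}).get("authorizer", {}).get("jwt", {})
    return jwt.get("claims") or {}


def has_scope(event: dict, required_scope: str) -> bool:
    """Check a space-separated "scope" claim for the required scope."""
    claims = claims_from_event(event)
    granted = str(claims.get("scope", "")).split()
    return required_scope in granted
```

Because the authorizer has already verified the signature, the Lambda only reads the claims here. When the app runs behind Mangum, the raw event is typically reachable through the ASGI scope under the `aws.event` key.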

API Key Management

For programmatic access, implement API keys as an alternative to JWT.

API key format: dagy_[base32-encoded-32-random-bytes]

Example (shortened for readability; 32 random bytes encode to 56 base32 characters including padding):

dagy_JBSWY3DPEBLW64TMMQ======

Storage:

  • Hash API keys before storing in DAGY_API_KEYS table
  • Use bcrypt or Argon2

Validation in middleware:

async def validate_api_key(request: Request):
    auth_header = request.headers.get("authorization", "")
    if auth_header.startswith("Bearer dagy_"):
        api_key = auth_header.split(" ", 1)[1]
        # Hash and look up in database
        # Set request.state.user_id and request.state.org_id
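
The following sketch generates a key in the documented format and derives a value that can be stored and looked up. It deliberately swaps in a plain SHA-256 digest so the lookup is deterministic; a salted bcrypt or Argon2 hash, as recommended above, would additionally need a key identifier to find the matching record. All function names are hypothetical.

```python
import base64
import hashlib
import hmac
import secrets
from typing import Optional

PREFIX = "dagy_"


def generate_api_key() -> str:
    """Return dagy_ followed by the base32 encoding of 32 random bytes."""
    return PREFIX + base64.b32encode(secrets.token_bytes(32)).decode("ascii")


def fingerprint(api_key: str) -> str:
    """SHA-256 digest used as the stored lookup value (a stand-in, see above)."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def extract_api_key(authorization_header: str) -> Optional[str]:
    """Pull a dagy_ key out of an "Authorization: Bearer ..." header value."""
    scheme, _, credential = authorization_header.partition(" ")
    if scheme.lower() == "bearer" and credential.startswith(PREFIX):
        return credential
    return None


def matches(api_key: str, stored_fingerprint: str) -> bool:
    """Compare in constant time to avoid leaking how many characters matched."""
    return hmac.compare_digest(fingerprint(api_key), stored_fingerprint)
```

The plaintext key is shown to the user once at creation time; only the fingerprint would be written to the DAGY_API_KEYS table.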

Secrets Encryption

Store sensitive data (API keys, passwords, credentials) encrypted.

Generate Fernet Encryption Key

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
# Output: a 44-character URL-safe base64 string ending in "="

Set in Lambda

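# Note: --environment replaces the function's entire variable set, so include
# every existing DAGY_* variable alongside the key.
aws lambda update-function-configuration \
  --function-name dagy-api-lambda \
  --environment "Variables={DAGY_SECRETS_KEY=<base64-fernet-key-here>}"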

Encrypt Secrets in Code

from cryptography.fernet import Fernet
import os

# os.environ[...] raises a clear KeyError when the variable is unset
key = os.environ["DAGY_SECRETS_KEY"].encode()
cipher = Fernet(key)

def encrypt_secret(value: str) -> str:
    return cipher.encrypt(value.encode()).decode()

def decrypt_secret(encrypted: str) -> str:
    return cipher.decrypt(encrypted.encode()).decode()

# Usage
encrypted = encrypt_secret("my-api-key-123")
decrypted = decrypt_secret(encrypted)

Store in DAGY_SECRETS Table

# Database schema
{
  "secret_id": "sec_12345",           # Partition key
  "org_id": "org_abc",                # Sort key
  "name": "github-token",             # Friendly name
  "encrypted_value": "gAAAAABl...",  # Encrypted value
  "created_at": 1704067200,
  "created_by": "user_123"
}

VPC and Security Groups

For production, deploy in a VPC with restricted network access.

Configure VPC in CDK

# infrastructure/production.yml
dagy:
  vpc_id: "vpc-prod123456"
  subnet_ids:
    - "subnet-prod111111"  # Private subnet 1
    - "subnet-prod222222"  # Private subnet 2
  security_group_ids:
    - "sg-prod-api"       # Security group for Lambda

Security Group Rules

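# Lambda functions in a VPC only initiate connections, so the Lambda security
# group needs no inbound rule. API Gateway invokes Lambda through the Lambda
# service API, not through the function's network interface.
# Allow outbound HTTPS from the Lambda security group
aws ec2 authorize-security-group-egress \
  --group-id sg-prod-api \
  --protocol tcp \
  --port 443 \
  --cidr 0.0.0.0/0

# Allow the Lambda security group to reach the database security group over HTTPS
aws ec2 authorize-security-group-ingress \
  --group-id sg-prod-db \
  --protocol tcp \
  --port 443 \
  --source-group sg-prod-api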

Lambda VPC Execution

Lambda functions in a VPC require:

  • ENIs in private subnets
  • NAT Gateway for internet access (external APIs such as Stripe)
  • VPC endpoints for AWS services (S3, database), which keep that traffic off the NAT Gateway

# Create VPC endpoint for S3
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-prod123456 \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-prod111111

# Create VPC endpoint for database
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-prod123456 \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --route-table-ids rtb-prod111111

IAM Least-Privilege Policies

Create a minimal IAM role for the Lambda function.

Lambda execution role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DatabaseAccess",
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "dynamodb:Query",
        "dynamodb:Scan",
        "dynamodb:DeleteItem"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:123456789012:table/dagy-flows-*",
        "arn:aws:dynamodb:us-east-1:123456789012:table/dagy-runs-*",
        "arn:aws:dynamodb:us-east-1:123456789012:table/dagy-*"
      ]
    },
    {
      "Sid": "S3ArtifactAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::dagy-artifacts-*/*"
    },
    {
      "Sid": "SQSAccess",
      "Effect": "Allow",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage"
      ],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:dagy-events-*"
    },
    {
      "Sid": "LambdaInvoke",
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:dagy-*"
    },
    {
      "Sid": "CloudWatchLogs",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/dagy-*"
    }
  ]
}

Monitoring & Observability

Health Check Endpoints

Dagy provides health check endpoints for monitoring.

Endpoints:

GET /health
GET /health/detailed

Example health response:

{
  "status": "healthy",
  "timestamp": "2024-03-01T12:00:00Z",
  "version": "0.1.0"
}

Detailed health response:

{
  "status": "healthy",
  "timestamp": "2024-03-01T12:00:00Z",
  "version": "0.1.0",
  "components": {
    "database": {
      "status": "healthy",
      "latency_ms": 42
    },
    "s3": {
      "status": "healthy",
      "latency_ms": 156
    },
    "sqs": {
      "status": "healthy",
      "latency_ms": 31
    }
  }
}
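
A detailed response of this shape can be assembled by timing one probe per component. The sketch below is an assumption about how such an endpoint might work, not Dagy's implementation: the probe callables are placeholders for real checks, and the "unhealthy" status value is hypothetical since only "healthy" appears in the examples.

```python
import time
from datetime import datetime, timezone
from typing import Callable


def run_probe(probe: Callable[[], None]) -> dict:
    """Run one check and record its status and latency in milliseconds."""
    started = time.perf_counter()
    try:
        probe()
        status = "healthy"
    except Exception:
        status = "unhealthy"  # assumed value; the guide only shows "healthy"
    latency_ms = round((time.perf_counter() - started) * 1000)
    return {"status": status, "latency_ms": latency_ms}


def detailed_health(probes: dict[str, Callable[[], None]], version: str) -> dict:
    """Build the detailed response; overall status is healthy only if all probes pass."""
    components = {name: run_probe(probe) for name, probe in probes.items()}
    healthy = all(c["status"] == "healthy" for c in components.values())
    return {
        "status": "healthy" if healthy else "unhealthy",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "components": components,
    }
```

Each probe would wrap a cheap call against the database, S3, or SQS. Keeping `/health` free of such calls makes it suitable for frequent load balancer checks, while `/health/detailed` can be polled less often.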

Configure health checks:

# In application load balancer
aws elbv2 create-target-group \
  --name dagy-api \
  --protocol HTTP \
  --port 80 \
  --vpc-id vpc-prod123456 \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3

CloudWatch Metrics and Alarms

Enable Custom Metrics

import boto3

cloudwatch = boto3.client('cloudwatch')

def publish_metric(metric_name: str, value: float, unit: str = "Count"):
    cloudwatch.put_metric_data(
        Namespace='Dagy',
        MetricData=[
            {
                'MetricName': metric_name,
                'Value': value,
                'Unit': unit,
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'Service', 'Value': 'api'}
                ]
            }
        ]
    )

# Example: Track flow execution
publish_metric('FlowExecutionTime', execution_time_ms, 'Milliseconds')
publish_metric('FailedRuns', 1, 'Count')

Create Alarms

# Lambda error rate alarm
aws cloudwatch put-metric-alarm \
  --alarm-name dagy-lambda-errors \
  --alarm-description "Alert if Lambda reports more than 10 errors per 5-minute period" \
  --metric-name Errors \
  --namespace AWS/Lambda \
  --statistic Sum \
  --period 300 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

# Database throttling alarm
aws cloudwatch put-metric-alarm \
  --alarm-name dagy-database-throttles \
  --alarm-description "Alert if database is throttled" \
  --metric-name WriteThrottleEvents \
  --namespace AWS/DynamoDB \
  --dimensions Name=TableName,Value=dagy-runs-development \
  --statistic Sum \
  --period 60 \
  --threshold 0 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts

Audit Logging Configuration

Store all user actions in DAGY_AUDIT_LOGS for compliance and debugging.

Audit log schema:

{
  "audit_id": "aud_abc123",            # Partition key
  "timestamp": 1704067200,             # Sort key
  "org_id": "org_xyz",
  "user_id": "user_123",
  "action": "flow_created",
  "resource_type": "flow",
  "resource_id": "flow_abc",
  "changes": {
    "name": {"old": null, "new": "my-flow"},
    "version": {"old": null, "new": "1.0.0"}
  },
  "ip_address": "203.0.113.42",
  "user_agent": "Mozilla/5.0...",
  "status": "success"  # or "failure"
}

Log all significant actions:

from dagy_api.audit import log_audit_event

async def create_flow(org_id: str, data: FlowData, request: Request):
    # Create flow...

    # Log audit event
    log_audit_event(
        org_id=org_id,
        user_id=request.state.user_id,
        action="flow_created",
        resource_type="flow",
        resource_id=flow.id,
        changes={"name": {"old": None, "new": flow.name}},
        ip_address=request.client.host,
        status="success"
    )

Alert Rules for Pipeline Monitoring

Configure alerts to notify on flow execution failures, SLA breaches, etc.

Alert rule schema:

{
  "rule_id": "rule_123",               # Partition key
  "org_id": "org_xyz",                 # Sort key
  "name": "High failure rate",
  "enabled": true,
  "condition": {
    "metric": "run_failure_rate",
    "threshold": 0.1,  # 10% failure rate
    "window_minutes": 5,
    "operator": "GreaterThan"
  },
  "notification_channels": ["channel_slack_123"],
  "created_at": 1704067200
}
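
How a rule of this shape might be evaluated is sketched below. The evaluation logic is an assumption based on the field names: the run record fields (`status`, `finished_at`), the `"failed"` status value, and the `LessThan` operator are hypothetical, and only `GreaterThan` appears in this guide.

```python
import operator

# Only "GreaterThan" appears in the schema above; "LessThan" is an assumed companion.
OPERATORS = {"GreaterThan": operator.gt, "LessThan": operator.lt}


def failure_rate(statuses: list[str]) -> float:
    """Fraction of runs that failed; 0.0 when there are no runs."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status == "failed") / len(statuses)


def rule_fires(rule: dict, runs: list[dict], now: float) -> bool:
    """Evaluate a run_failure_rate rule over the runs inside its time window."""
    if not rule.get("enabled", True):
        return False
    condition = rule["condition"]
    window_start = now - condition["window_minutes"] * 60
    recent = [run["status"] for run in runs if run["finished_at"] >= window_start]
    compare = OPERATORS[condition["operator"]]
    return compare(failure_rate(recent), condition["threshold"])
```

With a threshold of 0.1 and a 5-minute window, two failures out of ten recent runs would fire the rule, while exactly one out of ten would not, because the comparison is strict.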

Example: Create alert rule via API

curl -X POST https://api.example.com/alerts/rules \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Flow failure rate",
    "metric": "run_failure_rate",
    "threshold": 0.15,
    "window_minutes": 10,
    "notification_channel_ids": ["channel_123"]
  }'
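To make the `condition` block concrete, the following sketch shows how a rule like the one above could be evaluated against run counts from the most recent window. It is not Dagy's alert engine: the `Condition` class, the helper names, and the `LessThan` operator are assumptions for illustration (only `GreaterThan` appears in the schema above).

```python
from dataclasses import dataclass


@dataclass
class Condition:
    metric: str
    threshold: float
    window_minutes: int
    operator: str = "GreaterThan"


def failure_rate(failed_runs: int, total_runs: int) -> float:
    """Fraction of runs that failed in the window; 0.0 when there were no runs."""
    return failed_runs / total_runs if total_runs else 0.0


def should_fire(condition: Condition, observed: float) -> bool:
    """Compare an observed metric value with the rule threshold."""
    if condition.operator == "GreaterThan":
        return observed > condition.threshold
    if condition.operator == "LessThan":  # assumed operator, not shown in the schema
        return observed < condition.threshold
    raise ValueError(f"unsupported operator: {condition.operator}")


# 3 failures out of 20 runs in the last 5 minutes is a 15% failure rate,
# which exceeds the 10% threshold.
rule = Condition(metric="run_failure_rate", threshold=0.1, window_minutes=5)
fires = should_fire(rule, failure_rate(failed_runs=3, total_runs=20))
```

Treating an empty window as a 0% failure rate avoids a division by zero and keeps idle flows from triggering failure alerts.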

Managing Deployment Settings

After a flow is deployed, its runtime settings can be updated without redeploying the artifact. This is useful for changing execution strategy, adjusting schedules, or attaching new dependency packages.

Settings UI

The Dagy web UI provides a Flow Settings dialog accessible from the Flows page. Click the dropdown menu on any flow row and select Settings, or use the Flow Settings button in the flow detail panel.

The settings dialog allows you to update:

  • Runtime tier: Choose from nano through xlarge runtime tiers based on workload size
  • Default executor: Auto (determined by tier)
  • Schedule: Set or change the cron expression or interval for automated runs
  • Dependency packages: Attach or remove dependency package slugs resolved at runtime
  • Tags: Add, update, or remove key-value metadata tags

Changes that affect runtime behavior (runtime tier, schedule, dependency packages) trigger a confirmation prompt before saving. Existing in-progress runs are not affected; changes apply to new runs only.

Settings API

Use PUT /deployments/{name}/settings to update settings programmatically. Only the fields included in the request body are updated:

curl -X PUT https://api.dagy.io/v1/deployments/daily-etl/settings \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "execution_mode": "micro",
    "schedule": "0 9 * * 1-5",
    "dep_package_slugs": ["pandas-layer"]
  }'

See the API Reference for the full field list and response schema.
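Because only the fields present in the request body are updated, a client should omit anything it does not intend to change. The sketch below builds such a request with the Python standard library without sending it. The helper name `build_settings_request` is hypothetical, and the field names are taken from the curl example above.

```python
import json
import urllib.parse
import urllib.request
from typing import Any


def build_settings_request(
    base_url: str, name: str, token: str, **fields: Any
) -> urllib.request.Request:
    """Build a PUT request containing only the settings that were provided."""
    # Drop fields left as None so they are not sent (and not overwritten).
    body = {key: value for key, value in fields.items() if value is not None}
    if not body:
        raise ValueError("at least one setting must be provided")
    path = f"/deployments/{urllib.parse.quote(name, safe='')}/settings"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_settings_request(
    "https://api.dagy.io/v1",
    "daily-etl",
    token="example-token",
    schedule="0 9 * * 1-5",
    execution_mode=None,  # left unchanged, so it is omitted from the body
)
```

Passing the result to `urllib.request.urlopen(request)` would send it; the response schema is described in the API Reference.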


Scaling & Performance

Database Capacity Planning

The database supports two billing modes:

On-Demand (Default)

  • Recommended for variable workloads
  • Auto-scales capacity
  • Pay per request

Enable on-demand:

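# The Dagy stack sets on-demand billing explicitly. The CDK Table construct
# itself defaults to PROVISIONED when billing_mode is omitted.
# In dagy_stack.py:
from aws_cdk import aws_dynamodb as dynamodb

table = dynamodb.Table(
    self, "flows-table",
    partition_key=dynamodb.Attribute(name="flow_id", type=dynamodb.AttributeType.STRING),
    billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST
)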

Provisioned

  • Better for predictable workloads
  • Cheaper at higher scale
  • Requires capacity planning

Example: Switch to provisioned

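table = dynamodb.Table(
    self, "flows-table",
    partition_key=dynamodb.Attribute(name="flow_id", type=dynamodb.AttributeType.STRING),
    billing_mode=dynamodb.BillingMode.PROVISIONED,
    read_capacity=100,  # RCUs
    write_capacity=100  # WCUs
)

# Enable auto-scaling. Registering the min/max range alone does not scale
# anything; attach a target-utilization policy to each dimension.
# 70% is a starting point, tune it for your workload.
read_scaling = table.auto_scale_read_capacity(min_capacity=10, max_capacity=1000)
read_scaling.scale_on_utilization(target_utilization_percent=70)

write_scaling = table.auto_scale_write_capacity(min_capacity=10, max_capacity=1000)
write_scaling.scale_on_utilization(target_utilization_percent=70)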

Capacity Calculation

For read capacity:

  • 1 RCU = 1 strongly consistent read/sec or 2 eventually consistent reads/sec, for items up to 4 KB
  • Example: 1000 flow reads/minute ≈ 16.7 reads/sec = 17 RCUs minimum (strongly consistent)

For write capacity:

  • 1 WCU = 1 write/sec, for items up to 1 KB
  • Example: 500 flow creates/minute ≈ 8.3 writes/sec = 9 WCUs minimum

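Add a 30% buffer for spikes. The divisor is the number of operations one capacity unit covers per second: 1.0 for strongly consistent reads and for writes, 2.0 for eventually consistent reads: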

Required RCUs = (peak_reads_per_sec / 1.0) * 1.3
Required WCUs = (peak_writes_per_sec / 1.0) * 1.3
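The same arithmetic can be written as a small calculator. This is a sketch under the sizing rules stated above (4 KB per read unit, 1 KB per write unit, eventually consistent reads at half cost); the function names and the default 30% buffer parameter are chosen for illustration.

```python
import math


def required_rcus(
    peak_reads_per_sec: float,
    item_size_kb: float = 4.0,
    strongly_consistent: bool = True,
    buffer: float = 0.3,
) -> int:
    """Read capacity units needed for a peak read rate, including headroom."""
    # Each read consumes one unit per 4 KB of item size, rounded up.
    units_per_read = float(math.ceil(item_size_kb / 4.0))
    if not strongly_consistent:
        units_per_read /= 2  # eventually consistent reads cost half as much
    return math.ceil(peak_reads_per_sec * units_per_read * (1 + buffer))


def required_wcus(
    peak_writes_per_sec: float,
    item_size_kb: float = 1.0,
    buffer: float = 0.3,
) -> int:
    """Write capacity units needed for a peak write rate, including headroom."""
    # Each write consumes one unit per 1 KB of item size, rounded up.
    units_per_write = math.ceil(item_size_kb / 1.0)
    return math.ceil(peak_writes_per_sec * units_per_write * (1 + buffer))


# The examples above with the 30% buffer applied:
rcus = required_rcus(1000 / 60)  # 1000 reads/minute
wcus = required_wcus(500 / 60)   # 500 writes/minute
```

With the buffer included, the two examples come out at 22 RCUs and 11 WCUs rather than the bare minimums of 17 and 9.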

Lambda Concurrency Limits

Control Lambda concurrency to manage costs and prevent throttling.

Account concurrency: Default 1000 concurrent executions per Region, shared by all functions in the account

Set function concurrency:

aws lambda put-function-concurrency \
  --function-name dagy-api-lambda \
  --reserved-concurrent-executions 500

Reserved concurrency:

  • Guarantees capacity for critical functions
  • Deducts from account total
  • Useful for production environments

Cold start optimization:

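# Provisioned concurrency keeps environments initialized, avoiding cold starts
# for requests served within the provisioned amount (requires an alias or version)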
aws lambda put-provisioned-concurrency-config \
  --function-name dagy-api-lambda \
  --provisioned-concurrent-executions 50 \
  --qualifier LIVE

ECS Task Scaling

Auto-scale ECS tasks based on CPU/memory metrics.

# Create auto-scaling target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/dagy-tasks/dagy-worker \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 2 \
  --max-capacity 100

# CPU-based scaling policy
aws application-autoscaling put-scaling-policy \
  --policy-name scale-by-cpu \
  --service-namespace ecs \
  --resource-id service/dagy-tasks/dagy-worker \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'

SQS Visibility Timeout Tuning

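Set the SQS visibility timeout longer than the longest expected task execution time. If the timeout expires while a task is still running, the message becomes visible again and may be processed a second time.

Default: 30 seconds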

Set visibility timeout:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/dagy-events \
  --attributes VisibilityTimeout=300  # 5 minutes for longer tasks

Recommended:

  • Short tasks (< 1 min): 120 seconds
  • Medium tasks (1-5 min): 300 seconds
  • Long tasks (5+ min): 900 seconds
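The recommendations above can be expressed as a lookup. This is a small sketch; the function name is hypothetical, and a task of exactly 5 minutes is treated as "medium" because the tiers overlap at that boundary.

```python
def recommended_visibility_timeout(max_task_seconds: float) -> int:
    """Map the longest expected task duration to the recommended timeout tier."""
    if max_task_seconds <= 0:
        raise ValueError("max_task_seconds must be positive")
    if max_task_seconds < 60:
        return 120  # short tasks
    if max_task_seconds <= 300:
        return 300  # medium tasks
    return 900      # long tasks


timeout = recommended_visibility_timeout(45)  # a 45-second task
```

For tasks that can run close to or beyond the tier value, extend the timeout further so the message does not reappear mid-execution.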

Performance Optimization Checklist

  • Enable database auto-scaling or set appropriate capacity
  • Configure Lambda reserved concurrency for predictable workloads
  • Use VPC endpoints for private access to AWS services
  • Enable database caching for hot data
  • Implement query pagination for large result sets
  • Use batch operations for bulk inserts/updates
  • Configure appropriate CloudWatch metrics and alarms
  • Set up CloudFront caching for static assets
  • Implement database connection pooling
  • Monitor Lambda cold start times and optimize layer size

Backup & Disaster Recovery

Database Point-in-Time Recovery

Enable automatic backups for all database tables.

# Enable PITR for all tables
for table in dagy-flows-dev dagy-runs-dev dagy-deployments-dev; do
  aws dynamodb update-continuous-backups \
    --table-name "$table" \
    --point-in-time-recovery-specification \
    PointInTimeRecoveryEnabled=true
done

Restore from backup:

# Restore to a specific time
aws dynamodb restore-table-to-point-in-time \
  --source-table-name dagy-runs-dev \
  --target-table-name dagy-runs-dev-restored \
  --restore-date-time 2024-03-01T12:00:00Z

S3 Versioning for Artifacts

Enable versioning on artifact buckets to protect against accidental deletion.

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket dagy-artifacts-123456789012-us-east-1-development \
  --versioning-configuration Status=Enabled

# Enable lifecycle policy to expire noncurrent versions after 90 days
aws s3api put-bucket-lifecycle-configuration \
  --bucket dagy-artifacts-123456789012-us-east-1-development \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-versions",
        "Status": "Enabled",
        "Filter": {},
        "NoncurrentVersionExpiration": {"NoncurrentDays": 90}
      }
    ]
  }'

Cross-Region Replication

For disaster recovery, replicate critical data across regions.

# Create replication role
aws iam create-role \
  --role-name s3-replication-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {"Service": "s3.amazonaws.com"},
        "Action": "sts:AssumeRole"
      }
    ]
  }'

# Attach replication policy
aws iam put-role-policy \
  --role-name s3-replication-role \
  --policy-name replication \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:GetReplicationConfiguration", "s3:ListBucket"],
        "Resource": "arn:aws:s3:::dagy-artifacts-*"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:GetObjectVersionForReplication", "s3:GetObjectVersionAcl"],
        "Resource": "arn:aws:s3:::dagy-artifacts-*/*"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:ReplicateObject", "s3:ReplicateDelete"],
        "Resource": "arn:aws:s3:::dagy-artifacts-replica/*"
      }
    ]
  }'

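# Create replica bucket in different region
aws s3api create-bucket \
  --bucket dagy-artifacts-replica \
  --region us-west-2 \
  --create-bucket-configuration LocationConstraint=us-west-2

# Replication requires versioning on both the source and the replica bucket
aws s3api put-bucket-versioning \
  --bucket dagy-artifacts-replica \
  --versioning-configuration Status=Enabled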

# Enable replication
aws s3api put-bucket-replication \
  --bucket dagy-artifacts-123456789012-us-east-1-development \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [
      {
        "Status": "Enabled",
        "Priority": 1,
        "DeleteMarkerReplication": {"Status": "Enabled"},
        "Filter": {"Prefix": ""},
        "Destination": {
          "Bucket": "arn:aws:s3:::dagy-artifacts-replica",
          "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
          "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}}
        }
      }
    ]
  }'

Backup Strategy

Recommended backup frequency:

  • Database: Continuous (PITR enabled) + daily snapshots
  • S3: Versioning enabled + cross-region replication
  • Secrets: Encrypted backups in separate AWS account

Restore procedure:

  1. Restore database tables from PITR to new table
  2. Verify data integrity in non-production
  3. Update Lambda environment variables to point to restored tables
  4. Validate API functionality
  5. Gradually shift traffic to restored environment

Upgrading

CDK Stack Updates

Minor updates (configuration changes, security patches):

cd infrastructure

# Review changes
cdk diff --env production

# Deploy
cdk deploy --env production \
  --require-approval never

Database Migration Considerations

Schema changes:

  • The database is schemaless, but application code enforces structure
  • Add backward compatibility for new fields
  • Use feature flags to enable new functionality gradually

Example: Add new field with default

# Old code
run = {
    "run_id": "run_123",
    "status": "completed"
}

# New code - handle missing field
completion_time = run.get("completion_time")  # None for records written by old code

Migrate existing records:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("dagy-runs-prod")

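# Scan and update all records. A single scan call returns at most 1 MB of
# data, so follow LastEvaluatedKey until the table is exhausted.
scan_kwargs = {}
while True:
    response = table.scan(**scan_kwargs)
    for item in response["Items"]:
        if "completion_time" not in item:
            table.update_item(
                Key={"run_id": item["run_id"]},
                UpdateExpression="SET completion_time = :ct",
                ExpressionAttributeValues={":ct": 0}
            )
    if "LastEvaluatedKey" not in response:
        break
    scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]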

Zero-Downtime Deployment Strategy

Blue-Green Deployment:

  1. Deploy the new Lambda version alongside the existing version
  2. Shift 10% of traffic to the new version using Lambda alias weighted routing
  3. Monitor errors and metrics
  4. Gradually increase traffic: 25% → 50% → 100%
  5. Roll back immediately if issues are detected

# Create alias for traffic shifting
aws lambda create-alias \
  --function-name dagy-api \
  --name LIVE \
  --function-version 1

# Update alias to shift traffic: version 1 stays primary, 10% goes to version 2
aws lambda update-alias \
  --function-name dagy-api \
  --name LIVE \
  --function-version 1 \
  --routing-config 'AdditionalVersionWeights={"2"=0.1}'
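The staged rollout can be summarized as a loop that advances through traffic weights and stops at the first unhealthy stage. The sketch below models only the decision logic: `error_rate_at` is a hypothetical callback that would shift traffic (for example by updating the alias), wait, and return the observed error rate. The 1% rollback threshold is an assumption.

```python
from typing import Callable, Sequence


def run_rollout(
    error_rate_at: Callable[[float], float],
    stages: Sequence[float] = (0.10, 0.25, 0.50, 1.00),
    max_error_rate: float = 0.01,
) -> tuple[str, list[float]]:
    """Walk through traffic weights, stopping at the first unhealthy stage.

    Returns the final status and the list of weights that were attempted.
    """
    visited: list[float] = []
    for weight in stages:
        visited.append(weight)
        if error_rate_at(weight) > max_error_rate:
            # The caller should point the alias back at the old version here.
            return "rolled_back", visited
    return "completed", visited


# Simulated rollout where errors appear once half of the traffic is shifted.
status, attempted = run_rollout(lambda weight: 0.05 if weight >= 0.5 else 0.0)
```

Keeping the health check behind a callback makes the loop easy to test without touching AWS.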

Canary Deployment:

# API Gateway canary setting (canary deployments are a REST API feature,
# so this uses the apigateway command rather than apigatewayv2)
aws apigateway create-deployment \
  --rest-api-id abc123 \
  --stage-name prod \
  --canary-settings percentTraffic=10,useStageCache=false

Troubleshooting

Common Deployment Issues

Issue: CDK bootstrap fails

Symptoms: TemplateURL must be a valid S3 URL

Solution:

# Ensure AWS credentials are correct
aws sts get-caller-identity

# Try bootstrap again with explicit parameters
cdk bootstrap aws://123456789012/us-east-1 \
  --profile default \
  --force

Issue: Lambda cannot access the database

Symptoms: User: arn:aws:lambda:... is not authorized to perform: dynamodb:GetItem

Solution:

# Verify Lambda execution role has database permissions
aws iam list-attached-role-policies \
  --role-name dagy-lambda-role

# Attach database policy if missing
aws iam attach-role-policy \
  --role-name dagy-lambda-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess

Issue: Lambda environment variables not set

Symptoms: KeyError: 'DAGY_FLOWS'

Solution:

# Check current Lambda config
aws lambda get-function-configuration \
  --function-name dagy-api-lambda \
  --query Environment

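# Update environment variables. Quote the value so the shell does not
# brace-expand it. This call replaces the function's entire set of
# environment variables, so include every variable the function needs.
aws lambda update-function-configuration \
  --function-name dagy-api-lambda \
  --environment "Variables={DAGY_FLOWS=dagy-flows-dev,DAGY_RUNS=dagy-runs-dev}"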

Health Check Debugging

Lambda health endpoint failing

# Test health endpoint directly
curl -X GET \
  https://xyz123.execute-api.us-east-1.amazonaws.com/health

# Check Lambda logs
aws logs tail /aws/lambda/dagy-api-lambda --follow

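# Invoke Lambda directly for debugging
# (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws lambda invoke \
  --function-name dagy-api-lambda \
  --cli-binary-format raw-in-base64-out \
  --payload '{"resource": "/health", "httpMethod": "GET"}' \
  response.json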

cat response.json

Database connectivity issues

# Test database connection
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("dagy-flows-dev")

try:
    response = table.get_item(Key={"flow_id": "test"})
    print("Database connectivity: OK")
except Exception as e:
    print(f"Database error: {e}")

Lambda Cold Start Optimization

Measure cold start time

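import time

_init_start = time.perf_counter()

# Module-level imports and client setup go here. This code runs once per
# cold start, before the first invocation.

_init_ms = (time.perf_counter() - _init_start) * 1000
_is_cold = True

def handler(event, context):
    global _is_cold
    if _is_cold:
        print(f"Cold start init time: {_init_ms:.0f}ms")
        _is_cold = False
    # Lambda handler code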

If cold start exceeds 500 ms, consider:

  • Reducing Lambda package size (250 MB unzipped limit for zip packages; container images can be up to 10 GB)
  • Using Lambda layers for dependencies
  • Provisioned concurrency for predictable traffic
  • Moving to Lambda@Edge for reduced latency

Optimize Lambda package size

# Check current package size
aws lambda get-function \
  --function-name dagy-api-lambda \
  --query 'Configuration.CodeSize'

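# Remove unnecessary files from the Docker image.
# In Dockerfile (do this in the same layer that installs dependencies;
# deleting files in a later layer does not shrink the image):
RUN find . -name "*.pyc" -delete && \
    find . -name "__pycache__" -type d -prune -exec rm -rf {} +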

Database Throttling

Symptoms

  • ProvisionedThroughputExceededException
  • Lambda timeouts
  • API 5xx errors

Solutions

# Check consumed capacity
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedWriteCapacityUnits \
  --dimensions Name=TableName,Value=dagy-flows-dev \
  --start-time 2024-03-01T00:00:00Z \
  --end-time 2024-03-01T23:59:59Z \
  --period 300 \
  --statistics Sum

# Increase capacity (for provisioned mode)
aws dynamodb update-table \
  --table-name dagy-flows-dev \
  --provisioned-throughput ReadCapacityUnits=200,WriteCapacityUnits=200

# Or switch to on-demand mode
aws dynamodb update-table \
  --table-name dagy-flows-dev \
  --billing-mode PAY_PER_REQUEST

Optimization strategies

  • Enable database change streams for CDC
  • Use batch operations for bulk reads and writes
  • Implement query pagination
  • Use TTL for automatic data cleanup
  • Create secondary indexes for common queries

Rate Limiting Issues

Symptoms

  • 429 Too Many Requests responses
  • X-RateLimit-Remaining: 0 header

Adjust rate limits

# In src/dagy_api/app.py
app.add_middleware(
    RateLimitMiddleware,
    requests_per_minute=240,  # Increased from 120
    burst_size=400             # Increased from 200
)
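To show how `requests_per_minute` and `burst_size` typically interact, here is a minimal token-bucket sketch. It is not the `RateLimitMiddleware` implementation, which this guide does not show; it assumes the common model in which the bucket holds up to `burst_size` tokens and refills at `requests_per_minute / 60` tokens per second.

```python
import time
from typing import Optional


class TokenBucket:
    """Illustrative token bucket; class and method names are hypothetical."""

    def __init__(
        self, requests_per_minute: int, burst_size: int, now: Optional[float] = None
    ) -> None:
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst_size)       # maximum tokens held
        self.tokens = float(burst_size)         # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True if a request may proceed, consuming one token."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# With the default limits, 250 back-to-back requests exhaust the burst.
bucket = TokenBucket(requests_per_minute=120, burst_size=200, now=0.0)
allowed = sum(bucket.allow(now=0.0) for _ in range(250))
```

Under this model a client can send up to `burst_size` requests at once, after which it is held to the sustained per-minute rate. The optional `now` argument exists only to make the behavior deterministic in tests.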

Test rate limiting

# Send 150 requests in quick succession and print only the status code
for i in {1..150}; do
  curl -s -o /dev/null -X GET \
    https://api.example.com/health \
    -H "Authorization: Bearer dagy_test" \
    -w "Status: %{http_code}\n"
done

Frontend Issues

Clerk authentication not working

# Check Clerk keys are correctly set
echo $NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY

# Verify domain is allowed in Clerk dashboard
# Settings → Domains → check yourdomain.com is listed

# Check webhook is configured
# Settings → Webhooks → user.created event points to API

API CORS errors

Access to XMLHttpRequest at 'https://api.example.com/flows'
from origin 'https://frontend.example.com' has been blocked by CORS policy

Solution: Update CDK configuration

# infrastructure/production.yml
dagy:
  api_cors_allowed_origins:
    - "https://frontend.example.com"
    - "https://www.example.com"

Then redeploy:

cdk deploy --env production

Monitoring and Debugging

Enable verbose logging

# In Lambda environment
DAGY_LOG_LEVEL=DEBUG

# In local testing
export DAGY_LOCAL_VERBOSE=true
uv run python -m dagy_api.app

CloudWatch Insights queries

# Find errors in logs
fields @timestamp, @message, @logStream
| filter @message like /error|exception/i
| stats count() as errors by @logStream

# Track API latency
fields @duration
| stats avg(@duration), max(@duration), pct(@duration, 99)

# Find slow database queries
fields @duration, @message
| filter @message like /database|DynamoDB/
| stats pct(@duration, 95), pct(@duration, 99)

Support and Resources

Next Steps

  1. Complete the deployment steps above
  2. Run health checks on the API: curl https://api.example.com/health
  3. Deploy the frontend and configure authentication
  4. Create your first flow using the SDK
  5. Deploy the flow and execute it via the API
  6. Set up monitoring, alerting, and backup policies
  7. Configure RBAC and security policies for your organization

Version: 1.0.0 Last Updated: March 2024 Author: Dagy Team