Back to docs
Architecture

Components

Dagy is composed of several distinct components working together. This document describes each component's responsibilities, interfaces, and implementation.

Control Plane

API Lambda

The central component serving all REST endpoints. Built with FastAPI and deployed to AWS Lambda via the Mangum adapter.

Responsibilities:

  • Serve 114 REST API endpoints
  • Authenticate requests (access tokens, API keys, JWT)
  • Enforce RBAC permissions on all endpoints
  • Apply rate limiting (token-bucket, 120 req/min)
  • Audit-log all mutations
  • Route events to SQS for async processing

Key files:

  • src/dagy_api/app.py: FastAPI application and all endpoint definitions
  • src/dagy_api/auth.py: Authentication and RBAC logic
  • src/dagy_api/rate_limit.py: Token-bucket rate limiter middleware
  • src/dagy_api/models.py: Pydantic request/response models
  • src/dagy_api/config.py: Settings from environment variables

Event Processor (SQS Consumer)

Processes asynchronous events from the SQS queue. Runs as the same Lambda function but triggered by SQS instead of API Gateway.

Responsibilities:

  • Execute flow runs via the selected backend
  • Process scheduler ticks (find due schedules, trigger runs)
  • Poll async backend status (Step Functions, ECS)
  • Update run status and trigger notifications

Key files:

  • src/dagy_api/events.py: Event routing and processing logic
  • src/dagy_api/event_bus.py: SQS message publishing
  • src/dagy_api/execution.py: Direct execution from artifact

Scheduler

Implemented as an EventBridge rule that sends scheduler_tick events to SQS every minute. The SQS consumer processes ticks by querying due schedules and triggering runs.

Responsibilities:

  • Query due schedules (enabled schedules where next_run_epoch is in the past)
  • Trigger runs for due schedules
  • Update schedule metadata (next_run_at, last_triggered_at)
  • Handle catchup policies (none, all)

Execution Backends

Backend Abstraction

All backends implement the ExecutionBackend abstract base class:

class ExecutionBackend(ABC):
    @property
    def name(self) -> str: ...
    @property
    def capabilities(self) -> BackendCapabilities: ...

    def submit_run(...) -> BackendRunResult: ...
    def get_run_status(...) -> BackendRunResult: ...
    def cancel_run(...) -> BackendRunResult: ...
    def get_logs(...) -> List[BackendLogEntry]: ...

Lambda Backend

Synchronous execution within the Lambda runtime. Best for short, simple tasks.

PropertyValue
Max duration15 minutes (Lambda timeout)
Max memory10 GB
Parallel tasksNo (sequential)
Cost modelPer-invocation + duration
Native retryNo (Dagy SDK handles retries)

Key file: src/dagy_api/backends/lambda_backend.py

Step Functions Backend

Orchestrates flows as AWS Step Functions state machines. Best for parallel workflows.

PropertyValue
Max duration1 year
Max memoryN/A (delegates to Lambda)
Parallel tasksYes (Parallel state)
Cost modelPer state transition ($0.000025)
Native retryYes (ASL Retry policies)

Translates FlowSpec to Amazon States Language (ASL) with support for sequential chains, parallel branches, retry policies, and timeouts.

Key file: src/dagy_api/backends/step_functions.py

ECS Fargate Backend

Runs flows as containerized tasks on ECS Fargate. Best for resource-intensive workloads.

PropertyValue
Max durationUnlimited
Max memory120 GB
Parallel tasksYes
Cost modelPer vCPU-hour + GB-hour
Native retryNo (Dagy handles retries)

Supports custom Docker images, flexible CPU/memory allocation, VPC networking, and CloudWatch log streaming.

Key file: src/dagy_api/backends/ecs.py

Backend Router

Automatically selects the optimal backend using a 5-level resolution strategy:

  1. Explicit: Per-run executor parameter override
  2. Deployment: default_executor on the deployment
  3. Flow: default_executor on the flow
  4. Rules engine: Duration, resource, and complexity rules
  5. Default: Falls back to Lambda

Routing rules:

  • Memory > 3 GB → ECS
  • Duration > 1 hour → ECS
  • Duration 15 min–1 hour → Step Functions
  • Task count > 50 → Step Functions
  • Default → Lambda

Key files:

  • src/dagy_api/backends/router.py: Router and rules
  • src/dagy_api/backends/registry.py: Backend registration

Persistence Layer

All data access goes through a repository layer that provides org-scoped CRUD methods. The platform manages 21 entities organized by domain:

DomainEntities
ExecutionFlow, Deployment, Run, TaskRun, Schedule
AuthUser, AccessToken, AccessLog
OrganizationOrganization, Membership, ApiKey
Flow BuilderDagDraft
BillingUsageEvent, UsageAggregate, Subscription
EnterpriseAuditLog, Secret, NotificationChannel, AlertRule, Environment, Sensor

See Data Model for entity schemas and relationships.

S3 Artifact Store

Stores flow artifacts (ZIP packages) uploaded during deployment.

  • Bucket enforces SSL and blocks public access
  • Artifacts organized as: dagy/flows/{flow_name}/{version}/artifact.zip
  • Retrieved by execution backends at run time

Enterprise Features

Audit Logging

Captures all mutation operations across all endpoints.

  • Org-scoped audit entries
  • Per-resource audit trail queries
  • Records: actor_email, resource_type, resource_id, action, before/after JSON
  • log_audit_event() called from every mutation endpoint (wrapped in try/except)

Key file: src/dagy_api/audit.py

Secret Management

Fernet-encrypted secrets with per-org and per-environment scoping.

  • Encryption key from DAGY_SECRETS_KEY environment variable
  • Secrets encrypted at rest, decrypted only via GET /secrets/{name}/value
  • Environment-specific overrides (e.g., different DB credentials per env)

Key file: src/dagy_api/secrets.py

Notifications

Multi-channel alerting system with configurable rules.

  • Channel types: Slack, email, webhook, PagerDuty
  • Alert triggers: on_failure, on_success, on_sla_breach, on_retry
  • evaluate_and_dispatch() called from events.py on terminal run states
  • Optional flow_name filter for flow-specific alerts

Key file: src/dagy_api/notifications.py

Health Monitoring

Component-level health checks with latency measurement.

  • Checks database, S3, and SQS connectivity
  • Returns healthy/degraded/unhealthy status
  • /health (public) and /health/detailed (authenticated)

Key file: src/dagy_api/sla.py

Frontend

Next.js Application

React-based SPA with Clerk authentication, served via Vercel or CloudFront+S3.

Key pages:

RoutePurpose
/dashboardUsage summary, recent runs, top flows
/runsRun list with status filtering
/flowsFlow catalog and versions
/flow-builderVisual flow editor (React Flow)
/eventsSchedules and sensors
/observabilityMetrics and monitoring
/environmentsEnvironment and secrets management
/teamRBAC and member management
/settingsAudit logs, notifications, API keys

Key libraries:

  • React Flow (@xyflow/react) for DAG canvas
  • React Query (@tanstack/react-query) for API data fetching
  • Clerk (@clerk/nextjs) for authentication
  • Tailwind CSS for styling

Infrastructure (CDK)

All AWS resources are defined in infrastructure/dagy_stack.py using AWS CDK:

  • 21 database tables (on-demand billing)
  • 1 S3 bucket (artifacts, versioned)
  • 1 SQS queue (event processing)
  • 1 Lambda function (API + SQS consumer)
  • 1 ECS cluster (Fargate capacity)
  • 1 ECR repository (custom worker images)
  • IAM roles for Lambda, Step Functions, ECS
  • CloudWatch log groups