Orchestrating Long-Running Workflows - Lessons from Production
The Problem I Faced
Three months into a major platform initiative, I encountered a scaling limitation that required architectural intervention.
I had built an integration pipeline connecting the platform with external vendor systems. The workflow: receive a request, submit it to the vendor's API, and wait for their webhook callback confirming completion. The challenge? Vendor processing times ranged from 30 seconds to several hours, and some required manual review on their end.
My initial implementation used database polling every 5 seconds to check for webhook arrivals. Lambda functions were constantly spinning. CloudWatch costs were escalating, and the architecture created unnecessary coupling between systems.
I spent a weekend researching alternatives. What I discovered transformed my approach to external system integration.
The Callback Pattern
The solution was Step Functions' .waitForTaskToken integration. The concept is elegant: a workflow can pause indefinitely, consuming no compute resources, until an external event signals completion.
Here's the architecture I implemented:
Client posts message to SQS (Entry Queue)
↓
SQS triggers Step Function execution
↓
Step Function STEP 1: Send message to Processing Queue (with task token)
↓
Processing Queue triggers Worker Lambda
↓
Worker Lambda submits request to external vendor
↓
Step Function PAUSES (costs nothing while waiting)
↓
Vendor webhook arrives → calls SendTaskSuccess with stored token
↓
Step Function resumes and continues to next steps
The key insight: the Step Function sends a message containing a task token, then waits. When the external process completes, it uses that token to resume the workflow.
No polling. No idle compute. Clean separation of concerns.
How It Works
Entry Point: SQS Triggers Step Function
The flow starts when a client posts a message to the entry queue. This SQS queue has an EventBridge Pipes or Lambda trigger that starts a new Step Function execution:

ENTRY QUEUE receives message:
{
requestId: "req-123",
vendorId: "vendor-abc",
payload: { ... }
}
TRIGGER starts Step Function with this input
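As a sketch of the Lambda-trigger variant, the handler below starts one execution per SQS record. The Step Functions client and state machine ARN are injected so the logic can be exercised without AWS; in production the client would be `boto3.client("stepfunctions")`, whose `start_execution` takes these same keyword arguments.

```python
import json


def start_executions(event, sfn_client, state_machine_arn):
    """For each SQS record, start one Step Function execution.

    `sfn_client` is anything exposing boto3's start_execution
    signature; `state_machine_arn` identifies the workflow.
    """
    started = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        resp = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            # Using requestId as the execution name gives free tracing
            # and rejects accidental duplicate starts for the same request.
            name=body["requestId"],
            input=json.dumps(body),
        )
        started.append(resp["executionArn"])
    return started
```

Naming the execution after `requestId` is optional, but it makes duplicate entry-queue deliveries fail loudly instead of silently starting a second workflow.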
Step 1: Send to Processing Queue with Task Token
The Step Function's first state sends a message to the processing queue, including the auto-generated task token:

STATE: SubmitForProcessing
TYPE: Task with waitForTaskToken
ACTION: Send message to Processing Queue containing:
- Original request data
- Task token (auto-generated: $$.Task.Token)
TIMEOUT: 24 hours
ON SUCCESS: Continue to next state
ON TIMEOUT: Go to HandleTimeout state
The message sent to the processing queue looks like:
{
requestId: "req-123",
vendorId: "vendor-abc",
payload: { ... },
taskToken: "AAAAKgAAAAI..." // Auto-generated by Step Functions
}
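In Amazon States Language this state uses the `sqs:sendMessage.waitForTaskToken` service integration; `"$$.Task.Token"` is the context-object path Step Functions substitutes with the generated token. A minimal sketch, expressed as a Python dict (the queue URL and state names are placeholders):

```python
import json

# Sketch of the SubmitForProcessing state. The queue URL and the
# "HandleTimeout" / "ProcessVendorResponse" state names are assumptions.
submit_for_processing = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/processing-queue",
        "MessageBody": {
            "requestId.$": "$.requestId",
            "vendorId.$": "$.vendorId",
            "payload.$": "$.payload",
            # Step Functions injects the task token here at runtime.
            "taskToken.$": "$$.Task.Token",
        },
    },
    "TimeoutSeconds": 86400,  # 24 hours, matching the state above
    "Catch": [
        {"ErrorEquals": ["States.Timeout"], "Next": "HandleTimeout"}
    ],
    "Next": "ProcessVendorResponse",
}
```

The state "succeeds" only when something calls SendTaskSuccess with the token, not when the SQS send completes.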
Processing Queue Triggers Worker Lambda
The processing queue has a Lambda trigger. When the message arrives, the worker Lambda executes:

FUNCTION WorkerLambda(sqsMessage):
requestId = sqsMessage.requestId
taskToken = sqsMessage.taskToken
// Store token for later retrieval when webhook arrives
STORE in DynamoDB:
key: requestId
value: taskToken
ttl: 2 days
// Submit to external vendor
vendorResponse = CALL VendorAPI.Submit(
externalReference: requestId,
callbackUrl: "https://my-api.com/webhooks/vendor/{requestId}",
payload: sqsMessage.payload
)
LOG "Submitted to vendor, their reference: {vendorResponse.referenceId}"
// IMPORTANT: Do NOT call SendTaskSuccess here
// The Step Function stays paused until the webhook arrives
At this point, the Step Function is paused. The worker Lambda has finished, but the workflow is waiting for the external callback.
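The pseudocode above can be sketched in Python with the DynamoDB table and vendor client injected as parameters (both are stand-ins here; in production the table would be a boto3 DynamoDB `Table` resource, whose `put_item` takes the same `Item=` keyword):

```python
import time


def worker_lambda(sqs_message, token_table, vendor_api):
    """Store the task token, then hand the request to the vendor.

    `token_table` mimics a DynamoDB table's put_item; `vendor_api`
    is a hypothetical stand-in for the vendor's SDK.
    """
    request_id = sqs_message["requestId"]

    # Persist the token so the webhook handler can find it later.
    token_table.put_item(Item={
        "requestId": request_id,
        "taskToken": sqs_message["taskToken"],
        "ttl": int(time.time()) + 2 * 24 * 3600,  # expire after 2 days
    })

    response = vendor_api.submit(
        external_reference=request_id,
        callback_url=f"https://my-api.com/webhooks/vendor/{request_id}",
        payload=sqs_message["payload"],
    )
    # Deliberately no SendTaskSuccess here: the workflow stays paused
    # until the vendor's webhook arrives.
    return response["referenceId"]
```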
Webhook Resumes the Workflow
When the vendor calls the webhook endpoint (minutes, hours, or days later):

FUNCTION HandleVendorWebhook(request):
requestId = request.pathParameters.requestId
webhookData = PARSE request.body
// Retrieve the stored task token
tokenRecord = GET from DynamoDB where key = requestId
IF tokenRecord not found:
RETURN 404 "No pending integration"
taskToken = tokenRecord.taskToken
// Resume the Step Function
IF webhookData.status == "COMPLETED":
CALL StepFunctions.SendTaskSuccess(
token: taskToken,
output: { vendorData: webhookData }
)
ELSE IF webhookData.status == "FAILED":
CALL StepFunctions.SendTaskFailure(
token: taskToken,
error: "VendorFailed",
cause: webhookData.errorMessage
)
// Clean up
DELETE from DynamoDB where key = requestId
RETURN 200 OK
The SendTaskSuccess call wakes up the Step Function, which continues to its next state with the vendor's response data.
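A Python sketch of this handler, with the token table and Step Functions client injected for testability. boto3's real `send_task_success(taskToken=..., output=...)` and `send_task_failure(taskToken=..., error=..., cause=...)` take these same keyword arguments, and `output` must be a JSON string:

```python
import json


def handle_vendor_webhook(request, token_table, sfn_client):
    """Resume the paused workflow using the stored task token."""
    request_id = request["pathParameters"]["requestId"]
    webhook = json.loads(request["body"])

    # Retrieve the token the worker Lambda stored earlier.
    record = token_table.get_item(Key={"requestId": request_id}).get("Item")
    if record is None:
        return {"statusCode": 404, "body": "No pending integration"}

    token = record["taskToken"]
    if webhook["status"] == "COMPLETED":
        sfn_client.send_task_success(
            taskToken=token,
            output=json.dumps({"vendorData": webhook}),
        )
    elif webhook["status"] == "FAILED":
        sfn_client.send_task_failure(
            taskToken=token,
            error="VendorFailed",
            cause=webhook.get("errorMessage", "unknown"),
        )

    # Clean up; a second delivery of the same webhook now gets a 404.
    token_table.delete_item(Key={"requestId": request_id})
    return {"statusCode": 200, "body": "OK"}
```

Deleting the record after use doubles as a cheap duplicate-webhook guard.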
Why This Architecture?
Decoupling Entry from Processing
By using SQS as the entry point, clients get immediate acknowledgment. They don't wait for the Step Function to start or for any processing to begin. The message is queued, and they're done.

Automatic Retry and Dead Letter Handling
Both queues can have dead letter queues configured. If the Step Function fails to start, or if the worker Lambda fails, messages aren't lost.

Cost Efficiency
- Entry Queue → Step Function: Minimal cost (SQS trigger)
- Step Function waiting: $0 while paused
- Worker Lambda: Only runs once per request
- Webhook handler: Only runs when vendor responds
Advanced Patterns I've Implemented
Multi-Vendor Orchestration
Some workflows require coordination across multiple external systems. After the initial wait, the Step Function can branch:

STATE: ProcessOrder (after vendor responds)
TYPE: Parallel
BRANCHES:
Branch 1: Submit to Payment Provider
- Send to Payment Queue with new task token
- Wait for payment confirmation (5 min timeout)
Branch 2: Submit to Fulfillment Partner
- Send to Fulfillment Queue with new task token
- Wait for shipping confirmation (24 hour timeout)
NEXT: FinalizeOrder (when BOTH complete)
Each branch independently sends messages, waits for callbacks, and resumes.
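The branching above maps to an ASL Parallel state whose branches each contain their own `waitForTaskToken` task. A condensed sketch as a Python dict (queue URLs, field names, and state names are placeholders):

```python
import json

# Hypothetical helper: both branches are the same shape, differing only
# in queue URL and timeout.
def wait_branch(state_name, queue_url, timeout_seconds):
    return {
        "StartAt": state_name,
        "States": {
            state_name: {
                "Type": "Task",
                "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
                "Parameters": {
                    "QueueUrl": queue_url,
                    "MessageBody": {
                        "orderId.$": "$.orderId",
                        "taskToken.$": "$$.Task.Token",
                    },
                },
                "TimeoutSeconds": timeout_seconds,
                "End": True,
            }
        },
    }

process_order = {
    "Type": "Parallel",
    "Branches": [
        wait_branch("SubmitPayment",
                    "https://sqs.us-east-1.amazonaws.com/123456789012/payment-queue",
                    300),      # 5-minute payment timeout
        wait_branch("SubmitFulfillment",
                    "https://sqs.us-east-1.amazonaws.com/123456789012/fulfillment-queue",
                    86400),    # 24-hour shipping timeout
    ],
    # Parallel states proceed only when every branch has completed.
    "Next": "FinalizeOrder",
}
```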
Human Approval Workflows
Some compliance processes require manual review:

STATE: AwaitComplianceApproval
TYPE: Task with waitForTaskToken
ACTION: Send to Approval Queue with reviewer email and task token
TIMEOUT: 7 days
NEXT: ProcessApprovalDecision
The approval queue triggers a Lambda that sends an email with approve/reject links. When the reviewer clicks, the API calls SendTaskSuccess or SendTaskFailure.
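A sketch of the decision endpoint, assuming approve/reject links of the form shown in `build_review_links` (the link format, table shape, and injected Step Functions client are all assumptions; the real API call names match boto3):

```python
import json


def build_review_links(base_url, request_id):
    """Hypothetical link format embedded in the reviewer's email."""
    return {
        "approve": f"{base_url}/approvals/{request_id}?decision=approve",
        "reject": f"{base_url}/approvals/{request_id}?decision=reject",
    }


def handle_approval_decision(request, token_table, sfn_client):
    """Map a reviewer's click to SendTaskSuccess or SendTaskFailure."""
    request_id = request["pathParameters"]["requestId"]
    decision = request["queryStringParameters"]["decision"]

    record = token_table.get_item(Key={"requestId": request_id}).get("Item")
    if record is None:
        return {"statusCode": 404, "body": "No pending approval"}

    if decision == "approve":
        sfn_client.send_task_success(
            taskToken=record["taskToken"],
            output=json.dumps({"approved": True}),
        )
    else:
        sfn_client.send_task_failure(
            taskToken=record["taskToken"],
            error="ApprovalRejected",
            cause=f"Rejected by reviewer for {request_id}",
        )
    return {"statusCode": 200, "body": "Decision recorded"}
```

Note the links carry only the requestId, never the task token itself, since the token is a bearer credential.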
Progress Tracking for Long Waits
For user-facing integrations, I update a status table that the frontend can query:

FUNCTION UpdateIntegrationStatus(requestId, status, details):
UPDATE DynamoDB "IntegrationStatus":
key: requestId
status: status // "Queued", "Submitted", "Awaiting Vendor", "Complete"
details: details
updatedAt: now()
The frontend polls this lightweight table instead of the workflow itself.
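A minimal Python version of that status writer, with the table injected; the status vocabulary is taken from the comment above, and rejecting unknown statuses is an added safety assumption:

```python
import datetime

# Status vocabulary from the pseudocode above.
VALID_STATUSES = {"Queued", "Submitted", "Awaiting Vendor", "Complete"}


def update_integration_status(table, request_id, status, details):
    """Write the latest status so the frontend polls a cheap table
    instead of describing the Step Function execution itself."""
    if status not in VALID_STATUSES:
        raise ValueError(f"Unknown status: {status}")
    table.put_item(Item={
        "requestId": request_id,
        "status": status,
        "details": details,
        "updatedAt": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
```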
Production Metrics
After four months in production:

| Metric | Before | After |
|---|---|---|
| Integration Requests Processed | - | 14,000+ |
| Average Wait Duration | - | 4.2 hours |
| Longest Successful Wait | - | 6 days (manual review) |
| Monthly Infrastructure Cost | ~$210 | ~$6.80 |
| Polling-Related Lambda Invocations | 2.8M/month | 0 |
Lessons Learned
Always Configure Timeouts
My initial implementation omitted timeouts. One workflow remained in a waiting state for three weeks—the vendor's webhook endpoint was misconfigured on their side. Now every wait state has an explicit timeout with appropriate error handling.

Store Task Tokens Securely
Task tokens are essentially bearer credentials. Anyone with the token can complete your workflow. I store them in DynamoDB with TTL and validate webhook signatures before using the token.

Implement Dead Letter Queues on Both Queues
Messages that fail processing move to a DLQ for investigation. This is configured at the SQS queue level with a maxReceiveCount of 3.
Handle Duplicate Webhooks
Vendors sometimes send duplicate callbacks. I use conditional database updates—only process the webhook if the request exists AND hasn't been marked as processed. If the condition fails, I know it's a duplicate and skip processing.

Idempotency in Worker Lambda
If the worker Lambda fails after submitting to the vendor but before completing, SQS will retry. The worker must handle this—check if already submitted before calling the vendor API again.

When to Apply This Pattern
Appropriate Use Cases:
- External vendor/partner integrations with webhook callbacks
- Third-party API calls with unpredictable response times
- Human-in-the-loop approval workflows
- Multi-system orchestration
- Any operation where you "fire and wait"
Inappropriate Use Cases:
- Sub-second synchronous operations
- Simple fire-and-forget notifications
- Internal service-to-service calls (use direct invocation)
Conclusion
The pattern—SQS entry point → Step Function → SQS with task token → Lambda → wait for callback—solved a genuine architectural challenge for my integration layer. The key benefits:
- Decoupled entry: Clients post to SQS and get immediate acknowledgment
- Zero-cost waiting: Step Functions pause without consuming resources
- Automatic retries: SQS handles failures at each stage
- Clean separation: Each component has a single responsibility