Technical · January 28, 2026 · 15 min read

Orchestrating Long-Running Workflows - Lessons from Production

AWS Step Functions · SQS · Async Workflows · Serverless

The Problem I Faced

Three months into a major platform initiative, I encountered a scaling limitation that required architectural intervention.

I had built an integration pipeline connecting the platform with external vendor systems. The workflow: receive a request, submit it to the vendor's API, and wait for their webhook callback confirming completion. The challenge? Vendor processing times ranged from 30 seconds to several hours, and some required manual review on their end.

My initial implementation used database polling every 5 seconds to check for webhook arrivals. Lambda functions were constantly spinning. CloudWatch costs were escalating, and the architecture created unnecessary coupling between systems.

I spent a weekend researching alternatives. What I discovered transformed my approach to external system integration.

The Callback Pattern

The solution was Step Functions' .waitForTaskToken integration. The concept is elegant: a workflow can pause indefinitely, consuming no compute resources, until an external event signals completion.

Here's the architecture I implemented:

Client posts message to SQS (Entry Queue)
    ↓
SQS triggers Step Function execution
    ↓
Step Function STEP 1: Send message to Processing Queue (with task token)
    ↓
Processing Queue triggers Worker Lambda
    ↓
Worker Lambda submits request to external vendor
    ↓
Step Function PAUSES (costs nothing while waiting)
    ↓
Vendor webhook arrives → calls SendTaskSuccess with stored token
    ↓
Step Function resumes and continues to next steps

The key insight: the Step Function sends a message containing a task token, then waits. When the external process completes, it uses that token to resume the workflow. No polling. No idle compute. Clean separation of concerns.

How It Works

Entry Point: SQS Triggers Step Function

The flow starts when a client posts a message to the entry queue. An EventBridge Pipe or a Lambda trigger on this queue starts a new Step Function execution:
ENTRY QUEUE receives message:
    {
        requestId: "req-123",
        vendorId: "vendor-abc",
        payload: { ... }
    }

TRIGGER starts Step Function with this input
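A minimal sketch of that trigger Lambda in Python with boto3. The `STATE_MACHINE_ARN` environment variable and the execution-naming scheme are my assumptions, not details from the original setup:

```python
import json
import os


def build_start_params(record, state_machine_arn):
    """Turn one SQS record into StartExecution parameters.

    Naming the execution after the requestId makes a redelivered
    copy of the same message fail with ExecutionAlreadyExists
    instead of starting a duplicate workflow.
    """
    body = json.loads(record["body"])
    return {
        "stateMachineArn": state_machine_arn,
        "name": body["requestId"],
        "input": json.dumps(body),
    }


def handler(event, context):
    import boto3  # deferred import keeps the helper above testable offline

    sfn = boto3.client("stepfunctions")
    for record in event["Records"]:
        sfn.start_execution(
            **build_start_params(record, os.environ["STATE_MACHINE_ARN"])
        )
```

Using the requestId as the execution name gives cheap idempotency at the entry point, at the cost of requiring requestIds that are valid execution names.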

Step 1: Send to Processing Queue with Task Token

The Step Function's first state sends a message to the processing queue, including the auto-generated task token:
STATE: SubmitForProcessing
  TYPE: Task with waitForTaskToken
  ACTION: Send message to Processing Queue containing:
  • Original request data
  • Task token (auto-generated: $$.Task.Token)
  TIMEOUT: 24 hours
  ON SUCCESS: Continue to next state
  ON TIMEOUT: Go to HandleTimeout state
The message sent to the processing queue looks like:
{
    requestId: "req-123",
    vendorId: "vendor-abc", 
    payload: { ... },
    taskToken: "AAAAKgAAAAI..."  // Auto-generated by Step Functions
}
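In Amazon States Language, that state might look roughly like this. The queue URL and the state names (HandleTimeout, ProcessOrder) are assumptions for illustration:

```json
"SubmitForProcessing": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/processing-queue",
    "MessageBody": {
      "requestId.$": "$.requestId",
      "vendorId.$": "$.vendorId",
      "payload.$": "$.payload",
      "taskToken.$": "$$.Task.Token"
    }
  },
  "TimeoutSeconds": 86400,
  "Catch": [
    { "ErrorEquals": ["States.Timeout"], "Next": "HandleTimeout" }
  ],
  "Next": "ProcessOrder"
}
```

The `.waitForTaskToken` suffix on the resource ARN is what tells Step Functions to pause after sending the message; `TimeoutSeconds` bounds the wait at 24 hours.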

Processing Queue Triggers Worker Lambda

The processing queue has a Lambda trigger. When the message arrives, the worker Lambda executes:
FUNCTION WorkerLambda(sqsMessage):
    requestId = sqsMessage.requestId
    taskToken = sqsMessage.taskToken
    
    // Store token for later retrieval when webhook arrives
    STORE in DynamoDB:
        key: requestId
        value: taskToken
        ttl: 2 days
    
    // Submit to external vendor
    vendorResponse = CALL VendorAPI.Submit(
        externalReference: requestId,
        callbackUrl: "https://my-api.com/webhooks/vendor/{requestId}",
        payload: sqsMessage.payload
    )
    
    LOG "Submitted to vendor, their reference: {vendorResponse.referenceId}"
    
    // IMPORTANT: Do NOT call SendTaskSuccess here
    // The Step Function stays paused until the webhook arrives
At this point, the Step Function is paused. The worker Lambda has finished, but the workflow is waiting for the external callback.
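A Python sketch of that worker, following the pseudocode above. The `IntegrationTokens` table name and the deferred boto3 import are my choices, not necessarily the original implementation:

```python
import json
import time


def token_record(message, ttl_days=2):
    """Build the DynamoDB item mapping requestId -> taskToken.

    The expiresAt attribute is a TTL so DynamoDB expires stale
    tokens on its own, roughly matching the 24-hour state timeout.
    """
    return {
        "requestId": message["requestId"],
        "taskToken": message["taskToken"],
        "expiresAt": int(time.time()) + ttl_days * 86400,
    }


def handler(event, context):
    import boto3  # deferred import keeps token_record() testable offline

    table = boto3.resource("dynamodb").Table("IntegrationTokens")
    for record in event["Records"]:
        message = json.loads(record["body"])
        # Store the token first: if the vendor calls back before this
        # Lambda returns, the webhook handler can already find it.
        table.put_item(Item=token_record(message))
        submit_to_vendor(message)


def submit_to_vendor(message):
    """Placeholder for the vendor API call (with the callback URL).

    Deliberately does NOT call SendTaskSuccess; the Step Function
    stays paused until the vendor's webhook arrives.
    """
    ...
```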

Webhook Resumes the Workflow

When the vendor calls the webhook endpoint (minutes, hours, or days later):
FUNCTION HandleVendorWebhook(request):
    requestId = request.pathParameters.requestId
    webhookData = PARSE request.body
    
    // Retrieve the stored task token
    tokenRecord = GET from DynamoDB where key = requestId
    
    IF tokenRecord not found:
        RETURN 404 "No pending integration"
    
    taskToken = tokenRecord.taskToken
    
    // Resume the Step Function
    IF webhookData.status == "COMPLETED":
        CALL StepFunctions.SendTaskSuccess(
            token: taskToken,
            output: { vendorData: webhookData }
        )
    ELSE IF webhookData.status == "FAILED":
        CALL StepFunctions.SendTaskFailure(
            token: taskToken,
            error: "VendorFailed",
            cause: webhookData.errorMessage
        )
    
    // Clean up
    DELETE from DynamoDB where key = requestId
    
    RETURN 200 OK
The SendTaskSuccess call wakes up the Step Function, which continues to its next state with the vendor's response data.
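One possible boto3 implementation of that webhook handler. The table name and the vendor payload shape are assumed to match the earlier sketches:

```python
import json


def callback_action(webhook_data):
    """Map a vendor webhook payload to the Step Functions call to make.

    Returns ("success", output), ("failure", error, cause), or
    ("ignore",) for statuses we don't recognize.
    """
    if webhook_data["status"] == "COMPLETED":
        return ("success", {"vendorData": webhook_data})
    if webhook_data["status"] == "FAILED":
        return ("failure", "VendorFailed", webhook_data.get("errorMessage", ""))
    return ("ignore",)


def handler(event, context):
    import boto3  # deferred import keeps callback_action() testable offline

    sfn = boto3.client("stepfunctions")
    table = boto3.resource("dynamodb").Table("IntegrationTokens")

    request_id = event["pathParameters"]["requestId"]
    webhook_data = json.loads(event["body"])

    item = table.get_item(Key={"requestId": request_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": "No pending integration"}

    action = callback_action(webhook_data)
    if action[0] == "success":
        sfn.send_task_success(taskToken=item["taskToken"],
                              output=json.dumps(action[1]))
    elif action[0] == "failure":
        sfn.send_task_failure(taskToken=item["taskToken"],
                              error=action[1], cause=action[2])

    table.delete_item(Key={"requestId": request_id})
    return {"statusCode": 200, "body": "OK"}
```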

Why This Architecture?

Decoupling Entry from Processing

By using SQS as the entry point, clients get immediate acknowledgment. They don't wait for the Step Function to start or for any processing to begin. The message is queued, and they're done.

Automatic Retry and Dead Letter Handling

Both queues can have dead letter queues configured. If the Step Function fails to start, or if the worker Lambda fails, messages aren't lost.

Cost Efficiency

  • Entry Queue → Step Function: Minimal cost (SQS trigger)
  • Step Function waiting: $0 while paused
  • Worker Lambda: Only runs once per request
  • Webhook handler: Only runs when vendor responds
Compare this to polling every 5 seconds for hours—the savings are dramatic.
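The gap is easy to quantify. A rough calculation, using the 4.2-hour average wait from the production metrics in this post:

```python
# Invocation count per request: 5-second polling vs the callback pattern,
# for a 4.2-hour average vendor turnaround.
poll_interval_s = 5
avg_wait_s = 4.2 * 3600
polling_invocations = avg_wait_s / poll_interval_s   # one status check per 5 s
callback_invocations = 2                             # worker Lambda + webhook handler

print(f"{polling_invocations:.0f} polls vs {callback_invocations} invocations")
# → 3024 polls vs 2 invocations
```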

Advanced Patterns I've Implemented

Multi-Vendor Orchestration

Some workflows require coordination across multiple external systems. After the initial wait, the Step Function can branch:
STATE: ProcessOrder (after vendor responds)
  TYPE: Parallel
  BRANCHES:
    Branch 1: Submit to Payment Provider
      • Send to Payment Queue with new task token
      • Wait for payment confirmation (5 min timeout)
    Branch 2: Submit to Fulfillment Partner
      • Send to Fulfillment Queue with new task token
      • Wait for shipping confirmation (24 hour timeout)
  NEXT: FinalizeOrder (when BOTH complete)
Each branch independently sends messages, waits for callbacks, and resumes.

Human Approval Workflows

Some compliance processes require manual review:
STATE: AwaitComplianceApproval
  TYPE: Task with waitForTaskToken
  ACTION: Send to Approval Queue with reviewer email and task token
  TIMEOUT: 7 days
  NEXT: ProcessApprovalDecision
The approval queue triggers a Lambda that sends an email with approve/reject links. When the reviewer clicks, the API calls SendTaskSuccess or SendTaskFailure.
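Those approve/reject links need to be tamper-proof, because whoever follows one resumes the workflow. One illustrative scheme (my addition, not from the original implementation) signs each link with an HMAC:

```python
import hashlib
import hmac


def approval_link(base_url, request_id, decision, secret):
    """Build a tamper-evident approve/reject URL (illustrative scheme).

    The HMAC stops a reviewer from editing the URL to flip the
    decision or target a different requestId.
    """
    msg = f"{request_id}:{decision}".encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"{base_url}/approvals/{request_id}/{decision}?sig={sig}"


def verify_approval(request_id, decision, sig, secret):
    """Recompute the signature and compare in constant time."""
    msg = f"{request_id}:{decision}".encode()
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

The API behind the link verifies the signature first, then looks up the task token and calls SendTaskSuccess or SendTaskFailure as before.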

Progress Tracking for Long Waits

For user-facing integrations, I update a status table that the frontend can query:
FUNCTION UpdateIntegrationStatus(requestId, status, details):
    UPDATE DynamoDB "IntegrationStatus":
        key: requestId
        status: status  // "Queued", "Submitted", "Awaiting Vendor", "Complete"
        details: details
        updatedAt: now()
The frontend polls this lightweight table instead of the workflow itself.
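A small sketch of that status writer. The status vocabulary comes from the comment above; the table name and attribute names are assumptions:

```python
import time

VALID_STATUSES = ("Queued", "Submitted", "Awaiting Vendor", "Complete")


def status_item(request_id, status, details):
    """Item for the assumed 'IntegrationStatus' table the frontend polls."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return {
        "requestId": request_id,
        "status": status,
        "details": details,
        "updatedAt": int(time.time()),
    }


def update_integration_status(request_id, status, details):
    import boto3  # deferred import keeps status_item() testable offline

    boto3.resource("dynamodb").Table("IntegrationStatus").put_item(
        Item=status_item(request_id, status, details)
    )
```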

Production Metrics

After four months in production:
Metric | Before | After
Integration Requests Processed | - | 14,000+
Average Wait Duration | - | 4.2 hours
Longest Successful Wait | - | 6 days (manual review)
Monthly Infrastructure Cost | ~$210 | ~$6.80
Polling-Related Lambda Invocations | 2.8M/month | 0
The cost reduction was significant: eliminating polling and idle compute reduced monthly spend by over 96%.

Lessons Learned

Always Configure Timeouts

My initial implementation omitted timeouts. One workflow remained in a waiting state for three weeks because the vendor's webhook endpoint was misconfigured on their side. Now every wait state has an explicit timeout with appropriate error handling.

Store Task Tokens Securely

Task tokens are essentially bearer credentials. Anyone with the token can complete your workflow. I store them in DynamoDB with TTL and validate webhook signatures before using the token.

Implement Dead Letter Queues on Both Queues

Messages that fail processing move to a DLQ for investigation. This is configured at the SQS queue level with a maxReceiveCount of 3.
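For reference, that redrive policy can be attached with boto3 like this (the queue names are placeholders):

```python
import json


def redrive_policy(dlq_arn, max_receive_count=3):
    """RedrivePolicy attribute value: move a message to the DLQ after
    max_receive_count failed receives."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    })


def attach_dlq(queue_url, dlq_arn):
    import boto3  # deferred import keeps redrive_policy() testable offline

    boto3.client("sqs").set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": redrive_policy(dlq_arn)},
    )
```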

Handle Duplicate Webhooks

Vendors sometimes send duplicate callbacks. I use conditional database updates—only process the webhook if the request exists AND hasn't been marked as processed. If the condition fails, I know it's a duplicate and skip processing.
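One way to express that conditional update with DynamoDB; the `processed` attribute name is my placeholder:

```python
def mark_processed_params(request_id):
    """Kwargs for a conditional UpdateItem: succeeds only if the request
    exists AND has not already been marked processed."""
    return {
        "Key": {"requestId": request_id},
        "UpdateExpression": "SET processed = :true",
        "ConditionExpression": (
            "attribute_exists(requestId) AND attribute_not_exists(processed)"
        ),
        "ExpressionAttributeValues": {":true": True},
    }


def handle_webhook_once(table, request_id):
    """Returns True for the first delivery, False for duplicates."""
    try:
        table.update_item(**mark_processed_params(request_id))
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # duplicate webhook: skip processing
    return True
```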

Idempotency in Worker Lambda

If the worker Lambda fails after submitting to the vendor but before completing, SQS will retry. The worker must handle this—check if already submitted before calling the vendor API again.
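A conditional put can serve as that "already submitted" check. This is a sketch; the marker-table approach and its trade-off are noted in the comments:

```python
def submission_marker_params(request_id):
    """Kwargs for a conditional PutItem acting as a 'submitted' marker."""
    return {
        "Item": {"requestId": request_id},
        "ConditionExpression": "attribute_not_exists(requestId)",
    }


def submit_once(table, message, submit):
    """Call submit(message) at most once per requestId across SQS retries.

    Trade-off: if submit() itself fails after the marker is written,
    the retry is skipped too; pair this with a DLQ alarm for that case.
    """
    try:
        table.put_item(**submission_marker_params(message["requestId"]))
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # an earlier attempt already reached the vendor
    submit(message)
    return True
```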

When to Apply This Pattern

Appropriate Use Cases:
  • External vendor/partner integrations with webhook callbacks
  • Third-party API calls with unpredictable response times
  • Human-in-the-loop approval workflows
  • Multi-system orchestration
  • Any operation where you "fire and wait"
Less Suitable For:
  • Sub-second synchronous operations
  • Simple fire-and-forget notifications
  • Internal service-to-service calls (use direct invocation)

Conclusion

The pattern—SQS entry point → Step Function → SQS with task token → Lambda → wait for callback—solved a genuine architectural challenge for my integration layer. The key benefits:
  • Decoupled entry: Clients post to SQS and get immediate acknowledgment
  • Zero-cost waiting: Step Functions pause without consuming resources
  • Automatic retries: SQS handles failures at each stage
  • Clean separation: Each component has a single responsibility
For anyone building integrations with external vendors, payment providers, or any system with unpredictable response times, this architecture cleanly separates workflow state from external dependencies. The initial investment in understanding the pattern pays dividends in reduced complexity and operational costs.
Ivan Kikhtan

Full-Stack Engineer & Technical Lead with 5+ years of experience building scalable cloud-native solutions. Passionate about serverless architectures, developer productivity, and sharing knowledge.

Connect on LinkedIn