Webhook Retry Logic: Handling Failures Gracefully
Why Webhooks Fail
Webhook deliveries fail for many reasons: your server is temporarily down, a deployment is in progress, your handler has a bug that causes a 500 error, the network has a transient issue, or your server is overloaded and responding too slowly. These failures are inevitable — no system has 100% uptime.
When a webhook delivery fails, the provider needs to decide what to do. Most providers implement automatic retries — they will attempt to deliver the webhook again after a delay, often with exponential backoff. This retry mechanism is critical for reliability, but it also means your handler must be designed to work correctly with retries.
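To make exponential backoff concrete, here is a minimal sketch of how a sender might compute the delay before each retry. The function names, base delay, multiplier, and cap are all assumptions chosen for illustration. Real providers use their own schedules, often with random jitter added.

```javascript
// Hypothetical helper: delay in milliseconds before retry number `attempt` (1-based).
// baseMs, factor, and maxMs are illustrative values, not any provider's real settings.
function backoffDelayMs(attempt, { baseMs = 1000, factor = 2, maxMs = 3600000 } = {}) {
  const delay = baseMs * Math.pow(factor, attempt - 1);
  return Math.min(delay, maxMs); // never wait longer than the cap
}

// Full schedule for a given number of retries.
function backoffSchedule(retries, options) {
  return Array.from({ length: retries }, (_, i) => backoffDelayMs(i + 1, options));
}

console.log(backoffSchedule(5)); // [ 1000, 2000, 4000, 8000, 16000 ]
```

Because each delay doubles, a handful of retries can span anything from seconds to hours depending on the base delay, which is why a short outage is usually recoverable while a long one may exhaust the schedule.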
Understanding how different providers handle retries helps you build more robust webhook handlers. The retry policies vary significantly between providers in terms of the number of attempts, the delays between attempts, and what happens after all retries are exhausted.
How Providers Handle Retries
In live mode, Stripe retries webhook deliveries for up to 3 days with exponential backoff; in test mode it retries 3 times over a few hours. Failed events are shown in the Stripe Dashboard, where you can manually trigger a re-delivery. After consistent failures, Stripe may disable the webhook endpoint entirely.
GitHub does not automatically retry failed deliveries. Every delivery, including failed ones, is listed in the repository's webhook settings under "Recent Deliveries," where you can redeliver it manually. GitHub does not disable endpoints after failures, but prolonged failures generate warning emails.
Shopify retries webhooks for up to 48 hours. After 19 consecutive failures (regardless of retry timeline), Shopify automatically removes the webhook subscription. This is aggressive — a weekend outage could lose your webhooks. Monitor your Shopify webhook subscriptions proactively.
Slack retries failed events up to 3 times with backoff: almost immediately, then after about 1 minute, then after about 5 minutes. If all retries fail, the event is dropped. Slack also has a 3-second response timeout: if your handler takes longer, Slack treats the delivery as a failure.
Building Retry-Resilient Handlers
The most important principle for handling retries is idempotency: processing the same event multiple times should produce the same result as processing it once. If a payment webhook is retried, your handler should not charge the customer twice or create duplicate orders.
Implement idempotency using the event's unique ID:
```javascript
// Assumes `app` is an Express app with JSON body parsing enabled and
// `db` is a Prisma-style client with a `processedEvents` model.
app.post('/webhook', async (req, res) => {
  const eventId = req.body.id;

  // Check if we already processed this event
  const existing = await db.processedEvents.findUnique({ where: { eventId } });
  if (existing) {
    console.log(`Event ${eventId} already processed, skipping`);
    return res.status(200).send('OK');
  }

  // Process the event, then record it so later retries are skipped
  try {
    await processEvent(req.body);
    await db.processedEvents.create({ data: { eventId, processedAt: new Date() } });
  } catch (error) {
    console.error('Processing failed:', error);
    // A 5xx response tells the provider to retry later
    return res.status(500).send('Processing failed');
  }

  res.status(200).send('OK');
});
```
Return a 200 status quickly, ideally as soon as the event has been safely queued rather than after processing completes. Handle the actual processing work asynchronously. This prevents the provider from timing out and triggering unnecessary retries.
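The acknowledge-first approach can be sketched as follows. This is an illustration under stated assumptions, not a production design: the queue is a plain in-memory array, `handleWebhook` and `drainQueue` are hypothetical names, and `processEvent` is whatever async function does your real work. The `res` object only needs the Express-style `status().send()` shape.

```javascript
// In-memory stand-in for a durable queue (for example a database table or a
// message broker). Jobs held in memory are lost on restart, so a real
// deployment needs durable storage here.
const queue = [];

// Handler: enqueue the event and acknowledge right away.
function handleWebhook(req, res) {
  queue.push(req.body);        // no slow work on the request path
  res.status(200).send('OK');  // the provider sees a fast success
}

// Background worker: drains the queue and runs the slow work.
// `processEvent` is a hypothetical async function supplied by the caller.
async function drainQueue(processEvent) {
  const failed = [];
  while (queue.length > 0) {
    const event = queue.shift();
    try {
      await processEvent(event);
    } catch (error) {
      console.error(`Processing failed for event ${event.id}:`, error.message);
      failed.push(event);      // keep the event for a later attempt
    }
  }
  queue.push(...failed);
  return failed.length;        // number of events still waiting
}
```

Once you acknowledge before processing, the provider will not retry on your behalf, so the worker has to own retries itself. That is why failed events go back onto the queue instead of being discarded.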
Responding Correctly to the Provider
Providers determine retry behavior based on your response status code. Return 2xx (200, 201, 202, 204) to indicate successful receipt. Return 4xx to indicate a permanent failure (bad request, authentication error); some providers do not retry 4xx responses, while others, such as Stripe, treat any non-2xx response as a failure and retry it, so check your provider's documentation. Return 5xx to indicate a temporary failure that should be retried.
Be thoughtful about which status codes to return. If signature verification fails, return 401 — retrying will not help because the signature will be the same. If your database is temporarily unavailable, return 503 — a retry after the database recovers will succeed.
Do not return 200 if your handler failed to process the event but you want to prevent retries. This hides failures and can lead to data loss. Instead, return an appropriate error code and implement idempotent retry handling so retries work correctly.
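One way to keep these decisions consistent is to centralize them in a single function. The sketch below assumes two hypothetical error classes, `SignatureError` and `DependencyUnavailableError`, which your own code would throw. Neither comes from any provider SDK.

```javascript
// Hypothetical error classes used only for this illustration.
class SignatureError extends Error {}
class DependencyUnavailableError extends Error {}

// Maps a processing outcome to the HTTP status the handler should return.
function statusForError(error) {
  if (!error) return 200;                                      // processed or safely queued
  if (error instanceof SignatureError) return 401;             // permanent: a retry cannot help
  if (error instanceof DependencyUnavailableError) return 503; // temporary: retry later
  return 500;                                                  // unknown failure: allow a retry
}

console.log(statusForError(null));                                   // 200
console.log(statusForError(new SignatureError('bad signature')));    // 401
console.log(statusForError(new DependencyUnavailableError('db')));   // 503
console.log(statusForError(new Error('unexpected')));                // 500
```

Defaulting unknown errors to 500 rather than 200 follows the advice above: an unexplained failure stays visible to the provider and gets another chance.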
If your handler is slow (database operations, external API calls), respond with 200 immediately and process the event in a background job. This prevents timeouts while ensuring the event is eventually processed.
Testing Retry Scenarios with ReqPour
ReqPour's replay feature is ideal for testing retry handling. Capture a webhook event in the ReqPour dashboard, then replay it multiple times to verify your handler's idempotency logic. The replayed requests are identical to the original, simulating what a provider retry looks like.
Test these scenarios: process an event successfully, then replay it — your handler should detect the duplicate and skip it. Introduce a temporary database failure, let the handler return 500, then fix the issue and replay — the event should process correctly on the retry.
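Before replaying real traffic, it can help to rehearse both scenarios against a stripped-down model of the handler. Everything in this sketch is a hypothetical stand-in: the `processed` set plays the role of the processed-events table, `databaseUp` simulates an outage, and `orders` records the side effect that must not be duplicated.

```javascript
// In-memory stand-ins, for illustration only.
const processed = new Set();
const orders = [];
let databaseUp = true;

// Returns the HTTP status the handler would send for this event.
function handleEvent(event) {
  if (processed.has(event.id)) return 200; // duplicate: skip, but still acknowledge
  if (!databaseUp) return 500;             // temporary failure: ask for a retry
  orders.push(event.id);                   // the side effect we must not repeat
  processed.add(event.id);
  return 200;
}

// Scenario 1: successful delivery followed by a replay.
console.log(handleEvent({ id: 'evt_1' })); // 200
console.log(handleEvent({ id: 'evt_1' })); // 200, and no second order is created

// Scenario 2: failure during an outage, then recovery and replay.
databaseUp = false;
console.log(handleEvent({ id: 'evt_2' })); // 500
databaseUp = true;
console.log(handleEvent({ id: 'evt_2' })); // 200

console.log(orders); // [ 'evt_1', 'evt_2' ]
```

The check that matters is the final one: four deliveries produced exactly two orders.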
Also test your handler's response times. Use the ReqPour dashboard to monitor how long your handler takes to respond. If it approaches the provider's timeout threshold (typically 3-10 seconds), you need to move processing to a background job.
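You can also measure response time inside the handler itself. The wrapper below is a rough sketch: `timed` is a hypothetical helper, and the default `warnAfterMs` of 2000 is an arbitrary illustrative threshold that you would set comfortably below your provider's timeout.

```javascript
// Wraps an async handler function and reports how long it took to finish.
async function timed(handler, event, warnAfterMs = 2000) {
  const start = process.hrtime.bigint();
  const result = await handler(event);
  // hrtime.bigint() returns nanoseconds; convert the difference to milliseconds
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  if (elapsedMs > warnAfterMs) {
    console.warn(`Handler took ${elapsedMs.toFixed(0)} ms; consider a background job`);
  }
  return { result, elapsedMs };
}
```

Logging `elapsedMs` for every delivery gives you a baseline, so you notice a slow drift toward the timeout before the provider starts recording failures.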
During development, you can use the ReqPour request inspector to examine the exact payload of retried events. Some providers add retry metadata to the headers (like a retry count), which you can inspect in the dashboard to understand the retry behavior.
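As an example of reading retry metadata, Slack documents the `X-Slack-Retry-Num` and `X-Slack-Retry-Reason` headers on retried events. The sketch below assumes a Node-style headers object with lower-cased keys, and `readRetryInfo` is a hypothetical helper. Other providers use different headers or none at all, so verify the names against your provider's documentation.

```javascript
// Extracts retry metadata from request headers (Slack header names).
function readRetryInfo(headers) {
  const rawCount = headers['x-slack-retry-num'];
  const parsed = rawCount === undefined ? 0 : Number.parseInt(rawCount, 10);
  const retryCount = Number.isNaN(parsed) ? 0 : parsed; // ignore malformed values
  return {
    isRetry: retryCount > 0,
    retryCount,
    reason: headers['x-slack-retry-reason'] ?? null,
  };
}

console.log(readRetryInfo({}));
// { isRetry: false, retryCount: 0, reason: null }
console.log(readRetryInfo({ 'x-slack-retry-num': '2', 'x-slack-retry-reason': 'http_timeout' }));
// { isRetry: true, retryCount: 2, reason: 'http_timeout' }
```

Logging this alongside the event ID makes it easy to tell first deliveries from retries when you review handler logs.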