Web Services & APIs — Operations¶
Practical guide to deploying, documenting, securing, versioning, testing, and monitoring web service APIs.
API Specification Formats¶
Specifications are machine-readable contracts for APIs — enabling codegen, mock servers, linting, and documentation.
OpenAPI 3.1 (REST)¶
The industry standard for describing RESTful HTTP APIs. Version 3.1 aligns with JSON Schema draft 2020-12.
openapi: 3.1.0
info:
  title: Orders API
  version: 2.4.0
  contact:
    email: [email protected]
  license:
    name: Apache 2.0
servers:
  - url: https://api.example.com/v2
    description: Production
  - url: https://sandbox.api.example.com/v2
    description: Sandbox
paths:
  /orders/{orderId}:
    get:
      operationId: getOrder
      summary: Retrieve a single order
      tags: [Orders]
      parameters:
        - name: orderId
          in: path
          required: true
          schema:
            type: string
            format: uuid
      responses:
        "200":
          description: Order found
          content:
            application/json:
              schema:
                $ref: "#/components/schemas/Order"
        "404":
          $ref: "#/components/responses/NotFound"
      security:
        - bearerAuth: []
components:
  schemas:
    Order:
      type: object
      required: [id, status, createdAt]
      properties:
        id:
          type: string
          format: uuid
        status:
          type: string
          enum: [pending, confirmed, shipped, delivered, cancelled]
        createdAt:
          type: string
          format: date-time
  responses:
    NotFound:
      description: Resource not found
      content:
        application/json:
          schema:
            $ref: "#/components/schemas/ProblemDetail"
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT
Key OpenAPI 3.1 improvements over 3.0:
- Full JSON Schema 2020-12 alignment (replaces OpenAPI's extended subset)
- webhooks top-level field for describing out-of-band requests that the API sends to consumers
- discriminator improvements, const, $schema per-schema
- exclusiveMinimum/exclusiveMaximum now numeric (not boolean)
AsyncAPI 3.0 (Event-Driven APIs)¶
The OpenAPI equivalent for event-driven APIs over WebSocket, MQTT, Kafka, AMQP, and SNS/SQS.
asyncapi: 3.0.0
info:
  title: Order Events API
  version: 1.0.0
channels:
  orderCreated:
    address: orders.created
    messages:
      OrderCreated:
        payload:
          type: object
          properties:
            orderId:
              type: string
            customerId:
              type: string
operations:
  onOrderCreated:
    action: receive
    channel:
      $ref: "#/channels/orderCreated"
Protocol Buffers IDL (gRPC)¶
See architecture#protocol-buffers for the full .proto format. The .proto file IS the API spec for gRPC services.
Tooling comparison:
| Format | Ecosystem | Codegen | Mock Server | Linting |
|---|---|---|---|---|
| OpenAPI 3.1 | REST | Any language | Prism, WireMock | Spectral, Vacuum |
| AsyncAPI 3.0 | Event-driven | Node.js, Java | Microcks | AsyncAPI Studio |
| Protobuf | gRPC | Any language | grpc-go test server | buf lint |
| WSDL | SOAP | Java, .NET, Python | SoapUI | SoapUI |
API Gateways¶
An API gateway is the single entry point for all client traffic — handling routing, auth enforcement, rate limiting, observability, and protocol translation.
flowchart LR
C1[Mobile Client] --> GW[API Gateway]
C2[Browser] --> GW
C3[Partner API] --> GW
GW -->|/orders| OS[Orders Service]
GW -->|/users| US[User Service]
GW -->|/products| PS[Product Service]
GW --> Auth[Auth Service]
GW --> RL[Rate Limiter\nRedis]
GW --> Log[Observability\nDatadog / Grafana]
Kong Gateway¶
Open-source gateway built on NGINX + OpenResty (Lua). Enterprise tier adds RBAC, Dev Portal, and Vitals analytics.
# Kong declarative config (deck format)
services:
  - name: orders-service
    url: http://orders-service:8080
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
          policy: redis
          redis_host: redis
      - name: jwt
        config:
          claims_to_verify: [exp]
    routes:
      - name: orders-route
        paths: [/v2/orders]
        strip_path: false
        methods: [GET, POST, PUT, PATCH, DELETE]
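# Kong Admin API: add a plugin to a route.
# correlation-id generates a fresh UUID for every request. Putting $(uuidgen) in the
# curl command would be expanded once by the shell and reused for all requests.
curl -X POST http://kong:8001/routes/orders-route/plugins \
--data name=correlation-id \
--data config.header_name=X-Request-ID \
--data config.generator=uuid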
Envoy Proxy¶
High-performance C++ proxy developed at Lyft. Operates as the data plane in the Istio service mesh. Configured via xDS APIs (dynamic) or static YAML.
# Envoy static config — HTTP rate limit filter
http_filters:
  - name: envoy.filters.http.ratelimit
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
      domain: orders_api
      rate_limit_service:
        grpc_service:
          envoy_grpc:
            cluster_name: rate_limit_service
        transport_api_version: V3
AWS API Gateway¶
Managed gateway for REST, HTTP, and WebSocket APIs. Integrates natively with Lambda, ALB, and VPC Link.
# Create HTTP API (simpler, lower cost than REST API)
aws apigatewayv2 create-api \
--name orders-api \
--protocol-type HTTP \
--target arn:aws:lambda:us-east-1:123456789012:function:orders-handler
# Add JWT authorizer
aws apigatewayv2 create-authorizer \
--api-id abc123 \
--authorizer-type JWT \
--identity-source '$request.header.Authorization' \
--jwt-configuration Audience=orders-api,Issuer=https://auth.example.com \
--name JwtAuthorizer
Gateway comparison:
| Gateway | Deployment | Config Model | Best For |
|---|---|---|---|
| Kong | Self-hosted / Cloud | Declarative YAML / Admin API | Large teams, plugin ecosystem |
| Envoy | Self-hosted (sidecar) | xDS (dynamic) / YAML | Service mesh, Kubernetes |
| AWS API Gateway | Managed | Console / CDK / SAM | AWS-native serverless |
| Nginx | Self-hosted | Static config files | Simple reverse proxy |
| Traefik | Self-hosted | Auto-discover (Kubernetes) | Kubernetes ingress |
| Azure API Management | Managed | Portal / ARM / Bicep | Azure-native |
Authentication and Authorization¶
API Keys¶
Simplest scheme. Suitable for server-to-server or developer access where OAuth overhead is unneeded.
Best practices:
- Prefix keys by environment: sk_live_, sk_test_
- Store only the hash (SHA-256) in database — never plaintext
- Rotate on compromise; provide 30-day grace period during planned rotations
- Associate keys with scopes: orders:read, orders:write
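The practices above can be sketched in a few lines. This is a minimal illustration rather than a production key service: the function names and the in-memory dictionary standing in for a database table are assumptions made for the example.

```python
import hashlib
import secrets

# Hypothetical stand-in for a database table: SHA-256 hex digest -> scopes.
KEY_STORE = {}


def issue_api_key(environment: str, scopes: set) -> str:
    """Create a key such as sk_live_<random> and store only its hash."""
    prefix = {"live": "sk_live_", "test": "sk_test_"}[environment]
    key = prefix + secrets.token_urlsafe(32)
    KEY_STORE[hashlib.sha256(key.encode()).hexdigest()] = set(scopes)
    return key  # shown to the caller once; the plaintext is never persisted


def authorize(presented_key: str, required_scope: str) -> bool:
    """Hash the presented key and check that it carries the required scope."""
    digest = hashlib.sha256(presented_key.encode()).hexdigest()
    scopes = KEY_STORE.get(digest)
    return scopes is not None and required_scope in scopes
```

Because only the digest is stored, a leaked database dump does not reveal usable keys, and a lookup is a single indexed read on the hash.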
JWT (JSON Web Tokens)¶
Stateless bearer tokens. Three base64url-encoded parts: header, payload, signature.
// Payload claims
{
"sub": "user_01HXYZ",
"iss": "https://auth.example.com",
"aud": "orders-api",
"exp": 1745600000,
"iat": 1745596400,
"scope": "orders:read orders:write",
"jti": "01HXYZ-unique-token-id"
}
JWT security checklist:
- Use RS256 (asymmetric) for public key distribution, not HS256 (shared secret)
- Short expiry: 15 minutes for access tokens; refresh tokens via httpOnly cookies
- Validate iss, aud, exp, nbf on every request
- Include jti (JWT ID) for revocation lookup in Redis blocklist
- Never store sensitive data in payload — JWTs are encoded, not encrypted (use JWE for confidentiality)
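The claim checks from the list above could look roughly like this. The sketch deliberately covers only segment decoding and the iss, aud, exp, and nbf checks; signature verification is assumed to be done first by a JWT library, and the function names and the 30-second leeway are illustrative choices.

```python
import base64
import json
import time
from typing import Optional


def decode_segment(segment: str) -> dict:
    """Decode one base64url JWT segment (JWTs omit the '=' padding)."""
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def validate_claims(claims: dict, issuer: str, audience: str,
                    now: Optional[float] = None, leeway: int = 30) -> bool:
    """Check iss, aud, exp and nbf. The signature is NOT verified here."""
    now = time.time() if now is None else now
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if claims.get("iss") != issuer or audience not in audiences:
        return False
    if "exp" not in claims or now >= claims["exp"] + leeway:
        return False  # missing or expired
    if "nbf" in claims and now < claims["nbf"] - leeway:
        return False  # not valid yet
    return True
```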
OAuth 2.0 / OAuth 2.1¶
Authorization Code + PKCE (browser and mobile clients):
sequenceDiagram
participant U as User
participant C as Client App
participant AS as Auth Server
participant RS as Resource Server
C->>C: Generate code_verifier, code_challenge = BASE64URL(SHA256(code_verifier))
C->>AS: GET /authorize?response_type=code&client_id=...&code_challenge=...&code_challenge_method=S256
AS->>U: Login + Consent screen
U->>AS: Approve
AS->>C: Redirect with ?code=AUTH_CODE
C->>AS: POST /token {code, code_verifier, client_id}
AS->>C: {access_token, refresh_token, expires_in}
C->>RS: GET /orders Authorization: Bearer ACCESS_TOKEN
RS->>C: 200 {orders: [...]}
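The first step of the flow, generating the verifier and its S256 challenge, can be sketched with the standard library alone. The function names are illustrative; the encoding (base64url without padding over the SHA-256 digest) follows RFC 7636.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Return a random 43-character verifier (32 random bytes, base64url)."""
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()


def make_code_challenge(code_verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) with padding removed."""
    digest = hashlib.sha256(code_verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
```

The client keeps the verifier private, sends only the challenge on /authorize, and reveals the verifier on /token so the auth server can recompute and compare.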
Client Credentials (machine-to-machine):
curl -X POST https://auth.example.com/oauth/token \
-d grant_type=client_credentials \
-d client_id=service-account \
-d client_secret=secret \
-d scope="orders:read inventory:write"
OAuth 2.1 key changes (draft consolidation):
- PKCE required for every client using the authorization code flow
- Implicit flow removed
- Resource Owner Password Credentials (ROPC) flow removed
- Refresh tokens for public clients must be sender-constrained or rotated on each use
mTLS (Mutual TLS)¶
Both client and server present certificates — eliminates shared secrets for service-to-service auth.
# Generate a client key and a certificate signed by your CA
openssl genrsa -out client.key 2048
openssl req -new -key client.key -out client.csr \
-subj "/CN=orders-service/O=internal"
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key \
-CAcreateserial -out client.crt -days 365
# Call API with client cert
curl --cert client.crt --key client.key \
--cacert ca.crt \
https://internal-api.example.com/v2/orders
In Kubernetes: use SPIFFE/SPIRE for automatic workload identity, or let Istio inject mTLS transparently via sidecar.
API Versioning¶
Versioning Strategies¶
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| URI path | /v2/orders | Most visible, easy routing | Breaks resource identity |
| Query param | /orders?version=2 | Non-breaking URL | Easily forgotten, cache unfriendly |
| Header | API-Version: 2024-01-01 | Clean URLs | Less discoverable |
| Content negotiation | Accept: application/vnd.api+json;version=2 | RFC-compliant | Complex client setup |
URI versioning is the most common choice for public APIs: Stripe serves /v1 and Twilio puts a date in the path. Header versioning with calendar-based values (Stripe-Version: 2023-10-16, or GitHub's X-GitHub-Api-Version) allows fine-grained migrations; Stripe uses it alongside URI versioning.
Calendar-Based Versioning (Stripe Pattern)¶
Instead of major version bumps, every breaking change is released under a calendar date such as 2023-10-16.
Each API key locks to a version at creation. Customers opt into new versions explicitly.
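One way to serve many dated versions from a single codebase is to build every response in the latest shape and then apply "downgrade" transforms for callers pinned to older dates. The sketch below assumes that approach; the two changes, their dates, and the field names are hypothetical and not taken from any real API.

```python
from datetime import date


def _rename_total_back(order: dict) -> dict:
    """Before 2024-06-01 the field was called 'amount' (hypothetical change)."""
    order = dict(order)
    order["amount"] = order.pop("total")
    return order


def _drop_currency(order: dict) -> dict:
    """Before 2023-10-16 responses had no 'currency' field (hypothetical)."""
    return {k: v for k, v in order.items() if k != "currency"}


# Breaking changes, newest first, each paired with its downgrade transform.
VERSION_CHANGES = [
    (date(2024, 6, 1), _rename_total_back),
    (date(2023, 10, 16), _drop_currency),
]


def render_for_version(latest: dict, pinned: str) -> dict:
    """Downgrade a latest-shape response to the caller's pinned version."""
    pinned_date = date.fromisoformat(pinned)
    response = latest
    for introduced, downgrade in VERSION_CHANGES:
        if pinned_date < introduced:
            response = downgrade(response)
    return response
```

New breaking changes then only add one entry to the list, and handlers never branch on version themselves.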
Deprecation Headers (RFC 9745, RFC 8594)¶
HTTP/1.1 200 OK
Deprecation: @1767225600
Sunset: Fri, 01 Jan 2027 00:00:00 GMT
Link: <https://docs.example.com/migration/v3>; rel="successor-version"
- Deprecation: when the endpoint was deprecated, as a structured-field date in Unix seconds (RFC 9745); @1767225600 is 2026-01-01T00:00:00Z
- Sunset: when it will stop working, as an HTTP-date (RFC 8594)
- Link: migration guide
Non-Breaking vs Breaking Changes¶
Non-breaking (safe to ship):
- Adding optional request fields
- Adding new response fields
- Adding new endpoints
- New enum values (unless clients use exhaustive matching)
Breaking (require new version):
- Removing or renaming fields
- Changing field types
- Changing HTTP method for an operation
- Altering authentication requirements
- Removing enum values
Rate Limiting¶
Rate limiting protects services from abuse, ensures fair usage, and enables monetization tiers.
Algorithms¶
Token Bucket (allow bursting):
capacity = 100 tokens
refill_rate = 10 tokens/second
on request:
    if tokens >= cost:
        tokens -= cost
        return ALLOW
    else:
        return 429 Too Many Requests
AWS API Gateway throttling uses a token bucket. Kong's bundled rate-limiting plugin counts requests in fixed windows instead.
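The token bucket pseudocode above leaves the refill step implicit. A runnable sketch might refill lazily on each request, as below; the class name and the injectable clock (used to make the behaviour testable) are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket that refills lazily whenever a request arrives."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = capacity          # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        earned = (now - self.updated) * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller responds with 429 Too Many Requests
```

With capacity 100 and a refill rate of 10 per second, a client can burst 100 requests at once and then sustain 10 requests per second.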
Sliding Window Log (most precise):
Stores timestamp of each request. Counts requests within [now - window, now]. High memory cost at scale.
Sliding Window Counter (approximation, low memory):
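Keeps two counters per key (current window and previous window), typically in Redis. The request count is estimated as current + previous × (share of the previous window still covered by the sliding window).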
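The two-counter approximation can be sketched in memory as follows. A Redis-backed version would keep the same two counters per key; the class name and the injectable clock are assumptions for the example.

```python
import time
from typing import Callable, Optional


class SlidingWindowCounter:
    """Approximate sliding window using current and previous fixed windows."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.index: Optional[int] = None  # number of the current fixed window
        self.current = 0
        self.previous = 0

    def allow(self) -> bool:
        now = self.clock()
        index = int(now // self.window)
        if index != self.index:
            # Roll over. The old count only matters if it is the adjacent window.
            adjacent = self.index is not None and index - self.index == 1
            self.previous = self.current if adjacent else 0
            self.current = 0
            self.index = index
        elapsed_fraction = (now - index * self.window) / self.window
        # Weight the previous window by the share still inside the sliding window.
        estimated = self.current + self.previous * (1.0 - elapsed_fraction)
        if estimated < self.limit:
            self.current += 1
            return True
        return False
```

For example, 15 seconds into a 60-second window, 75% of the previous window's count still weighs against the limit.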
Fixed Window (simplest, boundary spike risk):
Resets counter at fixed intervals. A burst at 11:59:59 and 12:00:01 yields 2× the allowed rate.
Response Headers¶
HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1745600000
On 429:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1745600000
Content-Type: application/problem+json
{
"type": "https://api.example.com/errors/rate-limit-exceeded",
"title": "Too Many Requests",
"status": 429,
"detail": "You have exceeded 1000 requests per minute."
}
Rate Limit Keys¶
Choose the right granularity:
| Key | Use Case |
|---|---|
| IP address | Unauthenticated public APIs, DDoS protection |
| API key | Developer tier enforcement |
| User ID | Per-account limits after auth |
| Endpoint | Expensive operations (e.g., /search) |
| Tenant ID | SaaS multi-tenant isolation |
CORS (Cross-Origin Resource Sharing)¶
CORS restricts which browser origins can call your API. It does NOT protect server-to-server calls.
# Preflight request (browser auto-sends for non-simple requests)
OPTIONS /v2/orders HTTP/1.1
Origin: https://app.example.com
Access-Control-Request-Method: POST
Access-Control-Request-Headers: Authorization, Content-Type
# Server response
HTTP/1.1 204 No Content
Access-Control-Allow-Origin: https://app.example.com
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS
Access-Control-Allow-Headers: Authorization, Content-Type, X-Request-ID
Access-Control-Max-Age: 86400
Access-Control-Allow-Credentials: true
Critical rules:
- Never set Access-Control-Allow-Origin: * with Access-Control-Allow-Credentials: true — browsers block it
- Maintain an allowlist of trusted origins; validate dynamically against it
- Cache preflight with Access-Control-Max-Age to reduce OPTIONS overhead
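Dynamic validation against an allowlist might be sketched like this. The origins in the set and the function name are hypothetical; the header names are the standard CORS ones shown above, plus Vary: Origin so that caches do not reuse a response across origins.

```python
from typing import Optional

# Hypothetical allowlist of trusted browser origins.
ALLOWED_ORIGINS = frozenset({
    "https://app.example.com",
    "https://admin.example.com",
})


def cors_headers(origin: Optional[str]) -> dict:
    """Return CORS response headers for an allowlisted origin, else none."""
    if origin not in ALLOWED_ORIGINS:
        return {}  # no CORS headers: the browser refuses the cross-origin read
    return {
        "Access-Control-Allow-Origin": origin,  # echo the exact origin, never "*"
        "Access-Control-Allow-Credentials": "true",
        "Access-Control-Max-Age": "86400",
        "Vary": "Origin",  # responses differ per origin, so caches must key on it
    }
```

Matching against exact strings avoids the common mistake of substring or suffix checks, which can be bypassed by look-alike domains.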
API Design Best Practices¶
Resource Naming¶
# Good — noun-based, plural, lowercase
GET /v2/orders
POST /v2/orders
GET /v2/orders/{orderId}
PUT /v2/orders/{orderId}
PATCH /v2/orders/{orderId}
DELETE /v2/orders/{orderId}
# Nested resources — use sparingly; max 2 levels deep
GET /v2/orders/{orderId}/items
POST /v2/orders/{orderId}/items
# Actions (verbs) — use only for operations that don't map to CRUD
POST /v2/orders/{orderId}/cancel
POST /v2/orders/{orderId}/refund
POST /v2/payments/{paymentId}/capture
Idempotency Keys¶
Prevent duplicate processing when clients retry on network failure.
POST /v2/orders HTTP/1.1
Idempotency-Key: 01HXYZ-unique-request-id
Content-Type: application/json
{"productId": "prod_123", "quantity": 2}
Server logic:
1. Hash Idempotency-Key → look up in idempotency store (Redis/DB)
2. If found and result cached → return cached response immediately
3. If found and in-flight → return 409 Conflict or wait
4. If not found → process, store result keyed to hash, return result
TTL: 24–48 hours (per Stripe: 24h)
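The four steps above can be sketched with an in-memory store. This is a single-process illustration only: a real deployment needs an atomic set-if-absent in the shared store (for example Redis SET with NX) to close the race between lookup and insert. The class name, the 201 status, and the choice of 409 for in-flight requests are assumptions.

```python
import hashlib
import time
from typing import Callable, Tuple

IN_FLIGHT = object()       # marker for a request that is still being processed
TTL_SECONDS = 24 * 3600    # keep results for 24 hours


class IdempotencyStore:
    def __init__(self, clock: Callable[[], float] = time.time):
        self._entries = {}  # key hash -> (expires_at, result or IN_FLIGHT)
        self._clock = clock

    def handle(self, idempotency_key: str,
               process: Callable[[], dict]) -> Tuple[int, dict]:
        key_hash = hashlib.sha256(idempotency_key.encode()).hexdigest()
        now = self._clock()
        entry = self._entries.get(key_hash)
        if entry is not None and entry[0] > now:
            if entry[1] is IN_FLIGHT:
                return 409, {"title": "Request with this key is in progress"}
            return entry[1]  # replay the cached (status, body)
        self._entries[key_hash] = (now + TTL_SECONDS, IN_FLIGHT)
        try:
            result = (201, process())
        except Exception:
            del self._entries[key_hash]  # allow a retry after a failure
            raise
        self._entries[key_hash] = (now + TTL_SECONDS, result)
        return result
```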
Pagination¶
Cursor-based (recommended for large/real-time datasets):
// Request: GET /v2/orders?limit=20&after=01HXYZ
{
"data": [...],
"pagination": {
"limit": 20,
"hasNextPage": true,
"nextCursor": "01HABC",
"hasPrevPage": true,
"prevCursor": "01HWXY"
}
}
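A forward-only version of cursor pagination could be sketched as below. The opaque base64 cursor format and the function names are assumptions; the essential idea is that the cursor encodes the last seen sort key, so inserts elsewhere in the dataset do not shift the page. In SQL the list filter would typically be a WHERE id > :last_id ORDER BY id LIMIT query.

```python
import base64
import json
from typing import Optional


def encode_cursor(last_id: str) -> str:
    """Wrap the last seen id in an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps({"id": last_id}).encode()).decode()


def decode_cursor(cursor: str) -> str:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["id"]


def paginate(rows: list, limit: int, after: Optional[str] = None) -> dict:
    """Return one page. rows must be sorted by a unique, stable key ('id')."""
    last_id = decode_cursor(after) if after else None
    remaining = [r for r in rows if last_id is None or r["id"] > last_id]
    page = remaining[:limit]
    has_next = len(remaining) > limit
    return {
        "data": page,
        "pagination": {
            "limit": limit,
            "hasNextPage": has_next,
            "nextCursor": encode_cursor(page[-1]["id"]) if has_next else None,
        },
    }
```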
Offset-based (simpler, avoid for real-time data — page drift on inserts):
// Request: GET /v2/orders?limit=20&offset=40
{
"data": [...],
"pagination": {
"total": 1847,
"limit": 20,
"offset": 40,
"pages": 93
}
}
Standardized Error Responses (RFC 9457 / Problem Details)¶
{
"type": "https://api.example.com/errors/validation-error",
"title": "Validation Error",
"status": 422,
"detail": "Request body contains invalid fields.",
"instance": "/v2/orders/01HXYZ",
"errors": [
{
"field": "quantity",
"message": "Must be a positive integer",
"code": "INVALID_VALUE"
},
{
"field": "productId",
"message": "Product not found",
"code": "RESOURCE_NOT_FOUND"
}
],
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736"
}
Always include traceId or requestId for support/debugging correlation.
Long-Running Operations (202 Async Pattern)¶
# 1. Client submits job
POST /v2/reports HTTP/1.1
{"type": "monthly-revenue", "month": "2026-03"}
# 2. Server accepts immediately
HTTP/1.1 202 Accepted
Location: /v2/reports/jobs/job_01HXYZ
Retry-After: 30
# 3. Client polls
GET /v2/reports/jobs/job_01HXYZ
# 4a. Still processing
HTTP/1.1 200 OK
{"status": "processing", "progress": 42, "estimatedCompletion": "2026-04-25T10:15:00Z"}
# 4b. Complete
HTTP/1.1 200 OK
{"status": "complete", "resultUrl": "/v2/reports/rpt_01HABC", "expiresAt": "2026-04-26T10:00:00Z"}
# 5. Retrieve result
GET /v2/reports/rpt_01HABC
Alternative: use webhook callback instead of polling — POST /v2/reports body includes callbackUrl.
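On the client side, the polling loop in steps 3 and 4 might look like the sketch below. The HTTP call is injected as fetch_status so the example stays self-contained; that callable, the "failed" status, and the timeout default are assumptions not defined by the API above.

```python
import time
from typing import Callable


def poll_job(fetch_status: Callable[[], dict], interval: float = 30.0,
             timeout: float = 600.0,
             sleep: Callable[[float], None] = time.sleep,
             clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the job completes and return its resultUrl."""
    deadline = clock() + timeout
    while True:
        job = fetch_status()  # e.g. GET /v2/reports/jobs/<id>, parsed as JSON
        if job["status"] == "complete":
            return job["resultUrl"]
        if job["status"] == "failed":  # hypothetical terminal error state
            raise RuntimeError("job failed")
        if clock() + interval > deadline:
            raise TimeoutError("job did not finish before the deadline")
        sleep(interval)  # in practice, honour the server's Retry-After value
```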
Filtering, Sorting, Searching¶
# Filtering — use query params
GET /v2/orders?status=pending&customerId=cust_123&createdAfter=2026-01-01
# Sorting — field and direction
GET /v2/orders?sort=-createdAt,status # leading minus = desc, no prefix = asc (a literal + decodes to a space unless sent as %2B)
# Sparse fieldsets — reduce payload size
GET /v2/orders?fields=id,status,total
# Full-text search
GET /v2/products?q=wireless+headphones&category=electronics
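Server-side handling of the sort parameter could be sketched as follows. The allowlist of sortable fields and the function name are hypothetical; validating against an allowlist keeps clients from sorting on unindexed or internal columns.

```python
# Hypothetical allowlist of fields that clients may sort on.
ALLOWED_SORT_FIELDS = {"createdAt", "status", "total"}


def parse_sort(param: str) -> list:
    """Turn '-createdAt,status' into [('createdAt', 'desc'), ('status', 'asc')]."""
    result = []
    for raw in param.split(","):
        token = raw.strip()  # a '+' prefix usually arrives decoded as a space
        if not token:
            continue
        direction = "desc" if token.startswith("-") else "asc"
        field = token.lstrip("+-")
        if field not in ALLOWED_SORT_FIELDS:
            raise ValueError(f"unsupported sort field: {field}")
        result.append((field, direction))
    return result
```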
API First Design¶
Design the API contract before writing implementation code.
Workflow:
1. Write OpenAPI spec in YAML (use Spectral to lint against rules)
2. Generate mock server with Prism: prism mock openapi.yaml
3. Share mock URL with frontend team — both sides develop in parallel
4. Generate server stubs with oapi-codegen (Go), openapi-generator (Java/Python/etc.)
5. Write implementation against generated interfaces
6. Run contract tests against live server to verify spec compliance
# Prism mock server (read OpenAPI spec, serve mock responses)
npx @stoplight/prism-cli mock openapi.yaml --port 4010
# Call mock
curl http://localhost:4010/v2/orders/01HXYZ \
-H "Authorization: Bearer test-token"
# Prism validation proxy (forward to real server, validate request/response against spec)
npx @stoplight/prism-cli proxy openapi.yaml http://localhost:8080
Testing¶
REST API Testing (curl)¶
# GET with auth header and pretty JSON
curl -s -X GET https://api.example.com/v2/orders/01HXYZ \
-H "Authorization: Bearer $TOKEN" \
-H "Accept: application/json" | jq
# POST with JSON body
curl -s -X POST https://api.example.com/v2/orders \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-H "Idempotency-Key: $(uuidgen)" \
-d '{"productId": "prod_123", "quantity": 2}' | jq
# Test rate limiting — fire 10 requests rapidly
for i in {1..10}; do
curl -s -o /dev/null -w "%{http_code}\n" \
-H "Authorization: Bearer $TOKEN" \
https://api.example.com/v2/orders
done
# Inspect headers only
curl -sI https://api.example.com/v2/orders
# Follow redirects, show timing
curl -v -w "@curl-format.txt" -L https://api.example.com/v2/orders
gRPC Testing (grpcurl)¶
# Install
brew install grpcurl
# List services (server reflection must be enabled)
grpcurl -plaintext localhost:50051 list
# Describe a service
grpcurl -plaintext localhost:50051 describe orders.OrderService
# Unary call
grpcurl -plaintext \
-H "Authorization: Bearer $TOKEN" \
-d '{"order_id": "01HXYZ"}' \
localhost:50051 orders.OrderService/GetOrder
# Server streaming call
grpcurl -plaintext \
-d '{"customer_id": "cust_123"}' \
localhost:50051 orders.OrderService/WatchOrders
# Call with TLS (all flags must come before the address)
grpcurl \
-cert client.crt -key client.key -cacert ca.crt \
-d '{"order_id": "01HXYZ"}' \
api.example.com:443 orders.OrderService/GetOrder
WebSocket Testing (wscat)¶
# Install
npm install -g wscat
# Connect to WebSocket server
wscat -c wss://api.example.com/ws \
--header "Authorization: Bearer $TOKEN"
# Send a message (after connecting)
> {"type": "subscribe", "channel": "orders", "customerId": "cust_123"}
< {"type": "subscribed", "channel": "orders"}
< {"type": "order.updated", "orderId": "01HXYZ", "status": "shipped"}
# Connect with subprotocol
wscat -c wss://api.example.com/ws --subprotocol "v2.orders"
GraphQL Testing (curl + jq)¶
# Introspection query
curl -s -X POST https://api.example.com/graphql \
-H "Content-Type: application/json" \
-d '{"query": "{ __schema { types { name } } }"}' | jq '.data.__schema.types[].name'
# Query with variables
curl -s -X POST https://api.example.com/graphql \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"query": "query GetOrder($id: ID!) { order(id: $id) { status total } }",
"variables": {"id": "01HXYZ"}
}' | jq
# Mutation
curl -s -X POST https://api.example.com/graphql \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"query": "mutation CancelOrder($id: ID!) { cancelOrder(id: $id) { success } }",
"variables": {"id": "01HXYZ"}
}' | jq
Load Testing (k6)¶
// k6 load test script — orders API
import http from "k6/http";
import { check, sleep } from "k6";
import { Rate } from "k6/metrics";
const errorRate = new Rate("errors");
export const options = {
stages: [
{ duration: "30s", target: 50 }, // ramp up to 50 VUs
{ duration: "2m", target: 50 }, // hold
{ duration: "30s", target: 200 }, // spike to 200 VUs
{ duration: "1m", target: 200 }, // hold spike
{ duration: "30s", target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ["p(95)<500"], // 95th percentile < 500ms
errors: ["rate<0.01"], // error rate < 1%
},
};
export default function () {
const res = http.get("https://api.example.com/v2/orders", {
headers: { Authorization: `Bearer ${__ENV.API_TOKEN}` },
});
const ok = check(res, {
"status is 200": (r) => r.status === 200,
"response time < 500ms": (r) => r.timings.duration < 500,
});
errorRate.add(!ok);
sleep(1);
}
Contract Testing (Pact)¶
Consumer-driven contract tests verify that API providers honour contracts expected by consumers.
# Consumer writes expectations → generates pact file
# Provider verifies pact file against running service
# Publish to Pact Broker
npx pact-broker publish ./pacts \
--broker-base-url https://your-pact-broker.example.com \
--consumer-app-version $(git rev-parse HEAD)
# Provider verifies
npx pact-provider-verifier \
--provider-base-url http://localhost:8080 \
--pact-broker-base-url https://your-pact-broker.example.com \
--provider orders-service
Monitoring and Observability¶
Key Metrics (RED Method)¶
| Metric | Description | Alert Threshold (example) |
|---|---|---|
| Rate | Requests per second | Traffic drop > 50% vs baseline |
| Errors | 5xx error rate | > 1% over 5 minutes |
| Duration | p50, p95, p99 latency | p99 > 1000ms |
Additional API-specific metrics:
- 4xx rate (client errors) — spike may indicate breaking change or client bug
- Auth failure rate — spike indicates credential attack or misconfiguration
- Rate limit hit rate (429 responses) — indicates capacity planning needs
- Payload size distribution — detect runaway requests
Distributed Tracing (OpenTelemetry)¶
# Node.js — auto-instrumentation with OTLP export
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
# Inject trace context headers
GET /v2/orders HTTP/1.1
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: rend=congo
Propagate traceparent across all service boundaries. Every response should include X-Request-ID or X-Trace-ID tied to the trace.
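Instrumentation libraries normally handle propagation, but the mechanics are simple enough to sketch by hand: keep the trace ID, mint a new span ID for each outgoing call. The function names are illustrative, and the parser only accepts the four-field layout shown above.

```python
import re
import secrets
from typing import Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming: Optional[str]) -> str:
    """Header for an outgoing call: same trace id, freshly generated span id."""
    parent = parse_traceparent(incoming) if incoming else None
    trace_id = parent["trace_id"] if parent else secrets.token_hex(16)
    flags = "01" if parent is None or parent["sampled"] else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```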
Structured Logging¶
{
"level": "info",
"timestamp": "2026-04-25T10:00:00.123Z",
"service": "orders-api",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"method": "GET",
"path": "/v2/orders/01HXYZ",
"statusCode": 200,
"durationMs": 47,
"customerId": "cust_123",
"region": "us-east-1"
}
Health Endpoints¶
# Liveness — is the process alive?
GET /health/live
HTTP/1.1 200 OK
{"status": "ok"}
# Readiness — is the service ready to receive traffic?
GET /health/ready
HTTP/1.1 200 OK
{
"status": "ok",
"checks": {
"database": "ok",
"cache": "ok",
"dependencyServiceA": "ok"
}
}
# Degraded state
HTTP/1.1 503 Service Unavailable
{
"status": "degraded",
"checks": {
"database": "ok",
"cache": "error",
"dependencyServiceA": "ok"
}
}
Circuit Breaker Pattern¶
Prevents cascading failures when a downstream dependency is degraded.
States:
CLOSED → normal operation, requests pass through
OPEN → dependency is failing; requests fail fast with 503
HALF_OPEN → test probe requests sent; if success → CLOSED, if fail → OPEN
Transition triggers:
CLOSED → OPEN: failure rate > 50% over last 10 requests (or time window)
OPEN → HALF_OPEN: after cooldown period (e.g. 30 seconds)
HALF_OPEN → CLOSED: 3 consecutive successes
HALF_OPEN → OPEN: 1 failure
Libraries: Resilience4j (Java), Polly (.NET), opossum (Node.js), gobreaker (Go).
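The state machine above is small enough to sketch directly, using the same example thresholds (more than 50% failures over the last 10 requests, 30-second cooldown, 3 successes to close). The class and method names are illustrative and not modelled on any of the listed libraries.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, window: int = 10, failure_threshold: float = 0.5,
                 cooldown: float = 30.0, successes_to_close: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self.window = window
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.successes_to_close = successes_to_close
        self.clock = clock
        self.state = "CLOSED"
        self.failures = []            # recent outcomes while CLOSED, True = failure
        self.opened_at = 0.0
        self.half_open_successes = 0

    def allow_request(self) -> bool:
        """Call before each request; False means fail fast (e.g. with 503)."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"
            self.half_open_successes = 0
        return self.state != "OPEN"

    def record(self, success: bool) -> None:
        """Call after each request that was allowed through."""
        if self.state == "HALF_OPEN":
            if not success:
                self._open()          # a single failed probe re-opens the circuit
                return
            self.half_open_successes += 1
            if self.half_open_successes >= self.successes_to_close:
                self.state = "CLOSED"
                self.failures = []
        elif self.state == "CLOSED":
            self.failures = (self.failures + [not success])[-self.window:]
            full = len(self.failures) == self.window
            if full and sum(self.failures) / self.window > self.failure_threshold:
                self._open()

    def _open(self) -> None:
        self.state = "OPEN"
        self.opened_at = self.clock()
```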
Webhooks as a Product¶
For APIs that offer webhooks, treat the delivery system as a first-class product.
Delivery Architecture¶
sequenceDiagram
participant ES as Event Source
participant Q as Message Queue
participant WD as Webhook Dispatcher
participant C as Customer Server
ES->>Q: Publish event
Q->>WD: Consume event
WD->>C: POST /webhook (signed payload)
alt Success (2xx)
C->>WD: 200 OK (within 5s)
WD->>Q: Ack message
else Failure / Timeout
WD->>Q: Nack / retry
WD->>WD: Exponential backoff\n(5s, 25s, 125s, ...)
WD->>WD: After 72h: mark dead, alert
end
Payload Signing (HMAC-SHA256)¶
import hashlib
import hmac
import time


def sign_payload(secret: str, payload: bytes) -> str:
    # Sign "<timestamp>.<payload>" so the timestamp cannot be changed independently.
    timestamp = str(int(time.time()))
    message = f"{timestamp}.{payload.decode()}".encode()
    signature = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={signature}"


def verify_signature(secret: str, payload: bytes, header: str, tolerance: int = 300) -> bool:
    # Header format: "t=<unix seconds>,v1=<hex HMAC-SHA256>"
    parts = dict(part.split("=", 1) for part in header.split(","))
    timestamp = int(parts["t"])
    if abs(time.time() - timestamp) > tolerance:
        return False  # outside the tolerance window: possible replay attack
    message = f"{timestamp}.{payload.decode()}".encode()
    expected = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the signature through timing.
    return hmac.compare_digest(expected, parts["v1"])
Reliability Patterns¶
| Pattern | Implementation |
|---|---|
| Idempotency keys | Include webhookId in payload; consumer deduplicates |
| Immediate 200 | Return 200 before processing; use queue for async work |
| Retry with backoff | 5s → 25s → 125s → 625s; max 72h delivery window |
| Dead letter queue | After max retries, route to DLQ; alert operator |
| Event ordering | Include sequence counter; consumer handles out-of-order |
| CloudEvents format | Standardize payload envelope (specversion, type, source, id) |
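The retry row above implies a schedule that can be computed up front. A rough sketch, assuming a pure factor-of-5 growth with no cap or jitter (the function name and parameters are illustrative):

```python
def retry_schedule(base: float = 5.0, factor: float = 5.0,
                   max_window: float = 72 * 3600) -> list:
    """Delays between attempts (5s, 25s, 125s, ...) that fit in the window."""
    delays = []
    elapsed = 0.0
    delay = base
    # Stop once the next wait would push delivery past the window.
    while elapsed + delay <= max_window:
        delays.append(delay)
        elapsed += delay
        delay *= factor
    return delays
```

With these defaults only seven retries fit inside 72 hours, because the eighth wait alone would exceed the window. Dispatchers that want more attempts usually cap the maximum delay.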
Webhook Management Portal (product features)¶
- Endpoint registration with per-event-type subscription
- Delivery attempt log with request/response bodies (last 30 days)
- Manual replay of failed deliveries
- HMAC secret rotation (grace period supporting both old + new key)
- 200 OK webhook test endpoint for validation
API Tooling Ecosystem¶
| Category | Tools |
|---|---|
| API spec editors | Stoplight Studio, Swagger Editor, Redocly |
| Linting | Spectral (OpenAPI/AsyncAPI), buf lint (Protobuf) |
| Mock servers | Prism, WireMock, Microcks |
| Client testing | Postman, Insomnia, Bruno, HTTPie |
| CLI testing | curl, httpie, grpcurl, wscat, mqtt-cli |
| Load testing | k6, Gatling, Locust, Apache JMeter |
| Contract testing | Pact, Dredd, Schemathesis |
| Documentation | Redoc, Swagger UI, Scalar, Mintlify |
| API gateways | Kong, Envoy, AWS API Gateway, Traefik |
| Service mesh | Istio, Linkerd, Consul Connect |
| Code generation | openapi-generator, oapi-codegen, buf generate |
| Monitoring | Datadog APM, Grafana + Prometheus, New Relic |
Sources¶
OpenAPI & Specification¶
- OpenAPI 3.1.0 Specification
- AsyncAPI 3.0 Documentation
- Spectral OpenAPI Linting — Stoplight
- Microsoft REST API Guidelines
- Google API Design Guide
- Zalando RESTful API Guidelines
Authentication & Security¶
- OAuth 2.0 — RFC 6749
- OAuth 2.1 Draft
- PKCE — RFC 7636
- JSON Web Tokens — RFC 7519
- OWASP API Security Top 10 (2023)
- SPIFFE/SPIRE — Workload Identity
API Design¶
- Idempotency Keys — Stripe Docs
- HTTP Problem Details — RFC 9457
- Sunset Header — RFC 8594
- Cursor Pagination — Slack Engineering
- API Versioning — Stripe Blog
Rate Limiting & Gateways¶
- Kong Gateway Documentation
- Envoy Proxy — Rate Limiting
- AWS API Gateway Documentation
- Rate Limiting Algorithms — Stripe Engineering