How I Built Payment APIs: Architecture, Authentication & Integration
Notes on production-grade payment REST APIs — token authentication, two-phase commit, bill locking, idempotency keys, and the dispute-resolution patterns that actually hold up in audit.
Payment APIs sit at the worst end of the engineering trade-off curve: they demand strict correctness, near-zero downtime, and a full audit trail, and they are also the APIs that financial-partner engineers will read line by line, on a Friday, before they ship a customer-facing flow that touches money. This post is a field-tested walk through the choices that hold up in that review.
The shape of a good payment endpoint
Every payment endpoint in our stack returns the same envelope:
{
  "status": "OK",
  "code": "P0000",
  "message": "Payment accepted",
  "data": {
    "transactionId": "TXN-2026-04-A7F3-9912",
    "amount": "1250.00",
    "currency": "BDT",
    "billingPeriod": "2026-04",
    "lockedAt": "2026-04-30T08:21:14Z"
  },
  "trace": "Z9-4d2a"
}
Three things matter here:
- A short, fixed-format code (P0000, P1001, …) that maps cleanly to a partner-side state machine. Free-text message fields are for humans; partners must never branch on them.
- A trace id that ties a response back to a single row in the request log. That row is the source of truth in disputes.
- No camelCase / snake_case mixing. Pick one. We chose camelCase because most partner-side stacks today are TypeScript-ish, and partners care more about simple typing than about server-side aesthetics.
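To make the "branch on code, never on message" rule concrete, here is a minimal partner-side sketch in Python. The state names (PAID, RELOCK, MANUAL_REVIEW) and the function name are illustrative assumptions; only the codes P0000 and P1102 come from this post.

```python
import json
from typing import Tuple

# Hypothetical partner-side states. Only the codes P0000 and P1102 are
# taken from this post; the state names are made up for illustration.
NEXT_STATE = {
    "P0000": "PAID",    # payment accepted
    "P1102": "RELOCK",  # lock expired: call /bill/lock again
}


def next_state(raw_body: str) -> Tuple[str, str]:
    """Return (state, trace) for a response envelope.

    The decision uses only the code field. The message field is never
    read, so a wording change on the server cannot alter partner behaviour.
    """
    envelope = json.loads(raw_body)
    # Unknown codes fail safe into a manual-review state (an assumption).
    state = NEXT_STATE.get(envelope["code"], "MANUAL_REVIEW")
    return state, envelope["trace"]
```

Returning the trace alongside the state keeps the id needed for a dispute lookup attached to every decision the partner makes.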
Authentication
The cleanest pattern we’ve shipped is a two-step token flow:
- Partner POSTs API key + signed timestamp to /auth/token.
- Server returns a short-lived (15 min) JWT.
- Every subsequent call carries Authorization: Bearer <jwt>.
The signed-timestamp step matters: it prevents replay if an API key leaks into a partner’s logs, because the signature is only valid for a 60-second window.
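A minimal sketch of the signed-timestamp check, assuming HMAC-SHA256 over the string "apiKey.timestamp" with a per-partner signing secret that is separate from the API key. The algorithm, the message layout, and the function names are assumptions for illustration, not a description of the production scheme; only the 60-second window comes from the text.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 60  # the 60-second validity window described above


def sign_timestamp(signing_secret: bytes, api_key: str, timestamp: int) -> str:
    """Partner side: sign the API key together with a Unix timestamp."""
    message = f"{api_key}.{timestamp}".encode()
    return hmac.new(signing_secret, message, hashlib.sha256).hexdigest()


def verify_signed_timestamp(
    signing_secret: bytes,
    api_key: str,
    timestamp: int,
    signature: str,
    now: Optional[int] = None,
) -> bool:
    """Server side: reject stale timestamps first, then check the signature."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False  # outside the window: a captured request cannot be replayed
    expected = sign_timestamp(signing_secret, api_key, timestamp)
    # compare_digest runs in constant time, which avoids a timing side channel.
    return hmac.compare_digest(expected, signature)
```

Passing now explicitly makes the window testable without touching the clock; in a live handler it would default to the server time.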
The JWT body looks like this:
{
  "iss": "mtb.payments",
  "sub": "partner_12",
  "aud": "bill-collection",
  "scope": ["read:bill", "post:payment"],
  "exp": 1714467600,
  "ipHash": "a7…"
}
scope lets us issue read-only tokens to partner ops teams without exposing the payment posting endpoint. ipHash rejects a token used from outside the partner’s whitelisted IP range. The only secret that ever leaves our edge is the JWT, never the API key.
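The scope check can be sketched as a small function over already-verified claims. This assumes the JWT signature has been validated upstream by a JWT library and that the claims arrive as a plain dict; the function name is hypothetical. The ipHash check is left out because its derivation is not specified here.

```python
import time
from typing import Optional


def authorize(
    claims: dict,
    required_scope: str,
    audience: str,
    now: Optional[int] = None,
) -> bool:
    """Decide whether verified JWT claims allow a call.

    Assumes the token signature was already checked upstream.
    """
    now = int(time.time()) if now is None else now
    if claims.get("aud") != audience:
        return False  # token was issued for a different service
    if now >= claims.get("exp", 0):
        return False  # expired, or no expiry claim at all
    # A read-only ops token carries "read:bill" but not "post:payment".
    return required_scope in claims.get("scope", [])
```

With this shape, the payment-posting handler would ask for "post:payment" and the read endpoints for "read:bill", so a read-only token fails the first check without any endpoint-specific logic.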
Two-phase payment commit
The hardest class of bug in payments is the “partner thinks it succeeded, biller doesn’t see it” bug. We solve it with an explicit two-phase commit:
- Lock: partner calls POST /bill/lock to reserve the bill. Server returns a lockToken valid for 5 minutes. The biller’s downstream system now refuses any other payment on that bill until the lock expires.
- Pay: partner calls POST /bill/pay with the lockToken and their transaction id. Server commits and returns a final transactionId.
The lock step is what makes the system safe under partner-side retries. Without it, a partner that retries because of a timeout on the pay step can double-debit the customer. With it, the second attempt fails closed with a specific P1102 (Lock Expired) and the partner knows to call /bill/lock again.
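The lock-then-pay behaviour can be modelled with a small in-memory sketch. It is illustrative only: a real biller would keep this state in a transactional store, and the class name, the method names, and the rule that a settled bill refuses new locks are assumptions. The 5-minute lock lifetime and the P1102 code come from the text.

```python
import secrets
import time
from typing import Dict, Optional, Set, Tuple

LOCK_TTL_SECONDS = 5 * 60  # lockToken validity from the text


class BillLocks:
    """In-memory stand-in for the biller's lock table (illustrative only)."""

    def __init__(self) -> None:
        # bill_id -> (lock_token, expires_at)
        self._locks: Dict[str, Tuple[str, float]] = {}
        self._paid: Set[str] = set()

    def lock(self, bill_id: str, now: Optional[float] = None) -> Optional[str]:
        """Reserve a bill; return None if it is settled or already locked."""
        now = time.time() if now is None else now
        if bill_id in self._paid:
            return None  # assumption: a settled bill cannot be locked again
        held = self._locks.get(bill_id)
        if held is not None and held[1] > now:
            return None  # another live lock exists
        token = secrets.token_hex(16)
        self._locks[bill_id] = (token, now + LOCK_TTL_SECONDS)
        return token

    def pay(self, bill_id: str, lock_token: str, now: Optional[float] = None) -> str:
        """Commit a payment; return a response code."""
        now = time.time() if now is None else now
        held = self._locks.get(bill_id)
        if held is None or held[0] != lock_token or held[1] <= now:
            return "P1102"  # Lock Expired: fail closed
        # Consume the lock so a retried pay call cannot commit twice.
        del self._locks[bill_id]
        self._paid.add(bill_id)
        return "P0000"
```

A retried pay call with the same lockToken hits the P1102 branch because the first commit consumed the lock, which is the fail-closed behaviour described above.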
Idempotency keys
Every state-changing endpoint accepts an optional Idempotency-Key header. We store the (partner, key, response) tuple for 24 hours. A retry with the same key returns the original response byte-for-byte, including trace, so the partner sees a stable view.
The TTL is 24 hours and not “forever” because partners regenerate keys per day; storing them indefinitely is a slow leak of memory and a fast leak of PII.
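The replay behaviour can be sketched with an in-memory store keyed on (partner, key). The class and method names are hypothetical, and a production version would sit in a shared store with native expiry; the 24-hour TTL and the byte-for-byte replay come from the text.

```python
import time
from typing import Callable, Dict, Optional, Tuple

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as described above


class IdempotencyStore:
    """Replays the stored response for a repeated (partner, key) pair."""

    def __init__(self) -> None:
        # (partner, key) -> (response_bytes, expires_at)
        self._seen: Dict[Tuple[str, str], Tuple[bytes, float]] = {}

    def execute(
        self,
        partner: str,
        key: Optional[str],
        handler: Callable[[], bytes],
        now: Optional[float] = None,
    ) -> bytes:
        now = time.time() if now is None else now
        if key is None:
            return handler()  # the header is optional: no key, no replay
        cached = self._seen.get((partner, key))
        if cached is not None and cached[1] > now:
            return cached[0]  # identical bytes, including the original trace
        response = handler()
        self._seen[(partner, key)] = (response, now + IDEMPOTENCY_TTL_SECONDS)
        return response
```

Keying on the partner as well as the header value means two partners that happen to generate the same key never see each other's responses.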
Observability
Three telemetry surfaces; every payment touches all three:
- A canonical log line per request: trace, partner, endpoint, request hash, response code, total latency, and whether each downstream call hit cache or origin. This is the only thing dispute-resolution looks at.
- A metric: counter on (partner, endpoint, response_code). We alert on a 5-minute window where any partner’s error rate climbs above 2%.
- A distributed trace, sampled at 1 in 100. The full call graph is too expensive to keep, but the 1% sample is where we find the slow-DB-query problems.
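The alert rule on the metric can be sketched as a sliding window per partner. In practice this would be a query in the metrics system; the class name is hypothetical, and treating every code other than P0000 as an error is an assumption. The 5-minute window and the 2% threshold come from the text.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

WINDOW_SECONDS = 5 * 60       # 5-minute alert window
ERROR_RATE_THRESHOLD = 0.02   # alert when the error rate climbs above 2%


class ErrorRateMonitor:
    """Tracks per-partner responses over a sliding time window."""

    def __init__(self) -> None:
        # partner -> deque of (timestamp, is_error), oldest first
        self._events: Dict[str, Deque[Tuple[float, bool]]] = defaultdict(deque)

    def record(self, partner: str, response_code: str, at: float) -> None:
        # Assumption: any code other than P0000 counts as an error.
        self._events[partner].append((at, response_code != "P0000"))

    def should_alert(self, partner: str, now: float) -> bool:
        events = self._events[partner]
        # Drop events that have aged out of the window.
        while events and events[0][0] <= now - WINDOW_SECONDS:
            events.popleft()
        if not events:
            return False
        errors = sum(1 for _, is_error in events if is_error)
        return errors / len(events) > ERROR_RATE_THRESHOLD
```

The comparison is strict, so a partner sitting at exactly 2% does not page anyone.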
Latency budget
What “fast enough” means depends on who’s reading. For a partner whose checkout flow is interactive, p95 under 800 ms is the line. For a batch biller pushing 50k rows overnight, p99 under 4 s is fine — they care about throughput, not latency.
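Checking a latency budget against recorded samples can be sketched as follows. The nearest-rank percentile method and the function names are assumptions; a metrics backend may interpolate differently and report slightly different values.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def within_budget(samples_ms: Sequence[float], pct: float, budget_ms: float) -> bool:
    """True when the given percentile is strictly under the budget."""
    return percentile(samples_ms, pct) < budget_ms
```

For the two audiences above, an interactive partner would be checked with within_budget(samples, 95, 800) and a batch biller with within_budget(samples, 99, 4000).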
What I’d do differently
- Ship a partner SDK on day 1. We let partners hand-roll clients for six months. Five of them got the HMAC signing wrong in five different ways. An SDK would have removed an entire class of support tickets.
- Sign error responses too. We sign success responses; we don’t sign errors. A malicious middle-box could tell a partner “everything failed” to hide a successful payment from them. Unlikely in practice, but cheap to fix.
- Make the trace id parsable. Ours is an opaque string. If it embedded the partner id and a timestamp prefix, on-call could read it without looking it up.
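As a sketch of what a parsable trace id could look like, the format below packs a partner id, a UTC timestamp, and a short random suffix into one string. The layout ("p" + partner id, compact timestamp, 4 hex characters) is entirely hypothetical and is not the format we run today.

```python
import secrets
from datetime import datetime, timezone
from typing import Tuple

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"  # compact UTC timestamp, no hyphens


def make_trace_id(partner_id: int, at: datetime) -> str:
    """Build a trace id such as p12-20260430T082114Z-4d2a (hypothetical format)."""
    stamp = at.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    # Two random bytes give a 4-character hex suffix to separate same-second requests.
    return f"p{partner_id}-{stamp}-{secrets.token_hex(2)}"


def parse_trace_id(trace: str) -> Tuple[int, datetime]:
    """Recover (partner_id, timestamp) from a trace id without a log lookup."""
    partner_part, stamp, _suffix = trace.split("-")
    at = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return int(partner_part[1:]), at
```

A 4-character suffix is only enough for illustration; a real id would need a longer random part to avoid collisions within the same second.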
Payment APIs are unglamorous — most of the design is about making the boring case rock-solid so the interesting case (a real outage, a real dispute) is the only one anyone has to think about.