How I Built Payment APIs: Architecture, Authentication & Integration

Payment APIs sit at the worst end of the engineering trade-off curve. They demand strict correctness, near-zero downtime, and a full audit trail, and they are also the APIs that a financial partner's engineers will read line by line, on a Friday, before shipping a customer-facing flow that touches money. This post is a field-tested walk through the choices that hold up in that review.

The shape of a good payment endpoint

Every payment endpoint in our stack returns the same envelope:

{
  "status": "OK",
  "code": "P0000",
  "message": "Payment accepted",
  "data": {
    "transactionId": "TXN-2026-04-A7F3-9912",
    "amount": "1250.00",
    "currency": "BDT",
    "billingPeriod": "2026-04",
    "lockedAt": "2026-04-30T08:21:14Z"
  },
  "trace": "Z9-4d2a"
}

Three things matter here:

A short alphanumeric code (P0000, P1001, …) that maps cleanly to a partner-side state machine. Free-text message fields are for humans; partners must never branch on them.
A trace id that ties a response back to a single row in the request log. That row is the source of truth in disputes.
Consistent key casing: no mixing of camelCase and snake_case. Pick one. We chose camelCase because most partner-side stacks today are TypeScript-ish, and partners care more about straightforward typing on their side than about server-side aesthetics.
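As a partner-side illustration of the first point, here is a minimal sketch of parsing the envelope and branching on code only. The Envelope class, the NEXT_ACTION table, and the action names are hypothetical; only the field names and the P0000 and P1102 codes come from this post.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Envelope:
    """Typed view of the response envelope shown above."""
    status: str
    code: str
    message: str
    data: dict[str, Any]
    trace: str


def parse_envelope(raw: str) -> Envelope:
    body = json.loads(raw)
    return Envelope(
        status=body["status"],
        code=body["code"],
        message=body["message"],
        data=body.get("data") or {},
        trace=body["trace"],
    )


# Hypothetical partner-side state machine: code -> next action.
NEXT_ACTION = {
    "P0000": "mark_paid",
    "P1102": "relock_and_retry",
}


def next_action(envelope: Envelope) -> str:
    # Branch on the code only. The message is for logs and humans,
    # and unknown codes are escalated instead of guessed at.
    return NEXT_ACTION.get(envelope.code, "escalate")
```

Keeping the mapping in one table means a new code from the server shows up as an explicit "escalate" rather than a silent fall-through.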

Authentication

The cleanest pattern we’ve shipped is a two-step token flow:

Partner POSTs API key + signed timestamp to /auth/token.
Server returns a short-lived (15 min) JWT.
Every subsequent call carries Authorization: Bearer <jwt>.

The signed-timestamp step matters: it limits replay if an API key, or a whole token request, leaks into a partner’s logs. The key alone cannot produce a valid signature, and a captured signature is only valid for a 60-second window.
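A minimal sketch of the signed-timestamp check, assuming HMAC-SHA256 over the API key and a Unix timestamp with a per-partner signing secret that is separate from the API key. The message layout and the function names are assumptions for illustration, not the wire format we shipped.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 60  # the 60-second validity window described above


def sign_timestamp(signing_secret: bytes, api_key: str, timestamp: int) -> str:
    # Assumed message layout: "<apiKey>.<unixTimestamp>".
    message = f"{api_key}.{timestamp}".encode()
    return hmac.new(signing_secret, message, hashlib.sha256).hexdigest()


def verify_signed_timestamp(
    signing_secret: bytes,
    api_key: str,
    timestamp: int,
    signature: str,
    now: Optional[int] = None,
) -> bool:
    now = int(time.time()) if now is None else now
    # Reject anything outside the window before doing any crypto.
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign_timestamp(signing_secret, api_key, timestamp)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Passing `now` explicitly keeps the window check testable without touching the clock.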

The JWT body looks like this:

{
  "iss": "mtb.payments",
  "sub": "partner_12",
  "aud": "bill-collection",
  "scope": ["read:bill", "post:payment"],
  "exp": 1714467600,
  "ipHash": "a7…"
}

scope lets us issue read-only tokens to partner ops teams without exposing the payment-posting endpoint. ipHash rejects a token used from outside the partner’s whitelisted IP range. The only secret that ever leaves our edge is the JWT, never the API key.
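The scope check can be sketched as below. It assumes the JWT signature has already been verified and the claims decoded into a dict; the authorize function is hypothetical. The ipHash comparison is left out because the hashing scheme is not described in this post.

```python
import time
from typing import Any, Optional


def authorize(
    claims: dict[str, Any],
    required_scope: str,
    audience: str,
    now: Optional[float] = None,
) -> bool:
    """Decide whether already-verified claims permit one call."""
    now = time.time() if now is None else now
    # Expired tokens are rejected regardless of scope.
    if claims.get("exp", 0) <= now:
        return False
    # The token must have been issued for this API.
    if claims.get("aud") != audience:
        return False
    # Each endpoint names the single scope it requires.
    return required_scope in claims.get("scope", [])
```

A read-only token issued to an ops team would carry only "read:bill", so a call that requires "post:payment" fails this check even though the token itself is valid.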

Two-phase payment commit

The hardest class of bug in payments is the “partner thinks it succeeded, biller doesn’t see it” bug. We solve it with an explicit two-phase commit:

Lock — partner calls POST /bill/lock to reserve the bill. Server returns a lockToken valid for 5 minutes. The biller’s downstream system now refuses any other payment on that bill until the lock expires.
Pay — partner calls POST /bill/pay with the lockToken and their transaction id. Server commits and returns a final transactionId.

The lock step is what makes the system safe under partner-side retries. Without it, a partner that retries because of a timeout on the pay step can double-debit the customer. With it, the second attempt fails closed with a specific code, P1102 (Lock Expired), and the partner knows to call /bill/lock again.
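The lock and pay steps can be sketched with an in-memory store. This is an illustration under assumptions, not our production code: the BillLocks class is hypothetical, lock tokens are treated as single-use, and a token that was already consumed is reported with the same P1102 as an expired one.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

LOCK_TTL_SECONDS = 5 * 60  # lockToken validity from the text


@dataclass
class Lock:
    bill_id: str
    expires_at: float


class BillLocks:
    def __init__(self) -> None:
        self._by_token: dict[str, Lock] = {}
        self._token_by_bill: dict[str, str] = {}
        self._paid: set[str] = set()

    def lock(self, bill_id: str, now: Optional[float] = None) -> Optional[str]:
        """Return a lockToken, or None if the bill is paid or already locked."""
        now = time.time() if now is None else now
        if bill_id in self._paid:
            return None
        held = self._token_by_bill.get(bill_id)
        if held is not None:
            if self._by_token[held].expires_at > now:
                return None  # a live lock exists; refuse a second one
            del self._by_token[held]  # drop the stale lock
        token = secrets.token_urlsafe(16)
        self._by_token[token] = Lock(bill_id, now + LOCK_TTL_SECONDS)
        self._token_by_bill[bill_id] = token
        return token

    def pay(self, lock_token: str, now: Optional[float] = None) -> str:
        """Commit the payment and return a response code."""
        now = time.time() if now is None else now
        lock = self._by_token.get(lock_token)
        if lock is None or lock.expires_at <= now:
            return "P1102"  # fail closed: unknown, consumed, or expired token
        # Consume the lock so a retry cannot commit a second time.
        del self._by_token[lock_token]
        del self._token_by_bill[lock.bill_id]
        self._paid.add(lock.bill_id)
        return "P0000"
```

A real implementation would need the lock and the commit to share one transactional store; the dictionaries here only show the state transitions.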

Idempotency keys

Every state-changing endpoint accepts an optional Idempotency-Key header. We store the (partner, key, response) tuple for 24 hours. A retry with the same key returns the original response byte-for-byte — including trace — so the partner sees a stable view.

The TTL is 24 hours and not “forever” because partners regenerate keys per day; storing them indefinitely is a slow leak of memory and a fast leak of PII.
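A minimal sketch of the replay behaviour, assuming an in-memory store keyed by (partner, key). The IdempotencyStore class and its run method are hypothetical names, and the sketch ignores concurrency and eviction of expired entries.

```python
import time
from typing import Callable, Optional

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # 24-hour retention from the text


class IdempotencyStore:
    def __init__(self) -> None:
        # (partner, key) -> (expires_at, stored response bytes)
        self._entries: dict[tuple[str, str], tuple[float, bytes]] = {}

    def run(
        self,
        partner: str,
        key: Optional[str],
        handler: Callable[[], bytes],
        now: Optional[float] = None,
    ) -> bytes:
        now = time.time() if now is None else now
        if key is None:
            # The header is optional; without it every call executes.
            return handler()
        entry = self._entries.get((partner, key))
        if entry is not None and entry[0] > now:
            # Replay the stored bytes unchanged, trace included.
            return entry[1]
        response = handler()
        self._entries[(partner, key)] = (now + IDEMPOTENCY_TTL_SECONDS, response)
        return response
```

Scoping the key by partner means two partners can pick the same key without seeing each other's responses.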

Observability

Three telemetry surfaces; every payment touches all three:

A canonical log line per request: trace, partner, endpoint, request hash, response code, total latency, and whether each downstream call hit cache or origin. This is the only thing dispute resolution looks at.
A metric: counter on (partner, endpoint, response_code). We alert on a 5-minute window where any partner’s error rate climbs above 2%.
A distributed trace, sampled at 1 in 100. The full call graph is too expensive to keep, but the 1% sample is where we find the slow-DB-query problems.
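The alert rule on the metric can be sketched as a pure function over one 5-minute window of counter values. The function name is hypothetical, and treating every code other than P0000 as an error is an assumption made for the example.

```python
from collections import Counter

ERROR_RATE_THRESHOLD = 0.02  # the 2% line from the text
OK_CODES = frozenset({"P0000"})  # assumption: everything else counts as an error


def partners_over_threshold(window_counts: Counter) -> list[str]:
    """window_counts maps (partner, endpoint, response_code) -> count."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for (partner, _endpoint, code), count in window_counts.items():
        totals[partner] += count
        if code not in OK_CODES:
            errors[partner] += count
    # Compare per-partner rates so one noisy partner cannot hide behind
    # healthy aggregate traffic.
    return sorted(
        partner
        for partner, total in totals.items()
        if total > 0 and errors[partner] / total > ERROR_RATE_THRESHOLD
    )
```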

Latency budget

What “fast enough” means depends on who’s reading. For a partner whose checkout flow is interactive, p95 under 800 ms is the line. For a batch biller pushing 50k rows overnight, p99 under 4 s is fine — they care about throughput, not latency.
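To make the budget concrete, here is a small sketch that checks a batch of latency samples against a percentile budget. It uses the nearest-rank percentile definition, which is an assumption; metrics backends often interpolate instead. Both function names are hypothetical.

```python
def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct * n / 100 avoids floating-point rank errors.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(samples_ms: list[float], pct: int, budget_ms: float) -> bool:
    return percentile(samples_ms, pct) < budget_ms
```

For the interactive case above the call would be `within_budget(samples, 95, 800)`; for the batch case, `within_budget(samples, 99, 4000)`.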

Endpoint p95 latency (ms)

Endpoint            p95 (ms)
POST /auth/token    95
GET /bill/lookup    210
POST /bill/lock     340
POST /bill/pay      620
GET /txn/status     140

Pay is the slow one because it fans out to the biller, the ledger, and the notification queue. Lookup is cached at the edge.

What I’d do differently

Ship a partner SDK on day 1. We let partners hand-roll clients for six months. Five of them got the HMAC signing wrong in five different ways. An SDK would have removed an entire class of support tickets.
Sign error responses too. We sign success responses; we don’t sign errors. A malicious middle-box could tell a partner “everything failed” to hide a successful payment from them. Unlikely in practice, but cheap to fix.
Make the trace id parsable. Ours is an opaque string. If it embedded the partner id and a timestamp as a prefix, on-call could read it without looking it up.
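As an illustration of the last point, a parsable trace id could look like the sketch below. The format (partner id, UTC timestamp, random suffix, joined by hyphens) and both function names are hypothetical; this is not the format we run today.

```python
import secrets
from datetime import datetime, timezone

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"


def make_trace_id(partner_id: str, when: datetime) -> str:
    # `when` must be timezone-aware; it is normalized to UTC.
    stamp = when.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    # The random suffix keeps ids unique within the same second.
    return f"{partner_id}-{stamp}-{secrets.token_hex(4)}"


def parse_trace_id(trace_id: str) -> tuple[str, datetime]:
    # Split from the right so partner ids may themselves contain hyphens.
    partner_id, stamp, _suffix = trace_id.rsplit("-", 2)
    when = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return partner_id, when
```

With this shape, on-call can read the partner and the second the request arrived straight off the id, and the request log lookup is only needed for the details.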

Payment APIs are unglamorous — most of the design is about making the boring case rock-solid so the interesting case (a real outage, a real dispute) is the only one anyone has to think about.