How I Built Payment APIs: Architecture, Authentication & Integration

Notes on production-grade payment REST APIs — token authentication, two-phase commit, bill locking, idempotency keys, and the dispute-resolution patterns that actually hold up in audit.

Payment APIs sit at the worst end of the engineering trade-off curve: they demand strict correctness, near-zero downtime, and a full audit trail, and they are also the APIs that financial-partner engineers will read line by line, on a Friday, before shipping a customer-facing flow that touches money. This post is a field-tested walk through the choices that hold up in that review.

The shape of a good payment endpoint

Every payment endpoint in our stack returns the same envelope:

{
  "status": "OK",
  "code": "P0000",
  "message": "Payment accepted",
  "data": {
    "transactionId": "TXN-2026-04-A7F3-9912",
    "amount": "1250.00",
    "currency": "BDT",
    "billingPeriod": "2026-04",
    "lockedAt": "2026-04-30T08:21:14Z"
  },
  "trace": "Z9-4d2a"
}

Three things matter here:

  • A short machine-readable code (P0000, P1001, …) that maps cleanly to a partner-side state machine. Free-text message fields are for humans; partners must never branch on them.
  • A trace id that ties a response back to a single row in the request log. That row is the source of truth in disputes.
  • No camelCase / snake_case mixing. Pick one. We chose camelCase because most partner-side stacks today are TypeScript-ish, and partners care more about fields that map directly onto their types than about server-side aesthetics.
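
To make the "branch on code, never on message" rule concrete, here is a minimal partner-side sketch in Python. The Envelope class, the NEXT_STATE table, and the state names are hypothetical; only P0000 and P1102 are described in this post, so every other code falls through to manual review.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Envelope:
    status: str
    code: str
    message: str
    data: dict[str, Any]
    trace: str


# Hypothetical partner-side state table. Only P0000 and P1102 are described
# in the post; the state names on the right are made up for illustration.
NEXT_STATE = {
    "P0000": "SETTLED",
    "P1102": "RELOCK",
}


def parse_envelope(raw: str) -> Envelope:
    """Parse the response envelope, failing loudly on a missing field."""
    body = json.loads(raw)
    return Envelope(
        status=body["status"],
        code=body["code"],
        message=body["message"],
        data=body.get("data", {}),
        trace=body["trace"],
    )


def next_state(env: Envelope) -> str:
    # Branch on the code only; the message is for humans and may be reworded.
    return NEXT_STATE.get(env.code, "MANUAL_REVIEW")
```

Because next_state never reads message, rewording "Payment accepted" on the server cannot change partner behaviour.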

Authentication

The cleanest pattern we’ve shipped is a short-lived token flow in three steps:

  1. Partner POSTs API key + signed timestamp to /auth/token.
  2. Server returns a short-lived (15 min) JWT.
  3. Every subsequent call carries Authorization: Bearer <jwt>.

The signed-timestamp step matters: an API key that leaks into a partner’s logs is not enough on its own to mint a token, and a captured request cannot be replayed later because the signature is only valid for a 60-second window.
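
A minimal sketch of the signed-timestamp check in Python. The post does not spell out the signing scheme, so this assumes HMAC-SHA256 over "<apiKey>.<timestamp>" with a separate per-partner signing secret; the function names and the message layout are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_SKEW_SECONDS = 60  # signature validity window described above


def sign_timestamp(signing_secret: bytes, api_key: str, timestamp: int) -> str:
    """Partner side: sign the API key and a Unix timestamp (assumed layout)."""
    message = f"{api_key}.{timestamp}".encode()
    return hmac.new(signing_secret, message, hashlib.sha256).hexdigest()


def verify_auth_request(
    signing_secret: bytes,
    api_key: str,
    timestamp: int,
    signature: str,
    now: Optional[int] = None,
) -> bool:
    """Server side: reject stale timestamps first, then check the signature."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign_timestamp(signing_secret, api_key, timestamp)
    # Constant-time comparison, so timing does not leak how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

A request captured from a log still verifies inside the window, so this narrows replay rather than eliminating it; a stricter variant would also remember recently seen signatures.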

The JWT body looks like this:

{
  "iss": "mtb.payments",
  "sub": "partner_12",
  "aud": "bill-collection",
  "scope": ["read:bill", "post:payment"],
  "exp": 1714467600,
  "ipHash": "a7…"
}

scope lets us issue read-only tokens to partner ops teams without exposing the payment posting endpoint. ipHash rejects a token used from outside the partner’s whitelisted IP range. The only secret that ever leaves our edge is the JWT, never the API key.
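
The claim checks can be sketched as a small Python function. It assumes the JWT signature has already been verified by a JWT library and that claims is the decoded body; authorize is a hypothetical name. The ipHash check is left out on purpose, because the post does not say how the hash is derived.

```python
import time
from typing import Any, Optional

EXPECTED_AUDIENCE = "bill-collection"  # the aud value from the example body


def authorize(
    claims: dict[str, Any],
    required_scope: str,
    now: Optional[int] = None,
) -> bool:
    """Return True only if the token is for this API, unexpired, and in scope."""
    now = int(time.time()) if now is None else now
    if claims.get("aud") != EXPECTED_AUDIENCE:
        return False
    if now >= claims.get("exp", 0):
        return False  # expired, or no exp claim at all
    return required_scope in claims.get("scope", [])
```

With this shape, a read-only token is simply one whose scope list lacks "post:payment"; the payment endpoint asks for that scope and the check fails closed.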

Two-phase payment commit

The hardest class of bug in payments is the “partner thinks it succeeded, biller doesn’t see it” bug. We solve it with an explicit two-phase commit:

  1. Lock — partner calls POST /bill/lock to reserve the bill. Server returns a lockToken valid for 5 minutes. The biller’s downstream system now refuses any other payment on that bill until the lock expires.
  2. Pay — partner calls POST /bill/pay with the lockToken and their transaction id. Server commits and returns a final transactionId.

The lock step is what makes the system safe under partner-side retries. Without it, a partner that retries after a timeout on the pay step can double-debit the customer. With it, a second attempt whose lock is no longer valid fails closed with a specific code, P1102 (Lock Expired), and the partner knows to call /bill/lock again.
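
A single-process Python sketch of the lock and pay steps, to show why a retried pay cannot debit twice. It is an in-memory toy, not a distributed commit protocol: the BillLocks class is hypothetical, and treating an already-used lockToken the same as an expired one (both return P1102) is an assumption.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

LOCK_TTL_SECONDS = 5 * 60  # lockToken lifetime described above


@dataclass
class Lock:
    bill_id: str
    partner: str
    expires_at: float


class BillLocks:
    def __init__(self) -> None:
        self._by_token: dict[str, Lock] = {}
        self._token_by_bill: dict[str, str] = {}
        self._paid: set[str] = set()

    def lock(self, bill_id: str, partner: str, now: Optional[float] = None) -> Optional[str]:
        """Reserve a bill; returns a lockToken, or None if it cannot be locked."""
        now = time.time() if now is None else now
        if bill_id in self._paid:
            return None
        held = self._token_by_bill.get(bill_id)
        if held is not None:
            if self._by_token[held].expires_at > now:
                return None  # another live lock exists on this bill
            self._by_token.pop(held, None)  # drop the expired lock
        token = secrets.token_urlsafe(16)
        self._by_token[token] = Lock(bill_id, partner, now + LOCK_TTL_SECONDS)
        self._token_by_bill[bill_id] = token
        return token

    def pay(self, lock_token: str, now: Optional[float] = None) -> str:
        """Commit against a lock; the token is consumed whatever the outcome."""
        now = time.time() if now is None else now
        lock = self._by_token.pop(lock_token, None)
        if lock is None:
            return "P1102"  # unknown or already used (assumed to share the code)
        if self._token_by_bill.get(lock.bill_id) == lock_token:
            del self._token_by_bill[lock.bill_id]
        if lock.expires_at <= now:
            return "P1102"
        self._paid.add(lock.bill_id)
        return "P0000"
```

The key property is that pay removes the token before doing anything else, so a retry with the same lockToken can only ever see P1102.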

Idempotency keys

Every state-changing endpoint accepts an optional Idempotency-Key header. We store the (partner, key, response) tuple for 24 hours. A retry with the same key returns the original response byte-for-byte — including trace — so the partner sees a stable view.

The TTL is 24 hours and not “forever” because partners regenerate keys per day; storing them indefinitely is a slow leak of memory and a fast leak of PII.
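
The replay behaviour can be sketched in Python as a small store keyed on (partner, key). IdempotencyStore and its run method are hypothetical names, and the in-memory dict stands in for whatever shared storage a real deployment would use.

```python
import time
from typing import Callable, Optional

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # 24-hour retention described above


class IdempotencyStore:
    def __init__(self) -> None:
        # (partner, key) -> (expiry time, stored response bytes)
        self._entries: dict[tuple[str, str], tuple[float, bytes]] = {}

    def run(
        self,
        partner: str,
        key: Optional[str],
        handler: Callable[[], bytes],
        now: Optional[float] = None,
    ) -> bytes:
        """Run handler once per (partner, key); replay the stored bytes on retry."""
        now = time.time() if now is None else now
        if key is None:
            return handler()  # the header is optional, so no key means no replay
        entry = self._entries.get((partner, key))
        if entry is not None and entry[0] > now:
            return entry[1]  # byte-for-byte the original response
        response = handler()
        self._entries[(partner, key)] = (now + IDEMPOTENCY_TTL_SECONDS, response)
        return response
```

This version is not safe under concurrent requests with the same key; a production store would need an atomic "insert if absent" so that two simultaneous retries cannot both reach the handler.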

Observability

Three telemetry surfaces; every payment touches all three:

  • A canonical log line per request: trace, partner, endpoint, request hash, response code, total latency, and whether each downstream call hit cache or origin. This is the only thing dispute-resolution looks at.
  • A metric: counter on (partner, endpoint, response_code). We alert on a 5-minute window where any partner’s error rate climbs above 2%.
  • A distributed trace, sampled at 1 in 100. The full call graph is too expensive to keep, but the 1% sample is where we find the slow-DB-query problems.
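
The alert rule on the metric can be sketched as a sliding window per partner. This is a Python illustration of the 5-minute, 2% rule only: PartnerErrorRate is a hypothetical class, and which response codes count as errors is left to the caller because the post does not define it.

```python
from collections import deque

WINDOW_SECONDS = 5 * 60       # alert window described above
ERROR_RATE_THRESHOLD = 0.02   # alert when the error rate climbs above 2%


class PartnerErrorRate:
    """Tracks one partner's recent requests and decides whether to alert."""

    def __init__(self) -> None:
        self._events: deque[tuple[float, bool]] = deque()

    def record(self, timestamp: float, is_error: bool) -> None:
        # Events are assumed to arrive in timestamp order.
        self._events.append((timestamp, is_error))

    def should_alert(self, now: float) -> bool:
        # Drop everything that has fallen out of the 5-minute window.
        while self._events and self._events[0][0] <= now - WINDOW_SECONDS:
            self._events.popleft()
        if not self._events:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / len(self._events) > ERROR_RATE_THRESHOLD
```

A real alert would usually add a minimum request count, so that a single failure from a low-traffic partner does not page anyone.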

Latency budget

What “fast enough” means depends on who’s reading. For a partner whose checkout flow is interactive, p95 under 800 ms is the line. For a batch biller pushing 50k rows overnight, p99 under 4 s is fine — they care about throughput, not latency.

Endpoint            p95 latency (ms)
POST /auth/token    95
GET /bill/lookup    210
POST /bill/lock     340
POST /bill/pay      620
GET /txn/status     140

Pay is the slow one because it fans out to the biller + ledger + notification queue. Lookup is cached at the edge.
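
For readers who want to check their own numbers against a budget, here is a small Python sketch of a percentile check. It uses the nearest-rank definition of a percentile, which is one common convention and not necessarily the one behind the figures above; the function names are hypothetical.

```python
import math


def percentile(samples_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms: list[float], pct: int, budget_ms: float) -> bool:
    """True if the chosen percentile is strictly under the budget."""
    return percentile(samples_ms, pct) < budget_ms
```

For the interactive case that would be within_budget(samples, 95, 800), and for the overnight batch case within_budget(samples, 99, 4000).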

What I’d do differently

  • Ship a partner SDK on day 1. We let partners hand-roll clients for six months. Five of them got the HMAC signing wrong in five different ways. An SDK would have removed an entire class of support tickets.
  • Sign error responses too. We sign success responses; we don’t sign errors. A malicious middle-box could tell a partner “everything failed” to hide a successful payment from them. Unlikely in practice, but cheap to fix.
  • Make the trace id parsable. Ours is an opaque string. If it carried the partner id and a timestamp as a prefix, on-call could read it without looking it up.
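
One possible shape for a parsable trace id, sketched in Python. The "<partner>.<UTC timestamp>.<random suffix>" layout is purely an assumption for illustration, as are the function names; it also assumes partner ids never contain a dot.

```python
import secrets
from datetime import datetime, timezone

STAMP_FORMAT = "%Y%m%dT%H%M%SZ"  # compact UTC timestamp, assumed layout


def make_trace_id(partner: str, at: datetime) -> str:
    """Build '<partner>.<UTC timestamp>.<random hex>' from a timezone-aware datetime."""
    stamp = at.astimezone(timezone.utc).strftime(STAMP_FORMAT)
    return f"{partner}.{stamp}.{secrets.token_hex(4)}"


def parse_trace_id(trace_id: str) -> tuple[str, datetime]:
    """Recover the partner id and request time without touching the request log."""
    partner, stamp, _suffix = trace_id.split(".")
    at = datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
    return partner, at
```

The random suffix keeps ids unique within the same second; the readable prefix is only a convenience, and the request log row stays the source of truth.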

Payment APIs are unglamorous — most of the design is about making the boring case rock-solid so the interesting case (a real outage, a real dispute) is the only one anyone has to think about.