Building a Dispute Automation System in .NET 9

A bill dispute sounds prosaic until you trace one. A customer claims they paid; our ledger says they didn’t; the biller says we did pay them; the partner says the customer never asked. Resolving that takes a person 20 minutes today; the goal of this system was to take it to under 2 minutes across the cases that don’t actually need a human.

This is the architecture we shipped on .NET 9, the patterns we picked, and the parts where I’d choose differently next time.

The shape of the API

POST /api/v1/disputes                  → create
GET  /api/v1/disputes/{id}             → fetch one
GET  /api/v1/disputes?status=open      → query
POST /api/v1/disputes/{id}/resolve     → close (or escalate)
POST /api/v1/disputes/{id}/refund      → ledger refund

REST is enough. We considered gRPC for the partner-side fan-out; we rejected it because the partners we onboard most often (mid-tier banks) have HTTP/1.1 stacks and no appetite for a new transport.

Layering

A boring five-layer cake, with one twist (the service invoker):

Controllers          → HTTP surface, validation, response shaping
↓
Services             → orchestration, business rules
↓ (via ServiceInvoker)
External Connectors  → biller APIs, ledger system, notification queue
↓
Repositories         → data access (Dapper over SQL Server)
↓
Domain models        → POCOs, value objects

Controllers are kept dumb. They:

Validate the request (FluentValidation).
Resolve the right service.
Let exceptions propagate to the global exception filter instead of catching them in each action.
Map domain results to the canonical response envelope.

They do not know about repositories, connectors, or transactions.
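
The last controller step, mapping a domain result to the canonical envelope, might look like the sketch below. Both the envelope's fields and the shape of Result<T> are assumptions made for illustration; the real definitions are not shown in this post.

```csharp
#nullable enable
using System;

// Assumed shape of Result<T>; only Ok(...) and Fail(...) appear in the post.
public sealed record Result<T>(bool IsOk, T? Value, string? Error)
{
    public static Result<T> Ok(T value) => new(true, value, null);
    public static Result<T> Fail(string error) => new(false, default, error);
}

// Hypothetical canonical envelope: every response carries the same four fields.
public sealed record Envelope<T>(bool Success, T? Data, string? Error, string TraceId);

public static class EnvelopeMapper
{
    // The only translation a controller performs: domain result in, envelope out.
    public static Envelope<T> ToEnvelope<T>(this Result<T> result, string traceId) =>
        result.IsOk
            ? new Envelope<T>(true, result.Value, null, traceId)
            : new Envelope<T>(false, default, result.Error, traceId);
}
```

A controller action would then end with something like `return Ok(result.ToEnvelope(traceId));`, with no branching on business outcomes.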

The Service Invoker

The non-obvious pattern in this design is the service invoker. Every external call goes through it:

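using System.Diagnostics;
using System.Net.Http;
using System.Net.Http.Json;
using Microsoft.Extensions.Logging;
using Microsoft.Extensions.Options;

public sealed class ServiceInvoker : IServiceInvoker
{
    private readonly ILogger<ServiceInvoker> _log;
    private readonly IHttpClientFactory _http;
    private readonly IOptions<DownstreamOptions> _opts;

    // Dependencies arrive through the container; without this the fields are never set.
    public ServiceInvoker(
        ILogger<ServiceInvoker> log,
        IHttpClientFactory http,
        IOptions<DownstreamOptions> opts)
    {
        _log = log;
        _http = http;
        _opts = opts;
    }

    public async Task<Result<TResp>> Invoke<TReq, TResp>(
        string endpointName,
        TReq payload,
        CancellationToken ct = default)
    {
        var ep = _opts.Value.Endpoints[endpointName];
        using var client = _http.CreateClient(ep.HttpClientName);
        var traceId = Activity.Current?.TraceId.ToString() ?? Guid.NewGuid().ToString("N");

        _log.LogInformation("Invoke {Endpoint} trace={Trace} payload={Payload}",
            endpointName, traceId, payload);

        var sw = Stopwatch.StartNew();
        try
        {
            using var resp = await client.PostAsJsonAsync(ep.Path, payload, ct);
            _log.LogInformation("Invoke {Endpoint} status={Status} ms={Ms}",
                endpointName, (int)resp.StatusCode, sw.ElapsedMilliseconds);

            // Check the status before deserializing: an error body is rarely shaped
            // like TResp, and a JSON failure would hide the real status code.
            if (!resp.IsSuccessStatusCode)
                return Result<TResp>.Fail($"{endpointName} returned {(int)resp.StatusCode}");

            var body = await resp.Content.ReadFromJsonAsync<TResp>(cancellationToken: ct);
            return body is null
                ? Result<TResp>.Fail($"{endpointName} returned an empty body")
                : Result<TResp>.Ok(body);
        }
        catch (Exception ex)
        {
            _log.LogError(ex, "Invoke {Endpoint} threw after {Ms}ms",
                endpointName, sw.ElapsedMilliseconds);
            return Result<TResp>.Fail(ex.Message);
        }
    }
}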

Three benefits:

One place to log every downstream call. The auditors love this.
One place to configure timeouts, retries and headers per partner. No more sprinkling Polly policies across the codebase.
Easy unit tests. Services depend on IServiceInvoker, which is trivial to mock.
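
To make the third benefit concrete, here is a minimal sketch of a hand-rolled fake. The IServiceInvoker signature is taken from the Invoke method above; RefundService, the request and response records, and the shape of Result<T> are hypothetical and exist only for this example.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Assumed shape of Result<T>; only Ok(...) and Fail(...) appear in the post.
public sealed record Result<T>(bool IsOk, T? Value, string? Error)
{
    public static Result<T> Ok(T value) => new(true, value, null);
    public static Result<T> Fail(string error) => new(false, default, error);
}

public interface IServiceInvoker
{
    Task<Result<TResp>> Invoke<TReq, TResp>(
        string endpointName, TReq payload, CancellationToken ct = default);
}

// Hypothetical DTOs and service, for illustration only.
public sealed record RefundRequest(Guid DisputeId, decimal Amount);
public sealed record RefundResponse(string LedgerReference);

public sealed class RefundService
{
    private readonly IServiceInvoker _invoker;
    public RefundService(IServiceInvoker invoker) => _invoker = invoker;

    public async Task<Result<string>> RefundAsync(
        Guid disputeId, decimal amount, CancellationToken ct = default)
    {
        // Business rule checked before any downstream call is made.
        if (amount <= 0) return Result<string>.Fail("amount must be positive");

        var resp = await _invoker.Invoke<RefundRequest, RefundResponse>(
            "bill-refund", new RefundRequest(disputeId, amount), ct);

        return resp.IsOk
            ? Result<string>.Ok(resp.Value!.LedgerReference)
            : Result<string>.Fail(resp.Error!);
    }
}

// Fake invoker: returns one canned result and records which endpoints were called.
public sealed class FakeInvoker : IServiceInvoker
{
    private readonly object _canned;
    public List<string> Calls { get; } = new();

    public FakeInvoker(object cannedResult) => _canned = cannedResult;

    public Task<Result<TResp>> Invoke<TReq, TResp>(
        string endpointName, TReq payload, CancellationToken ct = default)
    {
        Calls.Add(endpointName);
        // Throws InvalidCastException if the test canned the wrong result type.
        return Task.FromResult((Result<TResp>)_canned);
    }
}
```

A test constructs RefundService with a FakeInvoker, then asserts on both the returned result and the Calls list, for example that a zero-amount refund never reaches the invoker.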

Repository pattern, lightly

We use the repository pattern, but only as a seam, not as a deep abstraction. Repositories are thin Dapper wrappers; we don’t try to make them database-agnostic. The goal is testability, not portability.

public interface IDisputeRepository
{
    Task<Dispute?> GetAsync(Guid id, CancellationToken ct);
    Task<Guid>    InsertAsync(Dispute d, CancellationToken ct);
    Task<int>     UpdateStatusAsync(Guid id, DisputeStatus s, CancellationToken ct);
    Task<IReadOnlyList<Dispute>> QueryAsync(DisputeQuery q, CancellationToken ct);
}

The one rule we enforce: no SQL outside repositories. If a service needs data, it asks a repository. If the query is too specific to fit a generic repository method, we add a new method on the repository — we do not let callers pass arbitrary SQL fragments. This rule, more than any other, has kept the project’s blast radius small.
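
Because the repository is only a seam, service tests can swap in an in-memory implementation. The sketch below assumes minimal shapes for Dispute, DisputeStatus and DisputeQuery (the real types are not shown here) and assumes UpdateStatusAsync returns the number of rows affected.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical domain types; the fields are assumptions for this example.
public enum DisputeStatus { Open, Resolved, Escalated }
public sealed record Dispute(Guid Id, string PartnerId, DisputeStatus Status);
public sealed record DisputeQuery(DisputeStatus? Status);

public interface IDisputeRepository
{
    Task<Dispute?> GetAsync(Guid id, CancellationToken ct);
    Task<Guid> InsertAsync(Dispute d, CancellationToken ct);
    Task<int> UpdateStatusAsync(Guid id, DisputeStatus s, CancellationToken ct);
    Task<IReadOnlyList<Dispute>> QueryAsync(DisputeQuery q, CancellationToken ct);
}

// In-memory stand-in for the Dapper-backed repository, intended for tests only.
public sealed class InMemoryDisputeRepository : IDisputeRepository
{
    private readonly ConcurrentDictionary<Guid, Dispute> _rows = new();

    public Task<Dispute?> GetAsync(Guid id, CancellationToken ct) =>
        Task.FromResult<Dispute?>(_rows.TryGetValue(id, out var d) ? d : null);

    public Task<Guid> InsertAsync(Dispute d, CancellationToken ct)
    {
        _rows[d.Id] = d;
        return Task.FromResult(d.Id);
    }

    // Returns 1 when the dispute exists and was updated, 0 otherwise.
    public Task<int> UpdateStatusAsync(Guid id, DisputeStatus s, CancellationToken ct)
    {
        if (!_rows.TryGetValue(id, out var existing)) return Task.FromResult(0);
        _rows[id] = existing with { Status = s };
        return Task.FromResult(1);
    }

    // A null status in the query means "no filter".
    public Task<IReadOnlyList<Dispute>> QueryAsync(DisputeQuery q, CancellationToken ct)
    {
        IReadOnlyList<Dispute> hits = _rows.Values
            .Where(d => q.Status is null || d.Status == q.Status)
            .ToList();
        return Task.FromResult(hits);
    }
}
```

The fake keeps the same narrow surface as the SQL version, so a service under test cannot tell the difference and never sees a query string.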

Two-level logging

Every request gets two log surfaces:

Audit log — to SQL, indexed on (partner, customer, dispute_id). One row per state transition. Compliance reads this.
Application log — to a JSON file, shipped to Loki. Engineers read this.

Both share a traceId. The audit log is intentionally low-cardinality; the app log is intentionally high-cardinality. Mixing the two is the fastest way to ruin both.

Configuration

appsettings.json holds non-secret toggles; secrets live in Azure Key Vault, mounted at startup. Per-environment overrides are appsettings.Production.json and an IConfigurationSource that pulls partner-specific endpoints from a SQL table at boot.

The partner-endpoint table looks like this:

partner_id  endpoint_name  url                            timeout_ms  retry_count
BANK_A      bill-lookup    https://core-a/billing/lookup  1500        2
BANK_A      bill-refund    https://core-a/billing/refund  4000        0
BANK_B      bill-lookup    https://core-b/api/v2/lookup   2000        1

Onboarding a new partner means adding rows here. No deploy.
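
One way to consume these rows in code is a small registry keyed by partner and endpoint name. This is a sketch under assumptions: the PartnerEndpoint record mirrors the columns above, while PartnerEndpointRegistry and the reading of retry_count as "retries in addition to the first attempt" are illustrative choices, not the shipped implementation.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// One row of partner_endpoints, with the columns shown in the table above.
public sealed record PartnerEndpoint(
    string PartnerId, string EndpointName, Uri Url, int TimeoutMs, int RetryCount)
{
    public TimeSpan Timeout => TimeSpan.FromMilliseconds(TimeoutMs);

    // Assumption: retry_count = 0 still means one attempt.
    public int MaxAttempts => RetryCount + 1;
}

// Hypothetical registry built once at boot from the SQL rows.
public sealed class PartnerEndpointRegistry
{
    private readonly Dictionary<(string, string), PartnerEndpoint> _byKey;

    // ToDictionary throws on duplicate (partner, endpoint) pairs, which surfaces
    // a misconfigured table at startup rather than at request time.
    public PartnerEndpointRegistry(IEnumerable<PartnerEndpoint> rows) =>
        _byKey = rows.ToDictionary(r => (r.PartnerId, r.EndpointName));

    public PartnerEndpoint Resolve(string partnerId, string endpointName) =>
        _byKey.TryGetValue((partnerId, endpointName), out var ep)
            ? ep
            : throw new KeyNotFoundException(
                $"No endpoint '{endpointName}' configured for partner '{partnerId}'.");
}
```

Under this reading, BANK_A's bill-refund row (retry_count 0) resolves to exactly one attempt with a four-second timeout, and asking for an endpoint a partner has not configured fails loudly.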

Onboarding a new client (the punchlist)

This is the runbook ops follows. Engineering’s part is small: a row in partners, the rows in partner_endpoints, and a client certificate.

Ops signs the Technical Schedule with the partner.
Engineering adds the partner to the partners table (id, name, public_key).
Engineering adds endpoint rows to partner_endpoints.
Engineering issues an mTLS client certificate via Key Vault.
Ops runs the /healthz/partner/{partner_id} smoke probe.
Ops flips the partner to active = true.

The smoke probe is critical: it exercises every endpoint with a synthetic zero-amount transaction. If any endpoint fails, the partner stays inactive until it doesn’t.
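
The gating logic of the probe can be sketched as follows. The probe delegate, which would send the synthetic zero-amount transaction to one endpoint, is a hypothetical stand-in; so are the ProbeResult and PartnerSmokeProbe names.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Outcome of probing one endpoint.
public sealed record ProbeResult(string EndpointName, bool Passed, string? Error);

public static class PartnerSmokeProbe
{
    // 'probe' is a hypothetical delegate: (endpoint name, amount, token) -> success.
    public static async Task<IReadOnlyList<ProbeResult>> RunAsync(
        IEnumerable<string> endpointNames,
        Func<string, decimal, CancellationToken, Task<bool>> probe,
        CancellationToken ct = default)
    {
        var results = new List<ProbeResult>();
        foreach (var name in endpointNames)
        {
            try
            {
                // Always a zero amount, so the probe never moves money.
                var ok = await probe(name, 0m, ct);
                results.Add(new ProbeResult(name, ok, ok ? null : "probe returned failure"));
            }
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                // A throwing endpoint counts as failed; keep checking the rest
                // so ops sees every broken endpoint in one run.
                results.Add(new ProbeResult(name, false, ex.Message));
            }
        }
        return results;
    }

    // The partner may be flipped to active only when every endpoint passed.
    public static bool CanActivate(IReadOnlyList<ProbeResult> results) =>
        results.Count > 0 && results.All(r => r.Passed);
}
```

Requiring at least one result means a partner with no endpoint rows cannot be activated by accident.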

Throughput

Pre-rewrite (a PHP monolith), the system handled ~40 disputes per minute under sustained load before queue depth blew up. The .NET 9 rewrite hit the following profile on a single 4-core pod:

[Figure: Disputes processed per minute (single pod), with bars for PHP (old), .NET 5, .NET 7, and .NET 9 (current). Values recovered from the chart: 145, 198, 247.]

We attribute the bulk of the .NET 9 win to the runtime’s async I/O improvements; our workload is almost entirely network-bound on the biller calls.

What I’d change

Source-generate the canonical envelope. We hand-wrote it. A Roslyn source generator that produces the response classes from a single schema would have saved a hundred small edits.
Move audit writes off the request path. We block the response on the audit log write. It’s fast, but a queue with redelivery on failure would be more honest about the durability requirement.
Adopt OpenTelemetry from the start. We bolted it on six months in. The instrumentation density is uneven, and we keep finding code paths that silently break the trace.
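
The second item, moving audit writes off the request path, could start from something like the sketch below. It uses an in-process System.Threading.Channels queue purely to show the enqueue-then-drain shape with redelivery; an in-memory channel loses entries on a crash, so a real version would need a durable queue. AuditEntry, AuditQueue and the write delegate are hypothetical names.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// One audit row per state transition.
public sealed record AuditEntry(string TraceId, Guid DisputeId, string FromState, string ToState);

public sealed class AuditQueue
{
    private readonly Channel<AuditEntry> _channel = Channel.CreateUnbounded<AuditEntry>();
    private readonly Func<AuditEntry, CancellationToken, Task> _write;
    private readonly int _maxAttempts;

    // 'write' is a hypothetical delegate that performs the actual SQL insert.
    public AuditQueue(Func<AuditEntry, CancellationToken, Task> write, int maxAttempts = 3)
    {
        _write = write;
        _maxAttempts = maxAttempts;
    }

    // Called on the request path: enqueue and return immediately.
    public bool Enqueue(AuditEntry entry) => _channel.Writer.TryWrite(entry);

    // Signals that no more entries will arrive so the drain loop can finish.
    public void Complete() => _channel.Writer.Complete();

    // Runs in the background: writes entries in order, retrying failed writes.
    // After the last attempt the exception propagates and stops the loop.
    public async Task DrainAsync(CancellationToken ct = default)
    {
        await foreach (var entry in _channel.Reader.ReadAllAsync(ct))
        {
            for (var attempt = 1; ; attempt++)
            {
                try
                {
                    await _write(entry, ct);
                    break;
                }
                catch (Exception ex)
                    when (ex is not OperationCanceledException && attempt < _maxAttempts)
                {
                    // Simple linear backoff before redelivery.
                    await Task.Delay(TimeSpan.FromMilliseconds(50 * attempt), ct);
                }
            }
        }
    }
}
```

The request handler calls Enqueue and responds; a hosted background task runs DrainAsync. What happens to an entry that exhausts its attempts (dead-letter table, alert, process exit) is the real design decision and is deliberately left open here.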

The biggest lesson is unsexy: the patterns that paid off (ServiceInvoker, thin repositories, the partner-endpoint table) are all about single points of variation. Anywhere a partner could be different from another partner — timeouts, headers, signing — we forced into a single, declarative table or a single class. That’s what makes onboarding a new partner cheap; it’s the only metric the business cares about.