🧩 Building a Modern Payment Gateway with a Rebill Scheduler - Chapter 5: Monitoring & Scalability
Why you might want to build your own payment gateway — and how to do it safely with an external PCI-compliant vault.
🩺 Why Monitoring Matters
A payment gateway doesn’t fail loudly — it fails silently:
timeouts, webhook delays, duplicate retries, subtle reconciliation gaps.
Without strong observability, you only find out when accounting calls.
That’s why monitoring, tracing, and auditing aren’t optional features —
they’re your operational lifeline.
🔍 Three Pillars of Observability
| Pillar | Tooling | Purpose |
|---|---|---|
| Logs | Structured (JSON) logs via Zerolog or Zap | Record events and errors with correlation IDs |
| Metrics | Prometheus / OpenTelemetry Metrics | Quantify throughput, latency, failures |
| Traces | OpenTelemetry Tracing + Jaeger | Follow a payment across microservices |
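Every log line and trace span for a payment should carry the same correlation ID, for example the payment's IntentID or SubscriptionItemID. Keep such IDs out of metric labels, where every distinct value creates a new time series; link metrics to traces through exemplars instead.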
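A minimal sketch of carrying a correlation ID through a request and stamping it on every log line. It uses the standard library's log/slog (Go 1.21+) rather than Zerolog or Zap so that it stays dependency-free; the helper names (NewRequestContext, WithCorrelationID, LoggerFor) and the correlation_id field name are assumptions made for illustration.

```go
package main

import (
	"context"
	"io"
	"log/slog"
)

// ctxKey is a private type so the context key cannot collide with other packages.
type ctxKey struct{}

// WithCorrelationID returns a child context carrying the given ID
// (for example a payment IntentID).
func WithCorrelationID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// NewRequestContext starts a root context for one request.
// In an HTTP handler you would derive from r.Context() instead.
func NewRequestContext(id string) context.Context {
	return WithCorrelationID(context.Background(), id)
}

// CorrelationID extracts the ID, or returns "" when none was set.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// LoggerFor returns a JSON logger that adds the correlation ID found in ctx
// to every record it writes.
func LoggerFor(ctx context.Context, w io.Writer) *slog.Logger {
	base := slog.New(slog.NewJSONHandler(w, nil))
	return base.With("correlation_id", CorrelationID(ctx))
}
```

A handler could call LoggerFor(ctx, os.Stdout).Info("intent created") and set the same ID as a span attribute, so a log search and a trace search land on the same payment.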
📊 Metrics to Watch
Define SLOs (Service Level Objectives) around reliability and latency.
Here are key metrics you should expose:
| Metric | Description | Example target |
|---|---|---|
| gateway_requests_total | Total API requests | n/a |
| gateway_request_duration_seconds | Latency histogram | p95 < 400 ms |
| payments_inflight_total | Ongoing intents not finalized | < 100 |
| rebill_jobs_failed_total | Failed rebill attempts | 0 critical |
| webhook_lag_seconds | Delay between PSP event and processing | < 30 s |
| psp_call_errors_total | Provider API errors | < 0.1% of calls |
| reconciliation_mismatches_total | Discrepancies per day | 0 critical |
Expose them on /metrics for Prometheus scraping or via OTLP to your telemetry backend.
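To make the scrape format concrete, here is a stdlib-only sketch of a counter served in the Prometheus text exposition format. A real service would more likely use the official Prometheus client library or an OpenTelemetry exporter; the Counter type and MetricsHandler function are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// Counter is a monotonically increasing value that is safe for concurrent use.
type Counter struct {
	name, help string
	value      atomic.Uint64
}

// NewCounter creates a counter with a metric name and a help string.
func NewCounter(name, help string) *Counter {
	return &Counter{name: name, help: help}
}

// Inc adds one to the counter.
func (c *Counter) Inc() { c.value.Add(1) }

// Render returns the counter in the Prometheus text exposition format.
func (c *Counter) Render() string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s counter\n%s %d\n",
		c.name, c.help, c.name, c.name, c.value.Load())
}

// MetricsHandler serves all given counters from a single endpoint.
func MetricsHandler(counters ...*Counter) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		for _, c := range counters {
			fmt.Fprint(w, c.Render())
		}
	})
}
```

Mounting it is one line, http.Handle("/metrics", MetricsHandler(requests)), with Inc called from the request path.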
🧠 Tracing Example (OpenTelemetry)
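// Assumes imports of go.opentelemetry.io/otel/attribute and go.opentelemetry.io/otel/codes.
ctx, span := tracer.Start(ctx, "CreateIntent")
defer span.End()

// Attributes known before the call go first, so failed spans carry them too.
span.SetAttributes(
    attribute.String("psp", req.PSP),
    attribute.String("tenant", req.TenantID),
)

resp, err := psp.CreateIntent(ctx, req)
if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, "psp failure")
} else {
    // resp may be nil on error, so read resp.Intent.ID only on success.
    span.SetAttributes(attribute.String("intent.id", resp.Intent.ID))
}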
In Jaeger or Grafana Tempo, this gives a full cross-service timeline:
API → Vault → Gateway → PSP → Webhook.
🧰 Logs and Audits
Structured logging is key:
log.Info().
    Str("intent", intent.ID).
    Str("psp", intent.PSP).
    Str("status", intent.Status).
    Msg("Payment captured successfully")
Each Payment, Refund, SubscriptionItem, and RebillJob should produce an audit entry:
- Timestamp
- Actor (API / Worker / Webhook)
- Action (authorize, capture, rebill, reconcile)
- Result and details
Store audits in a dedicated immutable table (payment_audit, rebill_audit) and expose them through a back-office dashboard for finance teams.
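One possible shape for an audit entry, with an in-memory append-only log standing in for the payment_audit table. The field names and the AuditLog type are assumptions; in a real deployment, immutability would be enforced in the database (for example by granting INSERT only), not in Go.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// AuditEntry mirrors the fields listed above. Field names are illustrative.
type AuditEntry struct {
	Timestamp time.Time
	Actor     string // "api", "worker", or "webhook"
	Action    string // "authorize", "capture", "rebill", "reconcile"
	EntityID  string // payment, refund, subscription item, or rebill job ID
	Result    string
	Details   string
}

// ErrInvalidEntry is returned when a required field is missing.
var ErrInvalidEntry = errors.New("audit: actor, action and entity ID are required")

// AuditLog is an in-memory stand-in for an append-only table.
type AuditLog struct {
	mu      sync.Mutex
	entries []AuditEntry
}

// Append validates and stores an entry. There is deliberately no Update or Delete.
func (l *AuditLog) Append(e AuditEntry) error {
	if e.Actor == "" || e.Action == "" || e.EntityID == "" {
		return ErrInvalidEntry
	}
	if e.Timestamp.IsZero() {
		e.Timestamp = time.Now().UTC()
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, e)
	return nil
}

// Entries returns a copy so callers cannot mutate stored history.
func (l *AuditLog) Entries() []AuditEntry {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]AuditEntry(nil), l.entries...)
}
```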
🧩 Distributed Scaling
Your gateway must scale horizontally without sacrificing consistency.
Stateless API nodes
- Use Redis or a DB for locks / idempotency.
- Load-balance requests with sticky sessions disabled.
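The idempotency check behind a stateless node can be sketched as a "set if not exists, with expiry" operation. The version below keeps keys in process memory purely for illustration; across several API nodes the same contract would have to be backed by a shared store such as Redis or a database unique constraint. IdempotencyStore and Acquire are hypothetical names.

```go
package main

import (
	"sync"
	"time"
)

// IdempotencyStore is an in-memory stand-in for a shared store such as Redis.
type IdempotencyStore struct {
	mu   sync.Mutex
	keys map[string]time.Time // idempotency key -> expiry time
}

// NewIdempotencyStore returns an empty store.
func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{keys: make(map[string]time.Time)}
}

// Acquire reports whether the caller is the first to claim key within ttl.
// A key that has expired can be claimed again.
func (s *IdempotencyStore) Acquire(key string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	if exp, ok := s.keys[key]; ok && now.Before(exp) {
		return false // another request already holds this key
	}
	s.keys[key] = now.Add(ttl)
	return true
}
```

A handler would call Acquire with the client's idempotency key before touching the PSP and return the stored result of the first request when Acquire reports false.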
Worker autoscaling
- Each worker consumes from queues (Kafka, SQS, RabbitMQ).
- Scale by queue depth or CPU usage.
- Use consumer-group semantics so each message goes to one worker at a time; delivery is still at-least-once, so keep handlers idempotent.
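Scaling by queue depth usually reduces to a small calculation that an autoscaler evaluates periodically. The sketch below assumes a target backlog per worker and fixed lower and upper bounds; DesiredWorkers and its parameters are illustrative names, not part of any autoscaler's API.

```go
package main

// DesiredWorkers returns how many workers are needed so that each one
// handles at most jobsPerWorker queued jobs, clamped to [minWorkers, maxWorkers].
func DesiredWorkers(queueDepth, jobsPerWorker, minWorkers, maxWorkers int) int {
	if jobsPerWorker <= 0 {
		return minWorkers // invalid target: fall back to the floor
	}
	n := (queueDepth + jobsPerWorker - 1) / jobsPerWorker // ceiling division
	if n < minWorkers {
		return minWorkers
	}
	if n > maxWorkers {
		return maxWorkers
	}
	return n
}
```

For example, a backlog of 501 jobs with a target of 50 per worker yields 11 workers, while an empty queue stays at the configured minimum.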
Database connections
- Use connection pooling (pgxpool / PgBouncer).
- Avoid holding a transaction open during PSP calls.
- Employ the outbox pattern to publish events safely after commits.
🧱 Fault Tolerance Patterns
| Pattern | Purpose | Example |
|---|---|---|
| Circuit Breaker | Prevent flooding failing PSPs | Trip after N failures |
| Retry with Backoff | Smooth transient errors | Exponential + jitter |
| Outbox / Inbox | At-least-once delivery (outbox) with deduplicated, effectively-once processing (inbox) | Rebill → Webhook |
| Saga Pattern | Compensate failed multi-step ops | Void the authorization after a capture failure; refund if a later step fails |
| Dead-Letter Queue | Persist unrecoverable jobs | Manual review |
Combine these patterns to achieve eventual consistency with auditability.
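Two of these patterns are small enough to sketch with the standard library: exponential backoff with full jitter, and a consecutive-failure circuit breaker. This is a simplified illustration under stated assumptions, not a production implementation: after the cooldown the breaker lets every caller through rather than a single trial request, and the names Backoff, Breaker, and ErrCircuitOpen are hypothetical.

```go
package main

import (
	"errors"
	"math/rand"
	"sync"
	"time"
)

// Backoff returns the delay before retry number attempt (0-based): the base
// doubles per attempt, is capped at max, and full jitter picks a random value
// in [0, capped delay].
func Backoff(attempt int, base, max time.Duration) time.Duration {
	if base <= 0 {
		return 0
	}
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return time.Duration(rand.Int63n(int64(d) + 1))
}

// ErrCircuitOpen is returned when a call is rejected without being attempted.
var ErrCircuitOpen = errors.New("circuit breaker is open")

// Breaker trips after threshold consecutive failures and rejects calls
// until cooldown has elapsed.
type Breaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openedAt  time.Time
}

// NewBreaker configures the failure threshold and the cooldown period.
func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. After the cooldown, calls pass
// again: a success closes the breaker, another failure re-opens it.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.threshold && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrCircuitOpen
	}

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}
```

In a retry loop the two combine naturally: wrap the PSP call in Call, sleep for Backoff(attempt, base, max) between attempts, and stop retrying as soon as ErrCircuitOpen comes back.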
🔐 Secrets and Key Management
Treat credentials like radioactive material.
- Store PSP API keys and Vault tokens in a secret manager (AWS Secrets Manager, GCP Secret Manager), encrypted with KMS-managed keys.
- Use short-lived credentials when possible.
- Rotate keys automatically every 90 days.
- Separate production, staging, and sandbox environments entirely.
- Log only the last 4 characters of any key when debugging.
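The last rule is easy to enforce with a small helper that every log call goes through. A sketch, assuming that secrets of eight characters or fewer should be masked completely so that short values do not reveal most of their content; MaskSecret is a hypothetical name.

```go
package main

import "strings"

// MaskSecret returns a log-safe form of a credential: every character except
// the last four is replaced with '*'. Secrets of eight characters or fewer
// are masked entirely.
func MaskSecret(s string) string {
	const visible = 4
	r := []rune(s)
	if len(r) <= 2*visible {
		return strings.Repeat("*", len(r))
	}
	return strings.Repeat("*", len(r)-visible) + string(r[len(r)-visible:])
}
```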
🧮 Performance and Cost Optimization
- Batch operations for reconciliation and webhook acknowledgments.
- Use connection reuse with http.Transport and keep-alive.
- Cache static vault mappings or PSP configuration in Redis.
- Enable read replicas for analytics queries.
- Compress large audit payloads before storage.
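Connection reuse mostly comes down to creating one client with a tuned http.Transport and sharing it across requests. The values below are illustrative starting points rather than tuned recommendations, and NewPSPClient is a hypothetical constructor name.

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// NewPSPClient returns an HTTP client whose transport keeps idle connections
// open, so repeated PSP calls reuse TCP/TLS sessions instead of reconnecting.
func NewPSPClient() *http.Client {
	transport := &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   5 * time.Second,
			KeepAlive: 30 * time.Second, // interval for TCP keep-alive probes
		}).DialContext,
		MaxIdleConns:          100,
		MaxIdleConnsPerHost:   20, // the default of 2 is low for one busy PSP host
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   5 * time.Second,
		ResponseHeaderTimeout: 10 * time.Second,
	}
	// The client timeout bounds the whole request, including reading the body.
	return &http.Client{Transport: transport, Timeout: 15 * time.Second}
}
```

Create the client once at startup and share it. A connection returns to the idle pool only after the response body has been read to the end and closed.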
🧩 Health Checks and Alerts
Expose lightweight endpoints:
GET /healthz # system alive
GET /readyz # dependencies reachable
GET /metrics # Prometheus
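A sketch of the first two endpoints with net/http. Liveness answers without touching dependencies, while readiness runs one probe per dependency and reports 503 if any fails. The Check type, the RunChecks and Summarize helpers, and the two-second probe timeout are assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// Check probes one dependency (database, queue, vault).
type Check func(ctx context.Context) error

// ErrUnreachable is a sample error a Check might return.
var ErrUnreachable = errors.New("unreachable")

// RunChecks executes every probe and collects the results by name.
func RunChecks(ctx context.Context, checks map[string]Check) map[string]error {
	results := make(map[string]error, len(checks))
	for name, check := range checks {
		results[name] = check(ctx)
	}
	return results
}

// Summarize turns probe results into an HTTP status and a plain-text report.
func Summarize(results map[string]error) (int, string) {
	status, report := http.StatusOK, ""
	for name, err := range results {
		if err != nil {
			status = http.StatusServiceUnavailable
			report += fmt.Sprintf("%s: %v\n", name, err)
		}
	}
	if report == "" {
		report = "ok\n"
	}
	return status, report
}

// NewHealthMux wires /healthz (liveness) and /readyz (readiness).
func NewHealthMux(checks map[string]Check) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok") // process is up; no dependency calls here
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		status, report := Summarize(RunChecks(ctx, checks))
		w.WriteHeader(status)
		fmt.Fprint(w, report)
	})
	return mux
}
```

Keeping liveness free of dependency calls matters: if /healthz failed whenever the database blipped, the orchestrator would restart healthy processes instead of just taking them out of rotation.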
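Example alert rule:
- alert: HighPSPErrorRate
  # psp_calls_total is an assumed counter of all PSP calls; it is not in the metrics table above.
  expr: sum(rate(psp_call_errors_total[5m])) / sum(rate(psp_calls_total[5m])) > 0.01
  for: 10m
  labels: { severity: critical }
  annotations:
    description: "PSP error ratio has stayed above 1% for 10 minutes."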
🚀 Scaling the Rebill Scheduler
For the rebill workers introduced in Chapter 4:
- Use a dedicated queue per frequency (rebill.daily, rebill.monthly) for parallelism.
- Shard jobs across workers, using the tenant ID as the sharding key.
- Persist job offsets to resume after restarts.
- Monitor rebill_jobs_failed_total and rebill_queue_depth.
Each worker should emit heartbeats; the scheduler can reassign stale jobs automatically if a worker dies mid-rebill.
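Tenant sharding and stale-job detection can both be sketched in a few lines. The example assumes an FNV-1a hash of the tenant ID for shard assignment and a fixed heartbeat timeout; the Job type and both function names are hypothetical, and the reassignment itself (updating the job's owner in the database) is left out.

```go
package main

import (
	"hash/fnv"
	"time"
)

// ShardFor maps a tenant ID to one of n workers. The same tenant always
// lands on the same shard for as long as n does not change.
func ShardFor(tenantID string, n int) int {
	if n <= 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(tenantID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(n))
}

// Job is a hypothetical in-flight rebill job.
type Job struct {
	ID            string
	WorkerID      string
	LastHeartbeat time.Time
}

// StaleJobs returns the jobs whose last heartbeat is older than timeout,
// measured against now. The scheduler can hand these to another worker.
func StaleJobs(jobs []Job, now time.Time, timeout time.Duration) []Job {
	var stale []Job
	for _, j := range jobs {
		if now.Sub(j.LastHeartbeat) > timeout {
			stale = append(stale, j)
		}
	}
	return stale
}
```

Note that plain modulo sharding moves most tenants when the worker count changes; if workers scale often, consistent hashing is the usual alternative. Reassigned jobs can still run twice when a slow worker wakes up, so the rebill itself should stay idempotent.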
🚀 Coming Next
In Chapter 6 – Beyond the Gateway, we’ll bring everything together:
- Building an operator dashboard in React + Go
- Integrating with ERP / CRM systems
- Sharing the lessons learned from building an end-to-end payment infrastructure
Your gateway now operates at production scale — resilient, observable, and built for growth.