🧩 Building a Modern Payment Gateway with a Rebill Scheduler: Chapter 5, Monitoring & Scalability

Sébastien Techer · 07/11/2025

Why you might want to build your own payment gateway — and how to do it safely with an external PCI-compliant vault.

🩺 Why Monitoring Matters

A payment gateway doesn’t fail loudly — it fails silently:
timeouts, webhook delays, duplicate retries, subtle reconciliation gaps.

Without strong observability, you only find out when accounting calls.
That’s why monitoring, tracing, and auditing aren’t optional features —
they’re your operational lifeline.


🔍 Three Pillars of Observability

| Pillar | Tooling | Purpose |
|---|---|---|
| Logs | Structured (JSON) logs via Zerolog or Zap | Record events and errors with correlation IDs |
| Metrics | Prometheus / OpenTelemetry Metrics | Quantify throughput, latency, failures |
| Traces | OpenTelemetry Tracing + Jaeger | Follow a payment across microservices |

Each log, trace, and metric should share a correlation ID — for example, the payment’s IntentID or SubscriptionItemID.
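
One possible way to carry that correlation ID is to store it in the request context and let a logging wrapper attach it to every line. The sketch below uses the standard library's log/slog (Go 1.21+) instead of Zerolog or Zap only to stay dependency-free. The names WithCorrelationID, correlationHandler, and the "correlation_id" field are illustrative assumptions, not part of the series' codebase.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log/slog"
)

// ctxKey is unexported so the context key cannot collide with other packages.
type ctxKey struct{}

// WithCorrelationID returns a context carrying the ID (for example an IntentID).
func WithCorrelationID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, ctxKey{}, id)
}

// CorrelationID extracts the ID, or "" when none was set.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// correlationHandler wraps another slog.Handler and adds the ID to every record.
type correlationHandler struct{ slog.Handler }

func (h correlationHandler) Handle(ctx context.Context, r slog.Record) error {
	if id := CorrelationID(ctx); id != "" {
		r.AddAttrs(slog.String("correlation_id", id))
	}
	return h.Handler.Handle(ctx, r)
}

// WithAttrs and WithGroup re-wrap the inner handler so derived loggers keep the behavior.
func (h correlationHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return correlationHandler{h.Handler.WithAttrs(attrs)}
}

func (h correlationHandler) WithGroup(name string) slog.Handler {
	return correlationHandler{h.Handler.WithGroup(name)}
}

// loggedCorrelationID logs one line into a buffer and returns the ID found in
// the resulting JSON, to show that the field really reaches the output.
func loggedCorrelationID(id string) (string, error) {
	var buf bytes.Buffer
	logger := slog.New(correlationHandler{slog.NewJSONHandler(&buf, nil)})

	ctx := WithCorrelationID(context.Background(), id)
	logger.InfoContext(ctx, "payment captured", "psp", "example-psp")

	var line map[string]any
	if err := json.Unmarshal(buf.Bytes(), &line); err != nil {
		return "", err
	}
	got, _ := line["correlation_id"].(string)
	return got, nil
}
```

The same context value can be copied onto span attributes and metric exemplars, so a single ID links all three signals.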


📊 Metrics to Watch

Define SLOs (Service Level Objectives) around reliability and latency.
Here are key metrics you should expose:

| Metric | Description | Example target |
|---|---|---|
| gateway_requests_total | Total API requests | |
| gateway_request_duration_seconds | Latency histogram | p95 < 400 ms |
| payments_inflight_total | Ongoing intents not finalized | < 100 |
| rebill_jobs_failed_total | Failed rebill attempts | 0 critical |
| webhook_lag_seconds | Delay between PSP event → processed | < 30 s |
| psp_call_errors_total | Provider API errors | < 0.1% |
| reconciliation_mismatches_total | Discrepancies per day | 0 critical |

Expose them on /metrics for Prometheus scraping or via OTLP to your telemetry backend.
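
To make the /metrics endpoint less abstract, here is a dependency-free sketch that renders two of the counters above in the Prometheus text exposition format. It is only meant to show what a scraper reads. A real service would normally rely on the official Prometheus or OpenTelemetry client library, which also handles labels and histograms. The Counter type and renderMetrics helper are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"sync/atomic"
)

// Counter is a monotonically increasing metric that is safe for concurrent use.
type Counter struct {
	name, help string
	value      atomic.Uint64
}

// Inc adds one to the counter.
func (c *Counter) Inc() { c.value.Add(1) }

var (
	requestsTotal  = &Counter{name: "gateway_requests_total", help: "Total API requests."}
	pspErrorsTotal = &Counter{name: "psp_call_errors_total", help: "Provider API errors."}
)

// renderMetrics writes each counter as HELP, TYPE, and sample lines.
func renderMetrics(counters ...*Counter) string {
	var sb strings.Builder
	for _, c := range counters {
		fmt.Fprintf(&sb, "# HELP %s %s\n# TYPE %s counter\n%s %d\n",
			c.name, c.help, c.name, c.name, c.value.Load())
	}
	return sb.String()
}

// metricsHandler serves the current values; mount it on /metrics.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, renderMetrics(requestsTotal, pspErrorsTotal))
}
```

Calling requestsTotal.Inc() in the API middleware and pspErrorsTotal.Inc() in the PSP client is enough for the endpoint to report live values.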


🧠 Tracing Example (OpenTelemetry)

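ctx, span := tracer.Start(ctx, "CreateIntent")
defer span.End()

// Attributes known before the call are set first, so failed spans carry them too.
span.SetAttributes(
  attribute.String("psp", req.PSP),
  attribute.String("tenant", req.TenantID),
)

resp, err := psp.CreateIntent(ctx, req)
if err != nil {
  span.RecordError(err)
  span.SetStatus(codes.Error, "psp failure")
} else {
  // resp is only safe to read on success; it may be nil when err != nil.
  span.SetAttributes(attribute.String("intent.id", resp.Intent.ID))
}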

In Jaeger or Grafana Tempo, this gives a full cross-service timeline: API → Vault → Gateway → PSP → Webhook.

🧰 Logs and Audits

Structured logging is key:

log.Info().
  Str("intent", intent.ID).
  Str("psp", intent.PSP).
  Str("status", intent.Status).
  Msg("Payment captured successfully")

Each Payment, Refund, SubscriptionItem, and RebillJob should produce an audit entry:

  • Timestamp

  • Actor (API / Worker / Webhook)

  • Action (authorize, capture, rebill, reconcile)

  • Result and details

Store audits in a dedicated immutable table (payment_audit, rebill_audit) and expose them through a back-office dashboard for finance teams.
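
As a rough illustration of the audit fields listed above, the sketch below models an entry and an append-only log kept in memory. In the gateway itself the entries would go to the immutable database table. AuditEntry, AuditLog, and the EntityID field are hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// AuditEntry mirrors the fields every audited operation should record.
type AuditEntry struct {
	Timestamp time.Time
	Actor     string // "api", "worker", or "webhook"
	Action    string // "authorize", "capture", "rebill", "reconcile"
	EntityID  string // assumed: ID of the Payment, Refund, SubscriptionItem, or RebillJob
	Result    string
	Details   string
}

// AuditLog is append-only: it offers no update or delete operation.
type AuditLog struct {
	mu      sync.Mutex
	entries []AuditEntry
}

// Append validates and stores an entry, stamping it when no time was given.
func (l *AuditLog) Append(e AuditEntry) error {
	if e.Actor == "" || e.Action == "" || e.EntityID == "" {
		return errors.New("audit: actor, action, and entity ID are required")
	}
	if e.Timestamp.IsZero() {
		e.Timestamp = time.Now().UTC()
	}
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, e)
	return nil
}

// Entries returns a copy, so callers cannot alter recorded history.
func (l *AuditLog) Entries() []AuditEntry {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]AuditEntry(nil), l.entries...)
}
```

With a SQL table, the same guarantee is usually obtained by granting the application role INSERT and SELECT only.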

🧩 Distributed Scaling

Your gateway must scale horizontally without sacrificing consistency.

Stateless API nodes

  • Use Redis or a DB for locks / idempotency.

  • Load-balance requests with sticky sessions disabled.
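
The idempotency check behind stateless nodes boils down to "claim this key if nobody else has, with an expiry". A minimal sketch of that contract follows, using an in-memory map as a stand-in for Redis or a database row. IdempotencyStore and the key format are assumptions. In production the claim has to live in shared storage, otherwise two nodes could both accept the same request.

```go
package main

import (
	"sync"
	"time"
)

// IdempotencyStore mimics a "set if not exists, with TTL" primitive.
type IdempotencyStore struct {
	mu      sync.Mutex
	expires map[string]time.Time // key -> time at which the claim lapses
}

// NewIdempotencyStore returns an empty store.
func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{expires: make(map[string]time.Time)}
}

// Acquire reports whether the caller is the first to claim key within ttl.
// A false result means the request is a duplicate and should not be replayed.
func (s *IdempotencyStore) Acquire(key string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()

	now := time.Now()
	if exp, ok := s.expires[key]; ok && now.Before(exp) {
		return false // still claimed by an earlier request
	}
	s.expires[key] = now.Add(ttl)
	return true
}
```

A handler would typically build the key from the tenant and the client-supplied idempotency key, call Acquire before touching the PSP, and return the stored response when Acquire is false.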

Worker autoscaling

  • Each worker consumes from queues (Kafka, SQS, RabbitMQ).

  • Scale by queue depth or CPU usage.

  • Use consumer-group semantics so each message goes to a single worker, and keep handlers idempotent because delivery is still at-least-once.
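
Scaling on queue depth can be reduced to a small, testable rule. The function below is one way to express it. The perWorker value (how many queued jobs one worker is expected to absorb) and the min/max bounds are tuning assumptions that each deployment has to pick for itself.

```go
package main

// desiredWorkers returns how many workers to run for a given backlog.
// It divides the queue depth by the per-worker capacity, rounding up,
// then clamps the result between minWorkers and maxWorkers.
func desiredWorkers(queueDepth, perWorker, minWorkers, maxWorkers int) int {
	if perWorker <= 0 {
		perWorker = 1 // avoid division by zero on a bad configuration
	}
	n := (queueDepth + perWorker - 1) / perWorker // ceiling division
	if n < minWorkers {
		n = minWorkers
	}
	if n > maxWorkers {
		n = maxWorkers
	}
	return n
}
```

An autoscaler loop would poll the queue, call desiredWorkers, and adjust the replica count. The upper bound matters because every extra worker also adds database connections and PSP traffic.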

Database connections

  • Use connection pooling (pgx / pgbouncer).

  • Avoid long transactions during PSP calls.

  • Use the outbox pattern to publish events safely: write the event in the same transaction as the state change, and relay it after the commit.
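
The outbox idea can be sketched with database/sql as follows: the state change and the event row use the same transaction, so either both commit or neither does. Table names, column names, the topic string, and the PaymentCaptured type are assumptions, and the placeholders use PostgreSQL syntax. The small execer interface is satisfied by *sql.Tx. recordingTx exists only to exercise the sketch without a database.

```go
package main

import (
	"context"
	"database/sql"
	"encoding/json"
	"fmt"
)

// execer is the subset of *sql.Tx this sketch needs.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// PaymentCaptured is a hypothetical event payload.
type PaymentCaptured struct {
	IntentID    string `json:"intent_id"`
	AmountCents int64  `json:"amount_cents"`
}

// captureWithOutbox updates the payment and queues the event in one transaction.
// The caller commits tx; a separate relay later publishes unsent outbox rows.
func captureWithOutbox(ctx context.Context, tx execer, ev PaymentCaptured) error {
	payload, err := json.Marshal(ev)
	if err != nil {
		return fmt.Errorf("marshal event: %w", err)
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE payments SET status = 'captured' WHERE intent_id = $1`,
		ev.IntentID); err != nil {
		return fmt.Errorf("update payment: %w", err)
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (topic, payload) VALUES ($1, $2)`,
		"payment.captured", payload); err != nil {
		return fmt.Errorf("insert outbox row: %w", err)
	}
	return nil
}

// recordingTx is a stand-in that records the statements it receives.
type recordingTx struct{ queries []string }

func (r *recordingTx) ExecContext(_ context.Context, query string, _ ...any) (sql.Result, error) {
	r.queries = append(r.queries, query)
	return nil, nil
}
```

Note that the PSP call itself stays outside the transaction, in line with the advice above about avoiding long transactions.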

🧱 Fault Tolerance Patterns

| Pattern | Purpose | Example |
|---|---|---|
| Circuit Breaker | Prevent flooding failing PSPs | Trip after N failures |
| Retry with Backoff | Smooth transient errors | Exponential + jitter |
| Outbox / Inbox | Deliver events at least once and deduplicate on receipt, so each takes effect once | Rebill → Webhook |
| Saga Pattern | Compensate failed multi-step ops | Refund after capture failure |
| Dead-Letter Queue | Persist unrecoverable jobs | Manual review |

Combine these patterns to achieve eventual consistency with auditability.
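
Two of these patterns fit in a few lines each. The sketch below shows a consecutive-failure circuit breaker and an exponential backoff with full jitter, both written against the standard library only. Thresholds, names (Breaker, backoff, ErrCircuitOpen), and the simplified recovery behavior are assumptions. A hardened breaker would also limit how many trial calls pass after the cooldown.

```go
package main

import (
	"errors"
	"math/rand"
	"sync"
	"time"
)

// ErrCircuitOpen is returned when calls are suspended.
var ErrCircuitOpen = errors.New("circuit open: PSP calls suspended")

// Breaker trips after maxFailures consecutive failures and stays open for cooldown.
type Breaker struct {
	mu          sync.Mutex
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

// NewBreaker configures the failure threshold and the cooldown.
func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. Once the cooldown has elapsed,
// calls are let through again; a success closes the breaker, a failure re-opens it.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrCircuitOpen
	}

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}

// backoff returns the wait before retry number attempt (0-based): the delay
// doubles from base up to ceiling, then a random value in [0, delay] is picked.
func backoff(attempt int, base, ceiling time.Duration) time.Duration {
	if base <= 0 || ceiling <= 0 {
		return 0
	}
	d := base
	for i := 0; i < attempt && d < ceiling; i++ {
		d *= 2
	}
	if d > ceiling {
		d = ceiling
	}
	return time.Duration(rand.Int63n(int64(d) + 1))
}
```

A PSP client would wrap each provider call in Breaker.Call and sleep for backoff(attempt, ...) between retries, giving up and routing the job to the dead-letter queue after a fixed number of attempts.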

🔐 Secrets and Key Management

Treat credentials like radioactive material.

  • Store PSP API keys and Vault tokens in a secret manager (AWS Secrets Manager, GCP Secret Manager), with the encryption keys held in a KMS (AWS KMS, Cloud KMS).

  • Use short-lived credentials when possible.

  • Rotate keys automatically every 90 days.

  • Separate production, staging, and sandbox environments entirely.

  • Log only the last 4 characters of any key when debugging.
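
The last rule is easy to enforce with a tiny helper that every log call goes through. This is a sketch: maskSecret is a hypothetical name, and the fixed-width mask is a design choice so that the log line does not reveal the key's length either.

```go
package main

// maskSecret returns a fixed mask followed by the last four characters of s.
// Values of four characters or fewer are hidden entirely.
func maskSecret(s string) string {
	const visible = 4
	const mask = "****"

	r := []rune(s) // work on runes so multi-byte characters are not split
	if len(r) <= visible {
		return mask
	}
	return mask + string(r[len(r)-visible:])
}
```

For example, maskSecret("sk_live_abcdef1234") yields "****1234", which is enough to tell two keys apart during debugging without exposing either.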

🧮 Performance and Cost Optimization

  1. Batch operations for reconciliation and webhook acknowledgments.

  2. Reuse connections by sharing a single http.Client and http.Transport with keep-alive enabled.

  3. Cache static vault mappings or PSP configuration in Redis.

  4. Enable read replicas for analytics queries.

  5. Compress large audit payloads before storage.
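
Items 2 and 5 can be sketched with the standard library alone. The pool sizes and timeouts below are placeholder values to tune, not recommendations from the series, and newPSPClient, compressAudit, and decompressAudit are illustrative names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"io"
	"net/http"
	"time"
)

// newPSPClient returns a client meant to be created once and shared, so that
// TCP and TLS connections to the PSP are reused across requests.
func newPSPClient() *http.Client {
	// Clone keeps the defaults (proxy, dialer, HTTP/2) of the standard transport.
	// The assertion assumes http.DefaultTransport has not been replaced.
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.MaxIdleConns = 100
	transport.MaxIdleConnsPerHost = 20 // the default of 2 is low for a single busy PSP host
	transport.IdleConnTimeout = 90 * time.Second

	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}

// compressAudit gzips an audit payload before it is stored.
func compressAudit(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(payload); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil { // Close flushes the gzip footer
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompressAudit restores a payload written by compressAudit.
func decompressAudit(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}
```

Response bodies still need to be fully read and closed by the caller, otherwise the underlying connection cannot go back to the idle pool.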

🧩 Health Checks and Alerts

Expose lightweight endpoints:

GET /healthz         # system alive
GET /readyz          # dependencies reachable
GET /metrics         # Prometheus
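
A minimal sketch of the first two endpoints, assuming each dependency (database, Redis, queue) is wrapped in a small probe function. Check, fixedCheck, ErrNotReady, and statusOf are names invented for this example. statusOf uses httptest so the handlers can be exercised without opening a port.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"time"
)

// ErrNotReady is a generic failure a probe can return.
var ErrNotReady = errors.New("dependency not ready")

// Check probes one dependency and returns nil when it is reachable.
type Check func(ctx context.Context) error

// fixedCheck returns a Check that always yields err; handy in tests.
func fixedCheck(err error) Check {
	return func(context.Context) error { return err }
}

// healthz only says the process is alive; it must not touch dependencies.
func healthz(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
}

// readyz answers 503 as soon as one dependency check fails.
func readyz(checks map[string]Check) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, check := range checks {
			if err := check(ctx); err != nil {
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// newHealthMux wires both endpoints onto a mux.
func newHealthMux(checks map[string]Check) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", healthz)
	mux.HandleFunc("/readyz", readyz(checks))
	return mux
}

// statusOf sends an in-memory GET to h and returns the response status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}
```

Keeping the two endpoints separate lets an orchestrator restart a dead process on /healthz failures while merely removing a node from the load balancer when /readyz fails.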

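Example alert rule:

- alert: HighPSPErrorRate
  # Ratio of failed PSP calls to all PSP calls. psp_calls_total is an assumed
  # counter of every provider call; it is not part of the metrics listed earlier.
  expr: sum(rate(psp_call_errors_total[5m])) / sum(rate(psp_calls_total[5m])) > 0.01
  for: 10m
  labels: { severity: critical }
  annotations:
    description: "PSP error ratio above 1% (5 min rate) for 10 minutes."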

🚀 Scaling the Rebill Scheduler

For the rebill workers introduced in Chapter 4:

  • Use a dedicated queue per frequency (rebill.daily, rebill.monthly) for parallelism.
  • Shard jobs across workers, using the tenant as the sharding key.
  • Persist job offsets to resume after restarts.
  • Monitor rebill_jobs_failed_total and rebill_queue_depth.
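
Tenant-based sharding can be as simple as hashing the tenant ID and taking it modulo the number of shards. The sketch below assumes a fixed shard count. shardFor is a hypothetical helper, and FNV-1a is just one reasonable non-cryptographic hash from the standard library.

```go
package main

import "hash/fnv"

// shardFor maps a tenant ID to a shard index in [0, shards).
// The same tenant always lands on the same shard for a given shard count.
func shardFor(tenantID string, shards int) int {
	if shards <= 0 {
		return 0 // treat a bad configuration as a single shard
	}
	h := fnv.New32a()
	h.Write([]byte(tenantID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(shards))
}
```

The trade-off is that changing the shard count remaps most tenants. If workers are added or removed often, consistent hashing is the usual alternative.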

Each worker should emit heartbeats; the scheduler can reassign stale jobs automatically if a worker dies mid-rebill.
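
The heartbeat mechanism can be sketched as a tracker that records the last signal per job and reports the jobs that have gone quiet. This version keeps state in memory to stay short. In a multi-node deployment the timestamps would live in the database or Redis so any scheduler instance can see them. HeartbeatTracker and its methods are assumed names.

```go
package main

import (
	"sort"
	"sync"
	"time"
)

// HeartbeatTracker remembers when each in-flight job last reported in.
type HeartbeatTracker struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time // job ID -> last heartbeat
}

// NewHeartbeatTracker returns an empty tracker.
func NewHeartbeatTracker() *HeartbeatTracker {
	return &HeartbeatTracker{lastSeen: make(map[string]time.Time)}
}

// Beat records that the worker handling jobID is still alive.
func (t *HeartbeatTracker) Beat(jobID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastSeen[jobID] = time.Now()
}

// Done forgets a job that finished, so it is never reported as stale.
func (t *HeartbeatTracker) Done(jobID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.lastSeen, jobID)
}

// Stale returns, in sorted order, the jobs whose last heartbeat is older
// than timeout. The scheduler can then reassign them to another worker.
func (t *HeartbeatTracker) Stale(timeout time.Duration) []string {
	t.mu.Lock()
	defer t.mu.Unlock()

	var stale []string
	for id, seen := range t.lastSeen {
		if time.Since(seen) > timeout {
			stale = append(stale, id)
		}
	}
	sort.Strings(stale)
	return stale
}
```

Reassignment is only safe if the rebill itself is idempotent, since the original worker may have reached the PSP just before it stopped sending heartbeats.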

🚀 Coming Next

In Chapter 6 – Beyond the Gateway, we’ll bring everything together:

  • Building an operator dashboard in React + Go,

  • Integrating with ERP / CRM systems,

  • And sharing the lessons learned from building an end-to-end payment infrastructure.

Your gateway now operates at production scale — resilient, observable, and built for growth.
