Senior Technical Expert: Resilient Architecture & Security

🩺 Why Monitoring Matters

A payment gateway doesn’t fail loudly — it fails silently:
timeouts, webhook delays, duplicate retries, subtle reconciliation gaps.

Without strong observability, you only find out when accounting calls.
That’s why monitoring, tracing, and auditing aren’t optional features —
they’re your operational lifeline.

🔍 Three Pillars of Observability

Pillar	Tooling	Purpose
Logs	Structured (JSON) logs via Zerolog or Zap	Record events and errors with correlation IDs
Metrics	Prometheus / OpenTelemetry Metrics	Quantify throughput, latency, failures
Traces	OpenTelemetry Tracing + Jaeger	Follow a payment across microservices

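Each log line and trace span should carry a shared correlation ID, for example the payment's IntentID or SubscriptionItemID. Keep such IDs out of metric labels, where every distinct value creates a new time series; link metrics to traces with exemplars instead.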
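One way to carry that ID is through context.Context, so that every layer can log it without extra function parameters. The sketch below is a minimal illustration, not the gateway's actual code: the helper names (WithCorrelationID, StartJob, LoggerFrom) are hypothetical, and it uses the standard library's log/slog (Go 1.21+) rather than Zerolog or Zap to stay dependency-free.

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

// correlationKey is an unexported key type, which avoids collisions with other packages.
type correlationKey struct{}

// WithCorrelationID returns a child context carrying id (e.g. an IntentID).
func WithCorrelationID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, correlationKey{}, id)
}

// StartJob creates a root context for work with no inbound request, such as a rebill job.
func StartJob(id string) context.Context {
	return WithCorrelationID(context.Background(), id)
}

// CorrelationID returns the ID stored in ctx, or "" if none was set.
func CorrelationID(ctx context.Context) string {
	id, _ := ctx.Value(correlationKey{}).(string)
	return id
}

// LoggerFrom returns a logger that stamps every record with the correlation ID.
func LoggerFrom(ctx context.Context, base *slog.Logger) *slog.Logger {
	return base.With("correlation_id", CorrelationID(ctx))
}

func main() {
	base := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	ctx := StartJob("pi_123") // hypothetical IntentID
	LoggerFrom(ctx, base).Info("intent created", "psp", "example-psp")
}
```

The same ID can be attached to the active span as an attribute, which lets you jump from a log line to its trace.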

📊 Metrics to Watch

Define SLOs (Service Level Objectives) around reliability and latency.
Here are key metrics you should expose:

Metric	Description	Example target
`gateway_requests_total`	Total API requests	—
`gateway_request_duration_seconds`	Latency histogram	p95 < 400 ms
`payments_inflight_total`	Ongoing intents not finalized	< 100
`rebill_jobs_failed_total`	Failed rebill attempts	0 critical
`webhook_lag_seconds`	Delay between PSP event → processed	< 30 s
`psp_call_errors_total`	Provider API errors	< 0.1%
`reconciliation_mismatches_total`	Discrepancies per day	0 critical

Expose them on /metrics for Prometheus scraping or via OTLP to your telemetry backend.
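To make the scrape endpoint concrete, here is a dependency-free sketch that counts requests and renders gateway_requests_total in the Prometheus text exposition format. It is an illustration only: a real service would more likely use an official Prometheus or OpenTelemetry client library, which also handles histograms and labels. The Metrics type and the /intents route are assumptions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Metrics holds the counters the service exposes (only one in this sketch).
type Metrics struct {
	requests atomic.Uint64 // backs gateway_requests_total
}

// Count wraps a handler and increments the request counter on every call.
func (m *Metrics) Count(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		m.requests.Add(1)
		next.ServeHTTP(w, r)
	})
}

// Render returns the counter in the Prometheus text exposition format.
func (m *Metrics) Render() string {
	return "# HELP gateway_requests_total Total API requests\n" +
		"# TYPE gateway_requests_total counter\n" +
		fmt.Sprintf("gateway_requests_total %d\n", m.requests.Load())
}

// ServeHTTP lets Metrics be mounted directly on /metrics.
func (m *Metrics) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	io.WriteString(w, m.Render())
}

func main() {
	m := &Metrics{}
	mux := http.NewServeMux()
	mux.Handle("/metrics", m)
	mux.Handle("/intents", m.Count(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusAccepted)
	})))

	// Simulate one API call followed by a scrape, without opening a socket.
	mux.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodPost, "/intents", nil))
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```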

🧠 Tracing Example (OpenTelemetry)

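// Assumes tracer is an OpenTelemetry trace.Tracer and that the enclosing
// function returns (*Intent, error).
ctx, span := tracer.Start(ctx, "CreateIntent")
defer span.End()

// Attributes known up front are set first so that failed spans carry them too.
span.SetAttributes(
  attribute.String("psp", req.PSP),
  attribute.String("tenant", req.TenantID),
)

resp, err := psp.CreateIntent(ctx, req)
if err != nil {
  span.RecordError(err)
  span.SetStatus(codes.Error, "psp failure")
  return nil, err // resp may be nil here, so do not read resp.Intent
}
span.SetAttributes(attribute.String("intent.id", resp.Intent.ID))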

In Jaeger or Grafana Tempo, this gives a full cross-service timeline: API → Vault → Gateway → PSP → Webhook.

🧰 Logs and Audits

Structured logging is key:

log.Info().
  Str("intent", intent.ID).
  Str("psp", intent.PSP).
  Str("status", intent.Status).
  Msg("Payment captured successfully")

Each Payment, Refund, SubscriptionItem, and RebillJob should produce an audit entry:

Timestamp
Actor (API / Worker / Webhook)
Action (authorize, capture, rebill, reconcile)
Result and details

Store audits in a dedicated immutable table (payment_audit, rebill_audit) and expose them through a back-office dashboard for finance teams.
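As a rough sketch of the shape such an entry could take, the Go types below mirror the four fields listed above and model an append-only log in memory. The type and field names are assumptions. In a real database, immutability would be enforced by the store itself, for example by granting the application role INSERT and SELECT only.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// AuditEntry mirrors the fields listed above.
type AuditEntry struct {
	Timestamp time.Time
	Actor     string // "api", "worker" or "webhook"
	Action    string // "authorize", "capture", "rebill", "reconcile"
	Result    string
	Details   string
}

// AuditLog is an in-memory stand-in for an append-only table such as payment_audit.
// It deliberately offers no update or delete method.
type AuditLog struct {
	mu      sync.Mutex
	entries []AuditEntry
}

// Append records a new entry stamped with the current UTC time.
func (l *AuditLog) Append(actor, action, result, details string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, AuditEntry{
		Timestamp: time.Now().UTC(),
		Actor:     actor,
		Action:    action,
		Result:    result,
		Details:   details,
	})
}

// Entries returns a copy, so callers cannot alter recorded history.
func (l *AuditLog) Entries() []AuditEntry {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]AuditEntry(nil), l.entries...)
}

func main() {
	var audit AuditLog
	audit.Append("webhook", "capture", "ok", "intent=pi_123") // hypothetical values
	for _, e := range audit.Entries() {
		fmt.Println(e.Timestamp.Format(time.RFC3339), e.Actor, e.Action, e.Result, e.Details)
	}
}
```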

🧩 Distributed Scaling

Your gateway must scale horizontally without sacrificing consistency.

Stateless API nodes

Use Redis or a DB for locks / idempotency.
Load-balance requests with sticky sessions disabled.
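Because any node may receive a retried request, the idempotency check has to live in shared storage. The sketch below shows the logic with an in-memory map standing in for Redis or a uniquely keyed DB table. The IdempotencyStore type is hypothetical, and a shared store would reserve the key atomically (for instance with Redis SET ... NX) instead of holding a process-local mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyStore is an in-memory stand-in for Redis or a DB table with a unique key.
type IdempotencyStore struct {
	mu      sync.Mutex
	results map[string]string
}

// NewIdempotencyStore returns an empty store.
func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{results: make(map[string]string)}
}

// Do runs fn at most once per successful key and replays the stored result afterwards.
// Holding the lock while fn runs keeps the sketch short; it also serializes all keys,
// which a production store would avoid.
func (s *IdempotencyStore) Do(key string, fn func() (string, error)) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if res, ok := s.results[key]; ok {
		return res, nil // replay: fn is not called again
	}
	res, err := fn()
	if err != nil {
		return "", err // failures are not cached, so the client may retry
	}
	s.results[key] = res
	return res, nil
}

func main() {
	store := NewIdempotencyStore()
	charge := func() (string, error) { return "captured", nil }
	fmt.Println(store.Do("idem-key-1", charge))
	fmt.Println(store.Do("idem-key-1", charge)) // replayed from the store
}
```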

Worker autoscaling

Each worker consumes from queues (Kafka, SQS, RabbitMQ).
Scale by queue depth or CPU usage.
Use consumer-group semantics so each message goes to one worker at a time, and keep handlers idempotent because delivery is still at-least-once.
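Scaling by queue depth usually comes down to a small calculation: divide the backlog by what one worker can absorb, then clamp the result. The function below is a sketch under that assumption. The name desiredWorkers and the numbers in main are illustrative, and an autoscaler (for example a Kubernetes one) would normally perform this calculation for you.

```go
package main

import "fmt"

// desiredWorkers returns how many workers to run for a given backlog.
// perWorker is the number of queued jobs one worker is expected to absorb.
func desiredWorkers(queueDepth, perWorker, minWorkers, maxWorkers int) int {
	if perWorker <= 0 {
		perWorker = 1 // guard against division by zero
	}
	n := (queueDepth + perWorker - 1) / perWorker // ceiling division
	if n < minWorkers {
		n = minWorkers
	}
	if n > maxWorkers {
		n = maxWorkers
	}
	return n
}

func main() {
	for _, depth := range []int{0, 250, 10000} {
		fmt.Printf("depth=%d workers=%d\n", depth, desiredWorkers(depth, 100, 1, 20))
	}
}
```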

Database connections

Use connection pooling (pgx / pgbouncer).
Avoid long transactions during PSP calls.
Employ outbox pattern to publish events safely after commits.

🧱 Fault Tolerance Patterns

Pattern	Purpose	Example
Circuit Breaker	Prevent flooding failing PSPs	Trip after N failures
Retry with Backoff	Smooth transient errors	Exponential + jitter
Outbox / Inbox	Deliver events at least once and deduplicate on receipt	Rebill → Webhook
Saga Pattern	Compensate failed multi-step ops	Void the authorization if capture fails
Dead-Letter Queue	Persist unrecoverable jobs	Manual review
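To illustrate the first row, here is a minimal circuit breaker that trips after N consecutive failures and rejects calls until a cooldown has passed. It is a simplified sketch, not a production implementation: the Breaker type and errPSP are hypothetical, and mature libraries add a stricter half-open state and per-PSP instances.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is rejecting calls.
var ErrOpen = errors.New("circuit open: PSP calls suspended")

// errPSP is a stand-in for a real provider error (demo only).
var errPSP = errors.New("psp unavailable")

// Breaker trips after maxFailures consecutive failures and stays open for cooldown.
type Breaker struct {
	mu          sync.Mutex
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

// NewBreaker returns a closed breaker.
func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. Once the cooldown has elapsed, calls are
// let through again: a success closes the breaker, a failure reopens it.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrOpen
	}

	err := fn() // the lock is not held during the PSP call

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}

func main() {
	b := NewBreaker(3, 30*time.Second)
	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(func() error { return errPSP }))
	}
}
```

In main, the first three calls reach the failing PSP and the remaining two are rejected with ErrOpen without touching it.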

Combine these patterns to achieve eventual consistency with auditability.
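As an illustration of "Exponential + jitter", the sketch below doubles a delay ceiling on each attempt and sleeps a random duration up to that ceiling (often called full jitter). The function names, limits, and the simulated PSP call are assumptions. In practice you would retry only errors known to be transient and pair retries with idempotency keys.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// backoffCeiling returns base * 2^attempt, capped at limit.
func backoffCeiling(base, limit time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		d = limit
	}
	return d
}

// Retry calls fn up to attempts times (attempts must be >= 1). Between tries it
// sleeps a random duration in [0, ceiling] and stops early if ctx is cancelled.
func Retry(ctx context.Context, attempts int, base, limit time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		sleep := time.Duration(rand.Int63n(int64(backoffCeiling(base, limit, i)) + 1))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// flakyDemo simulates a PSP call that succeeds on its third attempt.
func flakyDemo() (calls int, err error) {
	err = Retry(context.Background(), 5, time.Millisecond, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient PSP error")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := flakyDemo()
	fmt.Println("calls:", calls, "err:", err)
}
```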

🔐 Secrets and Key Management

Treat credentials like radioactive material.

Store PSP API keys and Vault tokens in a secret manager (AWS Secrets Manager, GCP Secret Manager), with the encryption keys held in a KMS (AWS KMS, GCP Cloud KMS).
Use short-lived credentials when possible.
Rotate keys automatically every 90 days.
Separate production, staging, and sandbox environments entirely.
Log only the last 4 characters of any key when debugging.
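The last rule is easy to enforce with a small helper that every log call goes through. The function below is a sketch with a hypothetical name. It assumes ASCII keys and uses a fixed-width mask so that the log does not reveal the key's length.

```go
package main

import "fmt"

// maskKey hides everything except the last 4 characters of a key.
// Keys shorter than 8 characters are masked entirely, since showing 4 of them
// would reveal half the secret or more. Assumes ASCII input.
func maskKey(key string) string {
	const visible = 4
	if len(key) < visible*2 {
		return "****"
	}
	return "****" + key[len(key)-visible:]
}

func main() {
	fmt.Println(maskKey("sk_live_abcd1234")) // hypothetical key
	fmt.Println(maskKey("short"))
}
```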

🧮 Performance and Cost Optimization

Batch operations for reconciliation and webhook acknowledgments.
Reuse connections by sharing one http.Client (and its http.Transport) across PSP calls, with keep-alives left enabled.
Cache static vault mappings or PSP configuration in Redis.
Enable read replicas for analytics queries.
Compress large audit payloads before storage.
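For the connection-reuse point, a shared client might look like the sketch below. The pool sizes and timeouts are illustrative assumptions rather than recommendations, and newPSPClient is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newPSPClient builds one client meant to be shared by all PSP calls, so idle
// TCP/TLS connections are reused instead of being re-established per request.
func newPSPClient() *http.Client {
	// Clone the default transport to keep its proxy, dial and HTTP/2 settings.
	transport := http.DefaultTransport.(*http.Transport).Clone()
	transport.MaxIdleConns = 100
	transport.MaxIdleConnsPerHost = 20 // the default of 2 is low for one busy PSP host
	transport.IdleConnTimeout = 90 * time.Second
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}
}

// pspClient is created once at startup and is safe for concurrent use.
var pspClient = newPSPClient()

func main() {
	fmt.Println("request timeout:", pspClient.Timeout)
}
```

Connections only return to the pool when the response body has been read to the end and closed, so callers should always do both.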

🧩 Health Checks and Alerts

Expose lightweight endpoints:

GET /healthz         # system alive
GET /readyz          # dependencies reachable
GET /metrics         # Prometheus
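A minimal sketch of the first two endpoints follows. Liveness answers without touching dependencies, while readiness runs a set of checks under a short timeout. The Check type, the check names, and the alwaysUp / alwaysDown stand-ins are assumptions; real checks would ping the database, Redis, or the queue.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Check reports whether one dependency is reachable.
type Check func(ctx context.Context) error

// Demo stand-ins for real dependency checks.
func alwaysUp(context.Context) error   { return nil }
func alwaysDown(context.Context) error { return errors.New("connection refused") }

// newHealthMux wires /healthz (liveness) and /readyz (readiness).
func newHealthMux(checks map[string]Check) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // process is up; no dependency calls here
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		for name, check := range checks {
			if err := check(ctx); err != nil {
				http.Error(w, name+": "+err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// probe issues an in-process GET and returns the status code (demo helper).
func probe(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newHealthMux(map[string]Check{"db": alwaysUp, "redis": alwaysDown})
	fmt.Println("healthz:", probe(mux, "/healthz"), "readyz:", probe(mux, "/readyz"))
}
```

Keeping /healthz free of dependency calls means a database outage takes nodes out of rotation through /readyz without causing the orchestrator to restart otherwise healthy processes.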

Alert rules examples:

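- alert: HighPSPErrorRate
  # psp_calls_total is an assumed counter of all PSP calls; it is not in the metrics table above.
  # Dividing by it turns the error rate (errors per second) into a ratio.
  expr: sum(rate(psp_call_errors_total[5m])) / sum(rate(psp_calls_total[5m])) > 0.01
  for: 10m
  labels: { severity: critical }
  annotations:
    description: "PSP error ratio has stayed above 1% (5m rate) for 10 minutes."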

🚀 Scaling the Rebill Scheduler

For the rebill workers introduced in Chapter 4:

Use a dedicated queue per frequency (rebill.daily, rebill.monthly) for parallelism.
Shard jobs across workers, using the tenant as the sharding key.
Persist job offsets to resume after restarts.
Monitor rebill_jobs_failed_total and rebill_queue_depth.
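Sharding by tenant can be as simple as hashing the tenant ID onto a worker index, as sketched below with a hypothetical shardFor function. Note the assumption baked in: plain modulo hashing remaps most tenants whenever the worker count changes, so a setup that resizes often may prefer consistent hashing or queue partitions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a tenant ID to a worker index in [0, workers) using a stable FNV-1a hash.
// The same tenant always lands on the same worker while the worker count is unchanged.
func shardFor(tenantID string, workers int) int {
	if workers <= 0 {
		return 0 // guard against division by zero
	}
	h := fnv.New32a()
	h.Write([]byte(tenantID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(workers))
}

func main() {
	for _, tenant := range []string{"tenant-a", "tenant-b", "tenant-c"} { // hypothetical IDs
		fmt.Printf("%s -> worker %d\n", tenant, shardFor(tenant, 4))
	}
}
```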

Each worker should emit heartbeats; the scheduler can reassign stale jobs automatically if a worker dies mid-rebill.
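The heartbeat mechanism can be sketched as a table of last-seen timestamps plus a sweep that moves jobs away from silent workers. Everything here is illustrative: the Scheduler type and its methods are hypothetical, state is kept in memory rather than in a shared store, and timestamps are passed in as Unix seconds to keep the logic deterministic.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

// Scheduler tracks which worker owns each job and when each worker last reported in.
type Scheduler struct {
	mu       sync.Mutex
	ttl      int64             // seconds without a heartbeat before a worker is stale
	lastSeen map[string]int64  // workerID -> Unix seconds of last heartbeat
	jobOwner map[string]string // jobID -> workerID
}

// NewScheduler returns a scheduler with the given staleness threshold in seconds.
func NewScheduler(ttlSeconds int64) *Scheduler {
	return &Scheduler{ttl: ttlSeconds, lastSeen: map[string]int64{}, jobOwner: map[string]string{}}
}

// Heartbeat records that workerID was alive at the given time.
func (s *Scheduler) Heartbeat(workerID string, at int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastSeen[workerID] = at
}

// Assign gives jobID to workerID.
func (s *Scheduler) Assign(jobID, workerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.jobOwner[jobID] = workerID
}

// Owner returns the worker currently responsible for jobID.
func (s *Scheduler) Owner(jobID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.jobOwner[jobID]
}

// ReassignStale moves every job whose owner has been silent for longer than ttl
// to fallback and returns the moved job IDs in sorted order.
func (s *Scheduler) ReassignStale(now int64, fallback string) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var moved []string
	for job, owner := range s.jobOwner {
		if owner != fallback && now-s.lastSeen[owner] > s.ttl {
			s.jobOwner[job] = fallback
			moved = append(moved, job)
		}
	}
	sort.Strings(moved)
	return moved
}

func main() {
	now := time.Now().Unix()
	s := NewScheduler(30)
	s.Heartbeat("worker-1", now-120) // silent for two minutes
	s.Heartbeat("worker-2", now)
	s.Assign("rebill-42", "worker-1") // hypothetical job ID
	fmt.Println("reassigned:", s.ReassignStale(now, "worker-2"))
}
```

Reassignment is only safe if the rebill itself is idempotent: the dead worker may already have reached the PSP, so the replacement should reuse the same idempotency key rather than start a fresh charge.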

🚀 Coming Next

In Chapter 6 – Beyond the Gateway, we’ll bring everything together:

Building an operator dashboard in React + Go
Integrating with ERP / CRM systems
Sharing the lessons learned from building an end-to-end payment infrastructure

Your gateway now operates at production scale — resilient, observable, and built for growth.

🧩 CH5 - Building a Modern Payment Gateway with a Rebill Scheduler - Chapter 5 – Monitoring & Scalability