
Zenith – Distributed Event Observer

Cloud-native Go event pipeline: high-throughput ingestion, dynamic rule evaluation, and multi-sink dispatch. gRPC, CockroachDB, OpenTelemetry, Kubernetes, Terraform on GCP.

📖 The Problem: "When Volume Kills Visibility"

Modern distributed systems generate massive event streams continuously: payments, clicks, errors, user actions. Without dedicated infrastructure, these events vanish into unreadable logs or bury teams under thousands of unsorted alerts. Every signal's value drowns in the noise.

Zenith is the answer: a cloud-native Go pipeline that intercepts incoming events, evaluates each one against a set of dynamic business rules, and automatically triggers the right actions to the right recipients.

A concrete example:

If purchase.amount > 500 AND user.segment == 'VIP', trigger a high-priority Slack alert. Otherwise, silently archive to S3.

What sets Zenith apart from a basic webhook or a simple Kafka worker is its three-layer pipeline architecture with a dynamic JSONB rule engine, end-to-end observability via OpenTelemetry, and a reproducible cloud-native deployment on GCP.


πŸ—οΈ Architecture: 3-Layer Pipeline

Zenith Architecture

The pipeline follows a linear, non-blocking flow where each layer communicates with the next via in-process Go channels, with zero network overhead.

1. Ingestor – The Entry Point

The ingestion service implements IngestorService via ConnectRPC, a layer that simultaneously exposes a native gRPC endpoint and a REST endpoint (HTTP/2 via h2c), with no additional configuration. Each incoming request is:

  1. Authenticated via X-Api-Key (bcrypt hashing, key bound to a Source)
  2. Validated and parsed into a domain.Event
  3. Enqueued into the engine_in channel toward the Rule Engine

A configurable goroutine pool (ENGINE_WORKER_COUNT, default 10) ensures ingestion never blocks, even under load.
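This fan-out pattern can be sketched as follows; the Event type, function names, and buffer size are illustrative, not Zenith's actual internals:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Event is a minimal stand-in for the project's domain.Event type.
type Event struct{ ID string }

// processWithPool fans events out to n worker goroutines (sized by
// ENGINE_WORKER_COUNT in Zenith, default 10) and returns how many were
// handled. The channel is buffered so the producer (the ingestion
// handler) rarely blocks on evaluation.
func processWithPool(n int, events []Event) int {
	in := make(chan Event, 64)
	var handled atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range in { // each worker drains until the channel closes
				handled.Add(1)
			}
		}()
	}
	for _, ev := range events {
		in <- ev
	}
	close(in) // signals workers that no more events are coming
	wg.Wait()
	return int(handled.Load())
}

func main() {
	fmt.Println(processWithPool(10, make([]Event, 100))) // 100: nothing dropped
}
```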

2. Rule Engine – The Brain

The rule engine evaluates each event against all active rules linked to its source. Rules are stored as JSONB in CockroachDB, with a structured condition format:

{"field": "amount", "operator": ">", "value": 1000}

Six operators are supported (==, !=, >, >=, <, <=) over dynamic types (numeric and string). Pure condition evaluation is zero-allocation on the hot path: 13.6 ns per evaluation measured in benchmarks.

Matched events are sent into the engine_out channel toward the Dispatcher.

3. Dispatcher – The Action Router

The Dispatcher orchestrates a worker pool that consumes matched events and routes them to the appropriate sinks (Discord, HTTP webhooks, S3). Each dispatch implements retry with exponential backoff and writes an audit_log record to CockroachDB; success or failure, every action is traceable.

Why Go Channels

Sub-millisecond latency (in-process channels). Planned evolution to NATS/Kafka for independent component scaling.

Choosing Go channels over an external queue is not a simplification; it's a deliberate architectural decision: latency takes priority over independent scalability.


🧠 Under the Hood: The Real Challenges

Challenge 1 – Graceful Shutdown Without Event Loss

A SIGTERM during evaluation can silently drop in-flight events. The solution is a complete, ordered pipeline drain:

SIGTERM received
  → Stop accepting (no new requests)
  → Drain engine_in (flush events waiting for evaluation)
  → Drain engine_out (flush matched events waiting for dispatch)
  → Clean exit

Goroutine coordination uses a context.Context with propagated cancellation: each stage waits for its input channel to close before stopping, guaranteeing no in-flight event is lost during a Kubernetes rolling deployment.
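The ordered drain reduces to a channel-closing cascade; this sketch omits the context.Context and timeout handling the real services layer on top:

```go
package main

import "fmt"

// drainPipeline pushes `events` items through a two-stage pipeline and
// then performs the shutdown cascade: closing each stage's input channel
// lets the stage finish its backlog before closing its own output.
// Stage names mirror the text; counts prove nothing is dropped.
func drainPipeline(events int) int {
	engineIn := make(chan int, events)
	engineOut := make(chan int, events)
	done := make(chan int)

	// Rule Engine stage: drains engine_in fully, then closes engine_out.
	go func() {
		for ev := range engineIn {
			engineOut <- ev // every in-flight event is forwarded
		}
		close(engineOut) // only after engine_in is empty and closed
	}()

	// Dispatcher stage: drains engine_out fully, then reports its count.
	go func() {
		n := 0
		for range engineOut {
			n++
		}
		done <- n
	}()

	for i := 0; i < events; i++ {
		engineIn <- i
	}
	close(engineIn) // SIGTERM: stop accepting, start the cascade
	return <-done
}

func main() {
	fmt.Println(drainPipeline(42)) // 42: no event lost
}
```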

Challenge 2 – JSONB Rule Evaluation Without an ORM

Conditions are stored as json.RawMessage in CockroachDB. The evaluation engine (internal/engine/evaluator.go) dynamically decodes each condition and applies the operator on inferred types: no reflection, no allocation on the hot path.

Benchmark results:

  • Pure condition (zero allocation): 13.6 ns per evaluation
  • Full Evaluator.Evaluate() (JSON + repo lookup): 3.2 Β΅s
  • JSON unmarshal accounts for ~95% of the cost β€” the evaluation logic itself is negligible.

Challenge 3 – Integration Tests with testcontainers-go

Repository tests in internal/repository/postgres/ spin up a real CockroachDB instance via testcontainers-go, apply golang-migrate migrations, then run against a live database. Zero mocks, zero possible drift between test and production.

The rationale is defensible: unit tests with a mocked database hide migration bugs. Here, if the migration passes in CI, it passes in prod.


🔭 Observability: Tracing Every Event

Grafana Dashboard

OpenTelemetry – From Ingestion to Dispatch

Every event generates a complete span tree, propagated via the W3C traceparent header:

| Span | Key Attributes |
|------|----------------|
| ingestor.gateway.handle (root) | event_id, source, event_type |
| engine.evaluate_rules | source, total_rules, matched_count |
| dispatcher.dispatch | rule_id, sink_type, status |

Traces are exported to GCP Cloud Trace in production (via OTLP), or to Jaeger locally (docker-compose included in deployments/monitoring/). W3C propagation allows correlating a client request with its dispatch action, across all pipeline components.

Prometheus – 9 Metrics, 3 Categories

| Category | Key Metrics |
|----------|-------------|
| Ingestion | events_received_total, events_accepted_total, events_rejected_total (labels: source, reason) |
| Evaluation | rules_evaluated_total, rules_matched_total, rule_evaluation_duration_seconds |
| Dispatch | dispatch_total (success/failed), dispatch_duration_seconds, worker_queue_depth |

The pre-built Grafana dashboard (deployments/grafana/zenith-dashboard.json) visualizes the entire pipeline in a single view: throughput, acceptance rate, dispatch success rate, p95/p99 latency, and live queue depth.


☁️ Infrastructure: Cloud-Native on GCP

| Component | Technology | Role |
|-----------|------------|------|
| Runtime | Go 1.24 | Ingestor + Dispatcher binaries |
| Database | CockroachDB Serverless | ACID, distributed, PostgreSQL-compatible |
| Containers | Kubernetes (Kind local / GKE cloud) | Orchestration, HPA, resource limits |
| Cloud | Cloud Run + Artifact Registry | Scalable serverless deployment, git SHA-tagged images |
| Secrets | GCP Secret Manager | Secure injection of DATABASE_URL and API_KEY_SALT |
| CI/CD | GitHub Actions + Workload Identity | Keyless deployment, zero stored credentials |
| IaC | Terraform | Reproducible provisioning of the entire GCP infrastructure |

Cloud Architecture

Credential-Free CI/CD

The GitHub Actions pipeline authenticates CI via Workload Identity Federation: no service account key is ever stored in GitHub secrets. CI calls Terraform directly with project_id and image_tag (git SHA); the rest is handled by Secret Manager on the GCP side.

HPA – Automatic Scaling Under Load

The Horizontal Pod Autoscaler is configured on the CPU metric (60% target), with min 1 / max 5 replicas. Under increasing load:

| RPS | CPU | Replicas | Status |
|-----|-----|----------|--------|
| 200 | 40% | 1 | Nominal |
| 400 | 65% | 2 | Scale-out triggered |
| 600 | 50% / pod | 3 | Rebalanced |
| 1000 | 45% / pod | 5 | Maximum capacity |

Time for a new replica to become operational: ~45 seconds (15s HPA check + 10s pod startup + readiness probe).
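The configuration described above corresponds roughly to a manifest like this; the resource names and Deployment reference are assumptions, only the CPU target and replica bounds come from the text:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: zenith-ingestor   # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: zenith-ingestor # illustrative Deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```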


📊 Performance: The Numbers

Benchmark

Rule Engine

| Measurement | Result | Allocations |
|-------------|--------|-------------|
| Pure condition (1 rule) | 13.6 ns | 0 |
| Full Evaluator.Evaluate() | 3.2 µs | 30 |
| 100 rules evaluated | 14.0 µs (condition) / 98.9 µs (full) | 357 / 1157 |

Zero allocation on the hot path means adding new operators or condition types won't degrade performance: the evaluation logic is fully decoupled from allocation cost.

Gateway

The real bottleneck isn't CPU; it's the CockroachDB round-trip (~10 ms) for ListBySourceID() on every event. At 50 RPS, p50 latency is 50–100 ms. HPA kicks in before queues saturate.

REST vs gRPC

Both protocols use HTTP/2 via h2c (ConnectRPC). The measured latency difference is < 1%: Protocol Buffers serialization cost is negligible against the database round-trip. The choice between REST and gRPC depends on client constraints, not server performance.


🚀 What's Next: Decoupling & Message Broker

The current architecture couples the Ingestor and Rule Engine in a single process via channels. This is optimal for latency, but scaling ingestion also scales evaluation; the two components are linked.

Decoupling via message broker:

[Ingestor] → [NATS / Kafka / GCP Pub/Sub] → [Rule Engine] → [Dispatcher]

Tradeoff: +~100 ms latency (network hop + buffering) in exchange for fully independent horizontal scaling. Above a certain scale (>1000 sustained RPS), this tradeoff becomes obvious.

Other planned improvements: in-memory rule caching (eliminates the per-event DB round-trip), composite rules (AND/OR conditions), and multi-sink evaluation on a single event.


Technical expertise: Go (goroutines, channels, interfaces), ConnectRPC/gRPC, Protocol Buffers, CockroachDB, raw SQL (pgx/v5), OpenTelemetry, Prometheus, Grafana, Kubernetes (CKAD), Terraform, GCP (Cloud Run, Secret Manager, Cloud Trace, Artifact Registry, Workload Identity), GitHub Actions, testcontainers-go, Docker, golang-migrate.