2024 · DevOps · Observability · Production

NIMBUS

Observability that respects your time.

A high-cardinality metrics + log + trace platform built for engineering teams that find Datadog too expensive and Grafana too DIY. Sane defaults, exceptional ergonomics.

Duration
18 weeks
Team
2 engineers
My role
Backend lead + UI
Outcome
Events per second
1.2M
peak load
Customer teams
340
MTTR reduction
−68%
Storage cost
−54%
The story

From brief to production system.

Challenge

Mid-size teams are stuck: Datadog's costs scale faster than usage, and the Grafana stack demands a dedicated platform engineer. Logs and metrics live in silos. Incident timelines are stitched together manually in Slack.

Solution

ClickHouse-backed columnar storage with Kafka ingestion. Trace-correlated logs by default. AI-assisted incident timelines that auto-stitch deploys, alerts, and Slack chatter into a single audit trail. PromQL + LogQL compatible query layer.

Outcome

MTTR dropped 68% across 340 customer teams. Storage cost reduced 54% vs equivalent Datadog usage. Used by 3 YC-backed startups + 1 unicorn DevOps team. Now processing 1.2M events/sec at peak.

Process · 18 weeks

How it shipped, week by week.

Week 1-3
01 / 5

Architecture + ClickHouse PoC

Benchmarked ClickHouse against TimescaleDB and InfluxDB; ClickHouse won on storage compression (4.2x) and query speed at our cardinality.

Week 4-8
02 / 5

Ingestion + storage

Spring Boot ingestion gateway. Kafka buffer for back-pressure tolerance. Schema design optimized for compression — saved 54% vs naive layout.

Week 9-13
03 / 5

Query layer + dashboards

Custom planner translating PromQL → ClickHouse SQL. React dashboarding with 22 widget types. Chose ECharts over D3 for rendering performance at this scale.
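To make the planner's job concrete: a simple instant-vector selector such as `http_requests_total{job="api"}` maps onto a filtered scan of the metrics table. The sketch below is a hypothetical minimal version of that translation step, assuming the `labels` Map column from the schema excerpt shown further down; the real planner also handles range vectors, functions, and aggregations.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the PromQL -> ClickHouse translation step.
// The class and method names are assumptions, not the production planner.
public class PromqlToSql {

    /** Translate a simple instant-vector selector into ClickHouse SQL. */
    public static String translateSelector(String metricName, Map<String, String> labelMatchers) {
        // Each label matcher becomes an equality check on the labels Map column.
        String labelFilter = labelMatchers.entrySet().stream()
            .map(e -> "labels['" + e.getKey() + "'] = '" + e.getValue() + "'")
            .collect(Collectors.joining(" AND "));

        StringBuilder sql = new StringBuilder()
            .append("SELECT timestamp, value FROM metrics")
            .append(" WHERE metric_name = '").append(metricName).append("'");
        if (!labelFilter.isEmpty()) {
            sql.append(" AND ").append(labelFilter);
        }
        return sql.append(" ORDER BY timestamp").toString();
    }
}
```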

Week 14-16
04 / 5

AI incident timelines

Stitched alerts, deploys, and Slack chatter into auto-generated post-mortems, using Claude for summarization. Saved on-call engineers an average of 40 minutes per incident.
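Conceptually, the stitching step is a merge of three event streams into one chronological record before summarization. A minimal sketch, assuming a `TimelineEntry` shape that the real system may not use:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Illustrative sketch of timeline stitching; the record shape is an assumption.
public class IncidentTimeline {

    public record TimelineEntry(Instant at, String source, String summary) {}

    /** Merge alerts, deploys, and Slack messages into one chronological timeline. */
    public static List<TimelineEntry> stitch(List<TimelineEntry> alerts,
                                             List<TimelineEntry> deploys,
                                             List<TimelineEntry> slack) {
        return Stream.of(alerts, deploys, slack)
            .flatMap(List::stream)
            .sorted(Comparator.comparing(TimelineEntry::at))
            .toList();
    }
}
```

The merged list is what a summarization pass (Claude, in this project's case) would then condense into a post-mortem narrative.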

Week 17-18
05 / 5

Production rollout

Migrated 340 customer teams over a phased rollout. Zero data loss. Cut over from the legacy stack in three days.

Inside the system

What it does. How it's built.

Features

  • Unified metrics + logs + traces
  • AI-assisted incident timeline reconstruction
  • Cost-aware retention policies (hot → warm → cold)
  • Slack-native alerting + chatops
  • OpenTelemetry-first ingestion
  • PromQL + LogQL compatible query layer
  • Custom dashboarding with 22 widget types
  • On-call runbook embedded in alerts

Architecture

  • 01 · Spring Boot ingestion gateway
  • 02 · Kafka for buffered event streams (3 brokers, 18 partitions)
  • 03 · ClickHouse cluster (3-node) for storage
  • 04 · React + ECharts dashboards
  • 05 · Custom query planner translating PromQL → SQL
  • 06 · Deployed on AWS ECS Fargate
  • 07 · S3 for cold storage + Athena for archival queries
  • 08 · Slack bot built on the Bolt SDK
Stack

Java · Spring Boot · React · ClickHouse · Kafka · Docker · AWS ECS · Prometheus · OpenTelemetry
From the codebase

Annotated excerpts.

01 · Ingestion gateway: routes events to Kafka topics with backpressure-aware sampling.
ingest/EventRouter.java
@Component
public class EventRouter {
    private final KafkaTemplate<String, Event> kafka;
    private final SamplingPolicy sampler;

    public CompletableFuture<Ack> route(Event event) {
        if (!sampler.accept(event)) {
            // Sampled out under load: acknowledge the drop without touching Kafka.
            return CompletableFuture.completedFuture(Ack.dropped(event.id()));
        }

        var topic = switch (event.kind()) {
            case METRIC -> "metrics.raw";
            case LOG    -> "logs.raw";
            case TRACE  -> "traces.raw";
        };

        return kafka.send(topic, event.tenantId(), event)
            .thenApply(r -> Ack.accepted(event.id(), r.getRecordMetadata().offset()))
            .exceptionally(ex -> Ack.failed(event.id(), ex));
    }
}
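The `SamplingPolicy` the router consults isn't shown; one plausible shape for "backpressure-aware sampling" is a fixed-window rate cap that drops events once the window's budget is spent. This is an illustrative sketch, not the production policy: the class name and numbers are assumptions, and the parameterless `accept()` elides the per-event keying (e.g. per tenant) the real policy would need.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fixed-window sampler: accept until the window budget is
// exhausted, then drop. Window rollover is approximate under concurrency.
public class WindowedSamplingPolicy {
    private final long maxEventsPerWindow;
    private final long windowMillis;
    private final AtomicLong windowStart = new AtomicLong();
    private final AtomicLong accepted = new AtomicLong();

    public WindowedSamplingPolicy(long maxEventsPerWindow, long windowMillis) {
        this.maxEventsPerWindow = maxEventsPerWindow;
        this.windowMillis = windowMillis;
        this.windowStart.set(System.currentTimeMillis());
    }

    /** Accept until the current window's budget is spent, then drop. */
    public boolean accept() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start >= windowMillis && windowStart.compareAndSet(start, now)) {
            accepted.set(0); // new window: reset the budget
        }
        return accepted.incrementAndGet() <= maxEventsPerWindow;
    }
}
```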
02 · ClickHouse schema tuned for cardinality + compression.
schema/metrics.sql
CREATE TABLE metrics (
    tenant_id   LowCardinality(String),
    metric_name LowCardinality(String),
    timestamp   DateTime64(3),
    value       Float64,
    labels      Map(LowCardinality(String), String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (tenant_id, metric_name, timestamp)
TTL timestamp + INTERVAL 30 DAY TO VOLUME 'cold',
    timestamp + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'tiered';
What the client said
We were burning $14k/month on a managed observability stack. Ali designed and shipped a self-hosted alternative that's now cheaper, faster, and easier to query. Paid for itself in 60 days.
Priya Nair
Head of Engineering · Ledgerline
Other projects

Continue browsing

Have a project like this in mind? Let's talk.

Send me a brief and I'll respond within 24 hours.

← Home · © 2025 Ali Razzaq · Contact →