2024 · DevOps · Observability · Production

NIMBUS

Observability that respects your time.

A high-cardinality metrics + log + trace platform built for engineering teams that find Datadog too expensive and Grafana too DIY. Sane defaults, exceptional ergonomics.

Duration
18 weeks
Team
2 engineers
My role
Backend lead + UI
Outcome
Events per second
1.2M
peak load
Customer teams
340
MTTR reduction
−68%
Storage cost
−54%
The story

From brief to production system.

Challenge

Mid-size teams are stuck: Datadog's costs scale faster than usage, and the Grafana stack demands a dedicated platform engineer. Logs and metrics live in silos. Incident timelines are stitched together manually in Slack.

Solution

ClickHouse-backed columnar storage with Kafka ingestion. Trace-correlated logs by default. AI-assisted incident timelines that auto-stitch deploys, alerts, and Slack chatter into a single audit trail. PromQL + LogQL compatible query layer.

Outcome

MTTR dropped 68% across 340 customer teams. Storage cost reduced 54% vs equivalent Datadog usage. Used by 3 YC-backed startups + 1 unicorn DevOps team. Now processing 1.2M events/sec at peak.

Process · 18 weeks

How it shipped, week by week.

Week 1-3
01 / 5

Architecture + ClickHouse PoC

Benchmarked ClickHouse against TimescaleDB and InfluxDB; ClickHouse won on storage compression (4.2x) and query speed at our cardinality.

Week 4-8
02 / 5

Ingestion + storage

Spring Boot ingestion gateway. Kafka buffer for back-pressure tolerance. Schema design optimized for compression — saved 54% vs naive layout.

Week 9-13
03 / 5

Query layer + dashboards

Custom planner translating PromQL → ClickHouse SQL. React dashboarding with 22 widget types. Chose ECharts over D3 for rendering performance at this scale.
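To make the planner's job concrete: a simple instant-vector selector such as `http_requests_total{job="api"}` maps onto a filtered scan of the metrics table. The sketch below is a hypothetical minimal version of that translation step, assuming the `labels` Map column from the schema excerpt shown further down; the real planner also handles range vectors, functions, and aggregations.

```java
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of the PromQL -> ClickHouse translation step.
// The class and method names are assumptions, not the production planner.
public class PromqlToSql {

    /** Translate a simple instant-vector selector into ClickHouse SQL. */
    public static String translateSelector(String metricName, Map<String, String> labelMatchers) {
        // Each label matcher becomes an equality check on the labels Map column.
        String labelFilter = labelMatchers.entrySet().stream()
            .map(e -> "labels['" + e.getKey() + "'] = '" + e.getValue() + "'")
            .collect(Collectors.joining(" AND "));

        StringBuilder sql = new StringBuilder()
            .append("SELECT timestamp, value FROM metrics")
            .append(" WHERE metric_name = '").append(metricName).append("'");
        if (!labelFilter.isEmpty()) {
            sql.append(" AND ").append(labelFilter);
        }
        return sql.append(" ORDER BY timestamp").toString();
    }
}
```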

Week 14-16
04 / 5

AI incident timelines

Stitched alerts, deploys, and Slack chatter into auto-generated post-mortems, using Claude for summarization. Saved on-call engineers an average of 40 minutes per incident.
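Conceptually, the stitching step is a merge of three event streams into one chronological record before summarization. A minimal sketch, assuming a `TimelineEntry` shape that the real system may not use:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

// Illustrative sketch of timeline stitching; the record shape is an assumption.
public class IncidentTimeline {

    public record TimelineEntry(Instant at, String source, String summary) {}

    /** Merge alerts, deploys, and Slack messages into one chronological timeline. */
    public static List<TimelineEntry> stitch(List<TimelineEntry> alerts,
                                             List<TimelineEntry> deploys,
                                             List<TimelineEntry> slack) {
        return Stream.of(alerts, deploys, slack)
            .flatMap(List::stream)
            .sorted(Comparator.comparing(TimelineEntry::at))
            .toList();
    }
}
```

The merged list is what a summarization pass (Claude, in this project's case) would then condense into a post-mortem narrative.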

Week 17-18
05 / 5

Production rollout

Migrated 340 customer teams over a phased rollout. Zero data loss. Cut over from the legacy stack in three days.

Inside the system

What it does. How it's built.

Features

  • Unified metrics + logs + traces
  • AI-assisted incident timeline reconstruction
  • Cost-aware retention policies (hot → warm → cold)
  • Slack-native alerting + chatops
  • OpenTelemetry-first ingestion
  • PromQL + LogQL compatible query layer
  • Custom dashboarding with 22 widget types
  • On-call runbook embedded in alerts

Architecture

  • 01 · Spring Boot ingestion gateway
  • 02 · Kafka for buffered event streams (3 brokers, 18 partitions)
  • 03 · ClickHouse cluster (3-node) for storage
  • 04 · React + ECharts dashboards
  • 05 · Custom query planner translating PromQL → SQL
  • 06 · Deployed on AWS ECS Fargate
  • 07 · S3 for cold storage + Athena for archival queries
  • 08 · Slack bot built on the Bolt SDK
Stack

Java · Spring Boot · React · ClickHouse · Kafka · Docker · AWS ECS · Prometheus · OpenTelemetry
From the codebase

Annotated excerpts.

01 · Ingestion gateway: routes events to Kafka topics with backpressure-aware sampling.
ingest/EventRouter.java
@Component
public class EventRouter {
    private final KafkaTemplate<String, Event> kafka;
    private final SamplingPolicy sampler;

    public CompletableFuture<Ack> route(Event event) {
        if (!sampler.accept(event)) {
            // Sampled out under load: acknowledge the drop without touching Kafka.
            return CompletableFuture.completedFuture(Ack.dropped(event.id()));
        }

        var topic = switch (event.kind()) {
            case METRIC -> "metrics.raw";
            case LOG    -> "logs.raw";
            case TRACE  -> "traces.raw";
        };

        return kafka.send(topic, event.tenantId(), event)
            .thenApply(r -> Ack.accepted(event.id(), r.getRecordMetadata().offset()))
            .exceptionally(ex -> Ack.failed(event.id(), ex));
    }
}
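The `SamplingPolicy` the router consults isn't shown; one plausible shape for "backpressure-aware sampling" is a fixed-window rate cap that drops events once the window's budget is spent. This is an illustrative sketch, not the production policy: the class name and numbers are assumptions, and the parameterless `accept()` elides the per-event keying (e.g. per tenant) the real policy would need.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fixed-window sampler: accept until the window budget is
// exhausted, then drop. Window rollover is approximate under concurrency.
public class WindowedSamplingPolicy {
    private final long maxEventsPerWindow;
    private final long windowMillis;
    private final AtomicLong windowStart = new AtomicLong();
    private final AtomicLong accepted = new AtomicLong();

    public WindowedSamplingPolicy(long maxEventsPerWindow, long windowMillis) {
        this.maxEventsPerWindow = maxEventsPerWindow;
        this.windowMillis = windowMillis;
        this.windowStart.set(System.currentTimeMillis());
    }

    /** Accept until the current window's budget is spent, then drop. */
    public boolean accept() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start >= windowMillis && windowStart.compareAndSet(start, now)) {
            accepted.set(0); // new window: reset the budget
        }
        return accepted.incrementAndGet() <= maxEventsPerWindow;
    }
}
```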
02 · ClickHouse schema tuned for cardinality + compression.
schema/metrics.sql
CREATE TABLE metrics (
    tenant_id   LowCardinality(String),
    metric_name LowCardinality(String),
    timestamp   DateTime64(3),
    value       Float64,
    labels      Map(LowCardinality(String), String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (tenant_id, metric_name, timestamp)
TTL timestamp + INTERVAL 30 DAY TO VOLUME 'cold',
    timestamp + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'tiered';
What the client said
We were burning $14k/month on a managed observability stack. Ali designed and shipped a self-hosted alternative that's now cheaper, faster, and easier to query. Paid for itself in 60 days.
Priya Nair
Head of Engineering · Ledgerline
Other projects

Continue browsing

Have a project like this in mind? Let's talk.

Send me a brief and I'll respond within 24 hours.

← Home · © 2025 Ali Razzaq · Contact →