
Datadog

Discover posts written by Datadog’s very own engineers.

62 posts
Datadog

Inside Husky’s query engine: Real-time access to 100 trillion events

A technical deep-dive into Datadog’s Husky event-store query engine designed for interactive querying over 100+ trillion events. The post explains Husky’s multi-service query path (planner, orchestrator, metadata service, reader), storage layout (fragments, row groups, text-search segments), execution model (iterator-based, lazy decoding), extensive pruning and multi-layer caches (result, blob-range, predicate), routing via shuffle sharding for affinity/isolation, and streaming partial results to mitigate latency tails. It includes metrics for cache hit rates and pruning effectiveness and notes future moves toward Arrow/Parquet/Substrait interoperability.
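
For readers unfamiliar with shuffle sharding, here is a minimal, generic sketch of the idea (not Husky's routing code; all names are hypothetical): hash each tenant to a small, stable subset of reader nodes so tenants keep affinity while noisy neighbors stay mostly isolated.

```python
import hashlib

def shuffle_shard(tenant_id: str, nodes: list[str], shard_size: int = 4) -> list[str]:
    """Pick a small, stable subset of reader nodes for a tenant.

    Each (tenant, node) pair gets a deterministic pseudo-random rank, so a
    tenant always lands on the same nodes (affinity) while different tenants
    mostly land on different subsets (isolation)."""
    def rank(node: str) -> int:
        digest = hashlib.sha256(f"{tenant_id}:{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(nodes, key=rank)[:shard_size]

readers = [f"reader-{i}" for i in range(32)]
print(shuffle_shard("tenant-a", readers))  # stable subset for tenant-a
print(shuffle_shard("tenant-b", readers))  # largely disjoint subset for tenant-b
```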

Datadog

From hand-tuned Go to self-optimizing code: Building BitsEvolve

Datadog describes how manual, low-level Go optimizations produced measurable CPU and cost savings, and how those lessons seeded BitsEvolve, an agentic, evolutionary system that uses LLMs, benchmarking, and observability data to automatically evolve faster code. The post covers concrete wins (ASCII fast paths, CRC32, SecureWrite), the design of BitsEvolve, integration with production observability (Live Debugger, SLOs, Eppo), and a Simba prototype for integrating Rust SIMD with Go, and it shares lessons about realistic benchmarks and automating performance work.
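
To give a flavor of the ASCII fast-path pattern mentioned above (sketched in Python rather than the post's Go, and purely illustrative): check cheaply for pure-ASCII input and only fall back to the Unicode-aware path when needed.

```python
def lowercase(data: bytes) -> bytes:
    """Illustrative ASCII fast path: if every byte is below 0x80, a cheap
    byte-wise lowercase suffices; otherwise fall back to the slower,
    Unicode-aware conversion."""
    if all(b < 0x80 for b in data):                        # fast path: pure ASCII
        return data.lower()                                # bytes.lower() only maps A-Z
    return data.decode("utf-8").lower().encode("utf-8")    # slow Unicode path

print(lowercase(b"Hello, Datadog!"))         # takes the fast path
print(lowercase("Grüße".encode("utf-8")))    # falls back to the Unicode path
```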

Datadog

Scaling down to speed up: How we improved efficiency of live process metrics by 100x

Datadog re-architected its real-time Processes and Containers telemetry pipeline by switching from tenant-wide 2s data collection to a host-subscription model (only hosts currently in view, up to ~50 per user). They also unified sorting to use 10s metrics, published subscriptions over Kafka to an intake service with an in-memory TTL cache, and moved filtering upstream to intake. These changes cut real-time ingestion roughly 100x (peak messages/s from ~500k to ~5k), reduced live-server memory by ~85% and CPU by ~33%, lowered infrastructure footprint by ~98%, and reduced Datadog Agent CPU by ~2%.
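
A minimal sketch of the subscription idea (hypothetical class, not the intake service's code): keep a TTL cache of host subscriptions and forward real-time payloads only for hosts someone is actively viewing.

```python
import time

class SubscriptionCache:
    """Hypothetical in-memory TTL cache: intake only forwards real-time (2s)
    payloads for hosts a user is currently viewing; entries expire unless the
    frontend keeps refreshing the subscription."""

    def __init__(self, ttl_seconds: float = 30.0):
        self.ttl = ttl_seconds
        self._expires: dict[str, float] = {}   # host_id -> expiry timestamp

    def subscribe(self, host_id: str) -> None:
        self._expires[host_id] = time.monotonic() + self.ttl

    def is_subscribed(self, host_id: str) -> bool:
        expiry = self._expires.get(host_id)
        if expiry is None:
            return False
        if expiry < time.monotonic():
            del self._expires[host_id]          # lazily evict expired entries
            return False
        return True
```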

Datadog

Evolving our real-time timeseries storage again: Built in Rust for performance at scale

Datadog describes the design and implementation of Monocle, a new Rust-based real-time timeseries storage engine (LSM-tree) that unifies scalar and distribution metrics. Key innovations include shard-per-core workers, private memtables, a unified per-series cache, tiered compaction, a shared radix-tree aggregation buffer for DDSketch, and other optimizations that produced large improvements in ingestion and query performance. The post also covers prior generations, rollout experience, and future work on routing and indexing.

Datadog

How Go 1.24's Swiss Tables saved us hundreds of gigabytes

Datadog investigated a Go 1.24 memory regression and discovered that Go 1.24's new Swiss Tables and extendible hashing dramatically reduce memory used by large maps compared with Go 1.23's bucket-based implementation. They measured and explained the runtime-level changes (control words, groups, higher load factor, elimination of overflow buckets), estimated memory for their shardRoutingCache map (3.5M entries), and observed ~500 MiB live-heap savings per process. They further reduced memory by shrinking a Response struct (removing unused fields and switching an enum to uint8), validated results with live heap profiles, discussed operational options (adjusting k8s limits, GOMEMLIMIT), and emphasized the value of metrics and collaborating with the Go community.
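
As a back-of-envelope illustration of why load factor dominates memory for a map of this size (the per-entry size and the old layout's effective utilization below are assumptions, so the result is illustrative and does not reproduce the post's ~500 MiB figure):

```python
entries = 3_500_000        # roughly the shardRoutingCache size cited in the post
entry_bytes = 64           # assumed key + value footprint per slot (illustrative)

def map_bytes(load_factor: float) -> float:
    """Slots needed = entries / load factor; each slot also carries a byte of
    per-slot metadata (a Swiss Tables control byte, or old-bucket tophash)."""
    slots = entries / load_factor
    return slots * (entry_bytes + 1)

old_layout = map_bytes(0.65)     # assumed effective utilization with overflow buckets
swiss_tables = map_bytes(0.875)  # Go 1.24 Swiss Tables target load factor (7/8)
print(f"~{(old_layout - swiss_tables) / 2**20:.0f} MiB less per map instance")
```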

Datadog

How we tracked down a Go 1.24 memory regression across hundreds of pods

Datadog engineers investigated a ~20% memory increase after upgrading services to Go 1.24: system RSS rose while Go runtime metrics remained stable. By examining /proc/[pid]/smaps and heap profiles, and by using heapbench and git bisect, they traced the regression to a mallocgc refactor that removed an optimization, causing unnecessary zeroing of large pointer-containing allocations and committing more virtual pages to RAM. The issue was reported and fixed upstream (targeted for Go 1.25). The post explains the diagnostic steps and rollout implications; a follow-up covers the Swiss Tables benefits.

Datadog

How we built reliable log delivery to thousands of unpredictable endpoints

Datadog describes the Log Forwarding architecture that reliably sends logs from Kafka to many external endpoints. They solve Kafka ordering and blocking issues by splitting the pipeline into stager, planner, and shipper services, storing grouped, compressed "slice" files in cloud object storage, tracking slices in a metadata store, and using per-destination adaptive concurrency and retry logic to provide failure isolation, low latency, and scalable throughput.
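
The per-destination adaptive concurrency can be illustrated with a simple AIMD-style limiter (a sketch under assumed semantics, not the shipper's actual algorithm):

```python
class AdaptiveConcurrency:
    """Hypothetical per-destination concurrency limiter: grow the limit slowly
    while a destination is healthy, cut it sharply on failures or timeouts so
    one slow endpoint cannot consume the shipper's capacity."""

    def __init__(self, initial: int = 4, floor: int = 1, ceiling: int = 64):
        self.limit = initial
        self.floor = floor
        self.ceiling = ceiling

    def on_success(self) -> None:
        self.limit = min(self.ceiling, self.limit + 1)   # additive increase

    def on_failure(self) -> None:
        self.limit = max(self.floor, self.limit // 2)    # multiplicative decrease
```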

Datadog

How we scaled fast, reliable configuration distribution to thousands of workload containers

Datadog describes the design and evolution of a "context-publisher" system for distributing per-tenant configuration to tens of thousands of workload containers. After scaling and reliability issues with DB reads and Kafka invalidations, they moved to a hybrid model: periodically publish full RocksDB blobs to object storage and stream per-tenant updates via Kafka; workloads download a blob at startup, keep a local RocksDB replica, and apply streamed updates—reducing central DB load while providing low-latency, reliable updates.
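
A minimal sketch of the workload-side replica (hypothetical names, not the context-publisher's API): bootstrap from a full snapshot blob, then apply per-tenant updates from the stream so reads never hit the central database.

```python
from typing import Optional

class ConfigReplica:
    """Hypothetical workload-side replica: bootstrap from a full snapshot blob,
    then apply per-tenant updates streamed over Kafka, so reads stay local."""

    def __init__(self, snapshot: dict):
        self._store = dict(snapshot)          # stands in for the local RocksDB replica

    def apply_update(self, tenant_id: str, config: Optional[dict]) -> None:
        if config is None:
            self._store.pop(tenant_id, None)  # tombstone: tenant config removed
        else:
            self._store[tenant_id] = config

    def get(self, tenant_id: str) -> Optional[dict]:
        return self._store.get(tenant_id)     # no central DB read on the hot path
```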

Datadog

Breaking up a monolith: How we’re unwinding a shared database at scale

Datadog describes a multi-year, three-phase program to split a large shared Postgres database into independently owned instances. The post explains how they defined domain boundaries, reduced direct access, used proxies and continuous replication for cutovers, built automation (OrgStore, PG Proxy, Rapid) to make migrations safe and scalable, tracked progress with DBM metrics and incidents, and shared lessons learned.

Datadog

Achieving relentless Kafka reliability at scale with the Streaming Platform

Datadog built a Streaming Platform control plane over Kafka to enable multi-cluster, real-time reliability at massive scale. Key components include Streams (abstractions spanning clusters), a real-time coordinator called the Assigner for failovers and rebalancing, stream lanes and an enhanced commit log to avoid head-of-line blocking, enriched metadata for time-lag monitoring, and a custom high-performance Rust client (libstreaming) to handle scale and compression edge cases.

Datadog

Husky: Efficient compaction at Datadog scale

Datadog describes Husky's storage and compaction design for massive-scale observability data. Writers emit small, sorted columnar fragments to object storage, with metadata tracked in FoundationDB. Compactors perform streaming k-way merges using a custom columnar format with bounded row groups to control memory, apply size-tiered and locality-based (LSM-like) compaction within time buckets, and use per-fragment automata-derived regexes for predicate pruning. The design balances compaction cost, query latency, and parallelism to reduce object-store fetches, CPU, and memory usage, and it achieved significant operational cost savings.
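
A minimal sketch of the streaming k-way merge at the heart of compaction (generic Python, not Husky's columnar implementation): each input fragment is an iterator of rows sorted by key, and the merge holds only one row per fragment in memory at a time.

```python
import heapq

def compact(fragments):
    """Streaming k-way merge: fragments are iterators of rows already sorted by
    key, so the compactor emits one merged, sorted stream without materializing
    any fragment in full; a downstream writer would cut bounded row groups."""
    yield from heapq.merge(*fragments, key=lambda row: row["key"])

frag_a = iter([{"key": 1, "v": "a"}, {"key": 3, "v": "c"}])
frag_b = iter([{"key": 2, "v": "b"}, {"key": 4, "v": "d"}])
print([row["key"] for row in compact([frag_a, frag_b])])   # [1, 2, 3, 4]
```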

Datadog

Unraveling a Postgres segfault that uncovered an Arm64 JIT compiler bug

Datadog engineers investigated Postgres processes crashing with segmentation faults and traced the root cause to incorrect JIT-generated code on Arm64. By reproducing the issue on Arm64 instances, building debug LLVM and Postgres, and stepping through assembly, they discovered that LLVM's RuntimeDyld allocated .text and .rodata sections too far apart (≈10.3 GiB), exceeding ADRP/relocation limits for AArch64 and producing broken branch tables. The team mitigated the problem by disabling JIT on Arm64 and worked upstream to land a Postgres patch that uses a fixed memory manager for ELF section allocation; fixes were released for multiple Postgres versions.

Datadog

Effective habits of remote workers

A practical guide for remote workers at predominantly office-based companies, offering actionable habits: overcommunicate, maintain a strong virtual presence (camera, eye contact, speaking up), proactively build relationships (virtual coffee, team events), stay visible without being annoying, schedule regular in-person time, and structure your calendar to avoid communication fatigue. The article is experience-based advice from a Datadog employee and includes a brief hiring note at the end.

Datadog

How we use formal modeling, lightweight simulations, and chaos testing to design reliable distributed systems

Datadog engineers describe how they used formal modeling (TLA+), lightweight simulations (SimPy), and chaos testing to design and validate Courier, a multi-tenant message queuing service. The article covers Courier's requirements (multi-tenancy, at-least-once delivery, graceful degradation, horizontal scalability), the TLA+ model and model-checking results (NoLostMsgs), simulations of FoundationDB cluster failures, findings from chaos and performance tests, and implementation trade-offs (FoundationDB hot-shard workarounds, adding a sequencer). It also discusses challenges and next steps around deterministic simulations and keeping models up to date.
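
As a toy companion to the lightweight-simulation idea (deliberately far simpler than the post's SimPy models, with made-up parameters): redeliver each message until it is acknowledged, then check a NoLostMsgs-style property afterward.

```python
import random

def simulate(num_messages: int = 1000, ack_drop_rate: float = 0.2, seed: int = 7) -> int:
    """Toy at-least-once delivery simulation: messages are redelivered until
    acknowledged, so nothing is lost, though some messages repeat."""
    random.seed(seed)
    deliveries: dict[int, int] = {}
    for msg in range(num_messages):
        acked = False
        while not acked:
            deliveries[msg] = deliveries.get(msg, 0) + 1   # consumer receives msg
            acked = random.random() >= ack_drop_rate        # ack may be lost; retry
    assert set(deliveries) == set(range(num_messages))      # NoLostMsgs-style check
    return sum(1 for n in deliveries.values() if n > 1)     # duplicate count

print(f"{simulate()} of 1000 messages were delivered more than once")
```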

Datadog

How we built a Ruby library that saves 50% in testing time

Datadog engineers describe building a Ruby native extension to collect per-test "test impact" data for a selective test runner (Intelligent Test Runner). They evaluated existing approaches (Coverage module, TracePoint), then implemented a performant C-extension using interpreter and allocation events, added optimizations (rb_profile_frames, pointer caching, allocation-based class tracking) to handle edge cases like "code-less" classes, and reduced median overhead to about 25%, enabling substantially faster CI by skipping irrelevant tests. The tool is open source and integrated into datadog-ci.
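
The selection side of test-impact analysis can be sketched in a few lines (illustrative Python, not the Ruby library's API; names are hypothetical): given per-test coverage recorded by the extension, run only tests whose covered files intersect the change set.

```python
def impacted_tests(per_test_files: dict[str, set[str]],
                   changed_files: set[str]) -> set[str]:
    """Hypothetical selection step: a test is impacted if any file it touched
    during its last run is in the current change set; everything else can be
    skipped by the selective runner."""
    return {test for test, files in per_test_files.items() if files & changed_files}

coverage_map = {
    "test_checkout": {"app/checkout.rb", "lib/money.rb"},
    "test_signup":   {"app/signup.rb"},
}
print(impacted_tests(coverage_map, {"lib/money.rb"}))   # {'test_checkout'}
```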

Datadog

Timeseries indexing at scale

Datadog describes reengineering its timeseries indexing service: moving from an automatically generated selective-index approach to an always-on inverted index (metric;tag -> timeseries IDs) backed by RocksDB, adding intranode sharding to parallelize query work, and rewriting the service from Go to Rust. These changes improved worst-case query predictability, enabled much higher-cardinality queries, reduced tail latency and timeouts, and lowered cost.
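
A minimal in-memory sketch of the metric;tag inverted index (hypothetical names, not the RocksDB-backed implementation): each posting list maps a metric;tag key to the set of timeseries IDs, and a query intersects the posting lists of the tags it filters on.

```python
from collections import defaultdict

class InvertedIndex:
    """Hypothetical inverted index: metric;tag -> set of timeseries IDs."""

    def __init__(self):
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, metric: str, tags: list[str], series_id: int) -> None:
        for tag in tags:
            self._postings[f"{metric};{tag}"].add(series_id)

    def query(self, metric: str, tags: list[str]) -> set[int]:
        lists = [self._postings.get(f"{metric};{tag}", set()) for tag in tags]
        return set.intersection(*lists) if lists else set()

idx = InvertedIndex()
idx.add("system.cpu.user", ["host:web-1", "env:prod"], series_id=1)
idx.add("system.cpu.user", ["host:web-2", "env:prod"], series_id=2)
print(idx.query("system.cpu.user", ["env:prod"]))                 # {1, 2}
print(idx.query("system.cpu.user", ["env:prod", "host:web-1"]))   # {1}
```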

Datadog

How we migrated our static analyzer from Java to Rust

Datadog engineers explain migrating their static analyzer from Java to Rust to leverage Tree-sitter’s Rust ecosystem, improve parsing and rule-execution performance, and remove the JVM dependency. The rewrite (including moving JS rule execution to deno-core and using Rust crates like rayon, serde_yaml, reqwest) produced ~3× faster scans and ~10× memory reduction, allowed embedding in IDEs, and required the team to learn Rust-specific concepts and tooling. The post covers architecture, library mappings, lessons learned, and next steps in the Tree-sitter ecosystem.

Datadog

.NET Continuous Profiler: Memory usage

Part 4 of Datadog’s .NET continuous profiler series: a technical deep dive into memory profiling. It describes user-facing features (GC CPU in flame graphs, allocated-memory profiles, live-object sampling) and the implementation: reading GC thread CPU from the OS, sampling allocations via AllocationTick events, reconstructing allocation call stacks and types, tracking surviving objects with weak handles via ICorProfilerInfo13 (.NET 7+), and the statistical upscaling challenges (fixed 100 KB sampling vs. Poisson-based approaches). It also discusses the trade-offs made to minimize production overhead, interactions with CLR versions, and future plans.
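
A hedged sketch of the fixed-threshold upscaling the post discusses (hypothetical helper, not the profiler's code): if AllocationTick fires after roughly every 100 KB of allocations, each sampled event stands in for about that many allocated bytes of its type.

```python
SAMPLING_THRESHOLD_BYTES = 100 * 1024   # AllocationTick fires ~every 100 KB allocated

def upscale_allocated_bytes(tick_counts: dict[str, int]) -> dict[str, int]:
    """Hypothetical fixed-threshold upscaling: estimated allocated bytes per
    type is its tick count times the threshold. (The post contrasts this with
    Poisson-based sampling, which has better statistical properties.)"""
    return {type_name: count * SAMPLING_THRESHOLD_BYTES
            for type_name, count in tick_counts.items()}

print(upscale_allocated_bytes({"System.String": 420, "MyApp.Order": 13}))
```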

Datadog

How we built the Datadog heatmap to visualize distributions over time at arbitrary scale

Datadog engineers describe how they built a scalable heatmap visualization for distribution metrics (using DDSketch), covering data representation (bins/float32), bucket alignment, color-scale design (linear, equalized, hybrid), rendering trade-offs (fillRect vs. putImageData, per-pixel rendering), interaction heuristics, and real-world examples where heatmaps reveal behaviors that percentiles hide.
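
A small sketch of the equalized color-scale idea (hypothetical function, simplified from whatever the post actually implements): map each bin count to its quantile among non-zero counts so that sparse and dense regions both remain visible, where a linear scale would let a few hot bins wash out everything else.

```python
from bisect import bisect_right

def equalized_scale(counts, levels: int = 256):
    """Map each bin count to a palette level by its rank among non-zero counts
    (a simple histogram-equalized scale); empty bins stay unpainted."""
    nonzero = sorted(c for c in counts if c > 0)
    def level(c):
        if c <= 0 or not nonzero:
            return 0
        quantile = bisect_right(nonzero, c) / len(nonzero)
        return max(1, round(quantile * (levels - 1)))
    return [level(c) for c in counts]

print(equalized_scale([0, 1, 1, 2, 500, 10000]))   # [0, 102, 102, 153, 204, 255]
```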

Datadog

How we brought Datadog's data visualization to iOS: A focus on performance

Datadog engineers describe building DogGraphs, a native Swift/SwiftUI library for iOS data visualizations. The article is a technical deep-dive into profiling and performance fixes: reducing unnecessary SwiftUI body evaluations via structural changes to the result-builder pipeline, batching many shapes by style to minimize rendering passes, splitting Views to narrow dependencies, avoiding expensive work inside View bodies, and reducing @Published churn. The result is faster, smoother rendering for complex graphs on iOS 14-compatible devices.