Netflix

Learn about Netflix's world-class engineering efforts, company culture, product developments, and more.

Data as a Product: Applying a Product Mindset to Data at Netflix

Netflix proposes treating data as first-class products and outlines core principles for doing so: clear purpose, defined users, measurable value/quality (via a Data Health effort), thoughtful design and documentation, lifecycle management, clear ownership, and building trust through reliability. The goal is to manage datasets, metrics, dashboards, and ML inputs with the same rigor as consumer products so they better enable business decisions and AI-driven innovation.

100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine

Netflix rewrote Maestro's internal flow engine as a lightweight, stateful actor model that keeps flow state in memory and runs on Java 21 virtual threads. The rewrite removed external distributed job queues and introduced flow-group partitioning along with a database-backed in-memory queue (transactional-outbox style). The redesign cut step and workflow latencies by roughly 100x, simplified the architecture, eliminated race conditions, and enabled a careful migration using a test framework and staged rollout.
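
The actor idea can be illustrated with a minimal, single-process sketch: each flow gets a mailbox, and all flows in a group are owned by one engine instance, so a flow's events are applied serially without distributed locks. Class and method names here are hypothetical, not Maestro's actual API.

```python
from collections import deque

class FlowActor:
    """Toy stateful actor: drains its mailbox serially, keeping flow state in memory."""
    def __init__(self, flow_id):
        self.flow_id = flow_id
        self.state = []      # in-memory flow state (e.g., applied events)
        self.mailbox = deque()

    def deliver(self, event):
        self.mailbox.append(event)

    def drain(self):
        while self.mailbox:
            self.state.append(self.mailbox.popleft())

class FlowGroup:
    """Partition: one engine instance owns every flow in the group, so events
    for a given flow are processed in order with no external job queue."""
    def __init__(self):
        self.actors = {}

    def route(self, flow_id, event):
        self.actors.setdefault(flow_id, FlowActor(flow_id)).deliver(event)

    def tick(self):
        for actor in self.actors.values():
            actor.drain()
```

In the real engine each actor would run on its own virtual thread; the single-threaded `tick()` loop here only demonstrates the per-flow ordering guarantee.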

Building a Resilient Data Platform with Write-Ahead Log at Netflix

Netflix built a generic Write-Ahead Log (WAL) service to provide durable capture and reliable delivery of data mutations across use cases: delayed queues/retries, cross-region replication (e.g., EVCache), and multi-partition/multi-table mutations using two-phase commit semantics. WAL is pluggable (Kafka or SQS backends), deployed via Netflix's Data Gateway infra (shards/namespaces, mTLS, Envoy), uses DLQs and durable storage, and centralizes retry/backoff, ordering, and replay capabilities to improve resilience and developer efficiency.
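
The core write-ahead contract (persist first, deliver with retry/backoff, dead-letter on exhaustion, replay later) can be sketched in a few lines. The class and its API are illustrative, not Netflix's actual service, which backs the log with Kafka or SQS rather than a Python list.

```python
import time

class WriteAheadLog:
    """Toy WAL: durably record mutations before delivery, retry with
    exponential backoff, and park failures in a DLQ for later replay."""
    def __init__(self, deliver, max_attempts=3, base_delay=0.0):
        self.log = []            # stand-in for durable storage (Kafka/SQS in the post)
        self.dlq = []
        self.deliver = deliver   # downstream target, may fail transiently
        self.max_attempts = max_attempts
        self.base_delay = base_delay

    def append(self, mutation):
        self.log.append(mutation)          # write-ahead: persist before delivering

    def flush(self):
        for mutation in self.log:
            for attempt in range(self.max_attempts):
                try:
                    self.deliver(mutation)
                    break
                except Exception:
                    time.sleep(self.base_delay * 2 ** attempt)  # centralized backoff
            else:
                self.dlq.append(mutation)  # retries exhausted -> dead-letter queue
        self.log.clear()

    def replay_dlq(self):
        """Move dead-lettered mutations back into the log for another flush."""
        self.log.extend(self.dlq)
        self.dlq.clear()
```

Centralizing retry, backoff, and replay in one service is exactly the developer-efficiency win the post claims: producers just append.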

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Netflix describes how it scaled its Muse analytics application to handle trillions of rows by changing the data model to use HyperLogLog sketches for distinct counts, migrating some query patterns from Druid to in-memory Hollow feeds, tuning Druid ingestion and query parameters, and validating changes via parallel stacks, Jupyter-based tests, and in-app comparisons. These changes reduced latencies (p99) by roughly 50% while enabling new features like audience-based filtering and outlier detection.
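
Why HyperLogLog sketches help at trillion-row scale: a distinct count is compressed into a few KB of registers that can be merged across rollups instead of rescanning raw rows. A minimal, self-contained HLL is sketched below; Muse relies on Druid's battle-tested sketch implementations, not code like this.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog for approximate distinct counts (illustrative only)."""
    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                     # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Key property for pre-aggregation: union = element-wise max."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:                       # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return est
```

With p=12 the standard error is about 1.6%, and because `merge` is lossless, partial sketches can be rolled up across time or dimensions cheaply.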

Empowering Netflix Engineers with Incident Management

Netflix describes transforming incident management from a centralized SRE function into a democratized, company-wide practice. They evaluated build-vs-buy and adopted Incident.io, built a lightweight, standardized "paved road" process, integrated internal data to automate routing and pre-fill context, invested in education and change management, and saw rapid adoption that improved ownership, learning, and response consistency across their many microservices.

Behind the Streams: Building a Reliable Cloud Live Streaming Pipeline for Netflix. Part 2.

Technical overview of Netflix's cloud native live streaming pipeline: venue-to-cloud acquisition, multi-region redundant ingest (SMPTE 2022-7 via AWS Elemental MediaConnect), cloud encoding (AVC/HEVC, HE-AAC, Dolby), a custom live packager with DRM integration, evolution from static S3 bucket origins to a media-aware live origin for performance, and an orchestration/control plane (Control Room) for automated provisioning, failover and operations.

Unlocking Dynamic Pages: The Evolution of Netflix’s Client-Server GraphQL APIs

Netflix describes how it evolved its client-server GraphQL APIs and an associated trigger/action system to enable server-driven dynamic page updates. They explain why they chose mutations over queries/subscriptions (client caching behavior and infrastructure cost), walk through schema changes (e.g., insertedSections) that leverage client cache normalization and avoid refetching expensive section payloads, and present a trigger/action model in which the server encodes context (an actionId) to determine page modifications.

From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix

The post outlines Netflix’s evolution from traditional metrics-focused data engineering to a Media ML Data Engineering specialization tailored for multi-modal media data and machine learning. It introduces the Media Data Lake, integrating vector storage (LanceDB) with Netflix’s Big Data Platform to centralize media assets, metadata, and ML-derived outputs such as embeddings. Key components include a Media Table, a standardized data model, a Pythonic Data API, UI exploration tools, and an online/offline architecture with distributed GPU/CPU batch inference. An initial “data pond” targets video/audio datasets sourced from the AMP asset and annotation system to enable early-stage model training, evaluation, and research. Media tables enable use cases like localization quality metrics, HDR restoration, narrative/content understanding (e.g., content safety signals), and multimodal vector search for experimentation and benchmarking.
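
One concrete use case above, multimodal vector search over a Media Table, reduces to nearest-neighbor lookup over stored embeddings. A pure-Python cosine-similarity sketch is shown below; the table layout and asset ids are made up, and the real system uses LanceDB rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def vector_search(query, table, k=2):
    """table: list of (asset_id, embedding) rows, loosely like a Media Table
    storing ML-derived embeddings; returns the top-k asset ids by similarity."""
    scored = sorted(table, key=lambda row: cosine(query, row[1]), reverse=True)
    return [asset_id for asset_id, _ in scored[:k]]
```

A vector store replaces this O(n) scan with an approximate index, which is what makes search over millions of media assets practical.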

ML Observability: Bringing Transparency to Payments and Beyond

Netflix describes an ML observability framework developed for payment routing: designing a logging/schema approach, defining stakeholder-focused online metrics, and applying explainability (notably SHAP) to debug and justify model-driven routing decisions (including complex multi-layer/bandit policies). The work reduced operational complexity, improved approval rates, and is being generalized via a standard data schema for reuse across models.
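
For intuition on the explainability piece: SHAP attributions are Shapley values, which can be computed exactly for a tiny model by enumerating feature coalitions. The sketch below does that enumeration (the SHAP library approximates it efficiently); the payment-routing feature names and weights are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley attribution by enumerating coalitions.
    value_fn(subset) returns the model's output using only those features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                # classic Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = (factorial(len(subset)) * factorial(n - len(subset) - 1)
                          / factorial(n))
                total += weight * (value_fn(set(subset) | {f}) - value_fn(set(subset)))
        phi[f] = total
    return phi
```

For an additive model each feature's Shapley value collapses to its own contribution, which is a handy sanity check; the value of SHAP in production is that the same decomposition still holds for non-additive models and layered bandit policies.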

Accelerating Video Quality Control at Netflix with Pixel Error Detection

Netflix built a bespoke neural-network-based system to detect pixel-level "hot" artifacts in video by processing full-resolution five-frame windows and outputting dense pixel-error maps, which are thresholded and clustered (connected-component labeling) to return artifact centroids. They trained initially with a synthetic pixel-error generator (symmetrical and curvilinear artifacts) and iteratively fine-tuned on real footage; inference runs in real time on a single GPU, dramatically reducing manual QC effort.
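
The final stage (threshold the dense error map, group touching pixels into components, report one centroid per artifact) can be sketched with a BFS flood fill; this is a generic connected-component pass, not Netflix's implementation.

```python
from collections import deque

def error_centroids(error_map, threshold=0.5):
    """Threshold a dense pixel-error map and return one (row, col) centroid per
    4-connected component of above-threshold pixels."""
    h, w = len(error_map), len(error_map[0])
    seen = [[False] * w for _ in range(h)]
    centroids = []
    for y in range(h):
        for x in range(w):
            if error_map[y][x] >= threshold and not seen[y][x]:
                comp, queue = [], deque([(y, x)])   # BFS flood fill of one component
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                        if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                                and error_map[ny][nx] >= threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                centroids.append((sum(p[0] for p in comp) / len(comp),
                                  sum(p[1] for p in comp) / len(comp)))
    return centroids
```

Returning centroids rather than raw masks keeps the QC output small and directly actionable for a human reviewer.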

Return-Aware Experimentation

Netflix presents two research papers introducing "return-aware experimentation": (1) a model of experimentation as a resource-allocation problem with dynamic programming and portfolio-level optimization to maximize long-run returns, and (2) an empirical methodology to evaluate decision rules across many weak A/B experiments, addressing the "winner's curse" via data-splitting. The work focuses on experimental design tradeoffs, how to run experiments at scale, and real-world application at Netflix to choose proxy metrics and decision rules used in practice.
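
The winner's-curse point is easy to demonstrate by simulation: when the same data both picks the "best" of many null experiments and estimates its effect, the estimate is biased upward, while scoring the winner on a held-out split is unbiased. The simulation below is my own toy illustration of the data-splitting idea, not the papers' methodology.

```python
import random

def winners_curse_demo(n_experiments=50, trials=200, se=0.03, seed=7):
    """All true effects are zero; each experiment's estimate is ~N(0, se).
    Returns (mean naive estimate of the winner, mean held-out estimate)."""
    rng = random.Random(seed)
    naive_sum = holdout_sum = 0.0
    for _ in range(trials):
        split_a = [rng.gauss(0, se) for _ in range(n_experiments)]
        split_b = [rng.gauss(0, se) for _ in range(n_experiments)]
        winner = max(range(n_experiments), key=lambda i: split_a[i])
        naive_sum += split_a[winner]    # pick AND estimate on the same split
        holdout_sum += split_b[winner]  # estimate on an independent split
    return naive_sum / trials, holdout_sum / trials
```

With 50 null experiments, the naive estimate of the winner sits well above zero (roughly `se` times the expected maximum of 50 standard normals), while the held-out estimate hovers near the true effect of zero.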

Behind the Streams: Live at Netflix. Part 1

Netflix describes the architecture and operational practices behind three years of building Live streaming: ingest via distributed broadcast operations centers, cloud-based redundant transcoding/packaging using AWS Elemental services and a custom packager/Live Origin, delivery at scale via Netflix's Open Connect CDN, HTTPS-based playback with AVC/HEVC and segment templates, and extensive real-time observability and SRE practices (Atlas, Mantis, Lumen, Kafka, Druid, synthetic load tests, failure injection, and graceful degradation).

Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow

Netflix Tudum migrated its read-path from a CQRS/event-driven pipeline (third-party CMS → ingestion → Kafka → consumer → Cassandra + KVDAL near cache) to an embedded in-memory co-located compressed object store (RAW Hollow). The change eliminated Kafka and the external KV/cache layer for Tudum read services, gave strong read-after-write consistency for previews, reduced page construction latency (~1.4s → ~0.4s) and dramatically reduced I/O and cache complexity (three years of unhydrated data ≈ 130MB compressed).

Driving Content Delivery Efficiency Through Classifying Cache Misses

Netflix's Open Connect team outlines a data-engineering framework for classifying cache misses (content vs health) to measure and improve content delivery efficiency. The system joins steering playback manifest logs with OCA server logs (emitted via Kafka across AWS regions), consolidates and enriches data, performs streaming window-based joins to compute miss metrics in near real time, and uses those metrics for monitoring, alerts, and remediation.
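
A stripped-down version of the window-based join: match each OCA log line to the manifest log line for the same request when the timestamps fall within the window, then tag the miss. The event shapes and the content-vs-health rule below are invented placeholders to show the join mechanics, not Open Connect's actual classification logic.

```python
def window_join(manifest_events, oca_events, window_s=60):
    """manifest_events: (ts, request_id, assigned_oca) from steering logs.
    oca_events: (ts, request_id, serving_oca, cache_hit) from server logs.
    Returns (request_id, classification) for events joined within the window."""
    by_req = {req_id: (ts, assigned) for ts, req_id, assigned in manifest_events}
    joined = []
    for ts, req_id, serving_oca, cache_hit in oca_events:
        if req_id not in by_req:
            continue                       # no matching manifest record
        m_ts, assigned = by_req[req_id]
        if abs(ts - m_ts) > window_s:
            continue                       # outside the join window
        if cache_hit:
            kind = "hit"
        elif serving_oca == assigned:
            kind = "content-miss"          # right server, content not cached
        else:
            kind = "health-miss"           # hypothetical: traffic steered away
        joined.append((req_id, kind))
    return joined
```

A real streaming engine would evict `by_req` entries as the window slides; the point here is only how the two log sources line up per request.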

AV1 @ Scale: Film Grain Synthesis, The Awakening

The post explains how Netflix leverages the AV1 Film Grain Synthesis feature to model and reapply film grain in two stages—using an auto-regressive pattern model to capture spatial noise correlations and a piecewise linear scaling function to adapt grain intensity to luminance. By denoising content before encoding and transmitting only model parameters alongside compressed frames, the approach preserves artistic grain while achieving up to a 66% bitrate reduction on grainy titles. Playback reconstructs the grain via block-based synthesis, masking artifacts and improving visual quality on consumer devices.
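
The piecewise linear scaling function mentioned above is just interpolation over (luminance, grain-strength) control points signaled in the bitstream. A toy version is sketched below; the control-point values are made up, and the real syntax and synthesis procedure live in the AV1 specification.

```python
import bisect

def grain_scale(luma, points):
    """Interpolate grain strength for a luminance value from sorted
    (luma, scale) control points, clamping outside the range."""
    xs = [p[0] for p in points]
    if luma <= xs[0]:
        return points[0][1]
    if luma >= xs[-1]:
        return points[-1][1]
    i = bisect.bisect_right(xs, luma)           # first point to the right
    (x0, y0), (x1, y1) = points[i - 1], points[i]
    return y0 + (y1 - y0) * (luma - x0) / (x1 - x0)
```

Because only these few parameters (plus the auto-regressive pattern coefficients) travel with the stream, the grainy "texture" costs almost no bitrate, which is where the up-to-66% savings comes from.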

Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix

Netflix describes UDA (Unified Data Architecture), a knowledge-graph-first system that lets teams model domain concepts once (via the Upper metamodel) and project those models into concrete schemas and data containers (GraphQL, Avro, Iceberg, SQL, Java). UDA uses RDF/SHACL-based representations, mappings, and projections to automate schema generation, data movement (Data Mesh, CDC), and discovery, and is already powering Primary Data Management (PDM) and an operational reporting tool (Sphere).

FM-Intent: Predicting User Session Intent with Hierarchical Multi-Task Learning

The article presents FM-Intent, a hierarchical multi-task learning extension to Netflix's foundation model that first predicts multiple session-level intent signals (short- and long-term proxies) via a Transformer encoder and attention-based aggregation, and then uses those intent embeddings to improve next-item recommendation. Offline experiments on sampled Netflix engagement data show statistically significant improvements over baselines (including TransAct and their prior FM model); the model also produces useful intent embeddings for clustering and downstream applications.

Behind the Scenes: Building a Robust Ads Event Processing Pipeline

Netflix describes the design and evolution of its ads event processing pipeline: from a pilot with Microsoft using VAST and opaque tokens, to a centralized Ads Event Publisher and an Ads Metadata Registry (key-value persistence). The pipeline uses Kafka for ingestion, Apache Flink jobs for metrics and sessionization, and Apache Druid for OLAP metrics, supporting frequency capping, billing, reporting, vendor integrations (DV/IAS/Nielsen), GDPR compliance, and future features like conversion API and QR-code events.
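
Sessionization, one of the Flink jobs' tasks, typically means splitting a user's event stream on inactivity gaps. A minimal batch version of gap-based sessionization is sketched below; the gap threshold is arbitrary and the real pipeline does this stateful grouping continuously in Flink.

```python
def sessionize(timestamps, gap_s=30):
    """Group event timestamps (seconds) into sessions: a new session starts
    whenever the gap since the previous event exceeds gap_s."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_s:
            sessions[-1].append(ts)     # continue the current session
        else:
            sessions.append([ts])       # inactivity gap -> new session
    return sessions
```

Session boundaries matter downstream: frequency capping and billing both count exposures per session rather than per raw event.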

Measuring Dialogue Intelligibility for Netflix Content

The post describes Netflix’s Dialogue Integrity Pipeline for assessing dialogue intelligibility from on-set capture through final playback. The Audio Algorithms team implemented an intelligibility measurement based on Short-Time Objective Intelligibility (STOI/eSTOI), using a speech-activity detector to extract utterances and computing signal-to-noise ratios across speech frequency bands to produce per-utterance scores in [0,1]. Measurements are applied at scale alongside loudness meters to analyze dynamic range and its effects on intelligibility. To operationalize findings for creators, Netflix collaborated with Fraunhofer IDMT and Nugen Audio to adapt a machine-learning-based intelligibility library into cross-platform VST and AAX DAW plugins (DialogCheck) that provide real-time insights during mixing.
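
The band-SNR-to-score mapping can be illustrated simply: clip each band's SNR to a floor/ceiling, normalize to [0, 1], and average across bands. This is a simplified stand-in for the STOI/eSTOI-style measure the post describes; the thresholds and the assumption of known per-band speech/noise energies are illustrative.

```python
import math

def intelligibility_score(speech_band_energy, noise_band_energy,
                          snr_floor=-15.0, snr_ceiling=15.0):
    """Per-utterance score in [0, 1] from per-band speech/noise energies:
    compute each band's SNR in dB, clip, normalize, and average."""
    scores = []
    for s, n in zip(speech_band_energy, noise_band_energy):
        snr = 10 * math.log10(s / n)                       # band SNR in dB
        snr = max(snr_floor, min(snr_ceiling, snr))        # clip to range
        scores.append((snr - snr_floor) / (snr_ceiling - snr_floor))
    return sum(scores) / len(scores)
```

In the real pipeline a speech-activity detector first isolates utterances, so these scores are computed only where dialogue is actually present.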

How Netflix Accurately Attributes eBPF Flow Logs

Netflix describes how it eliminated misattribution in eBPF TCP flow logs by attributing local IPs at the FlowExporter sidecar (using Metatron certificates and IPMan mappings on Titus) and building time-range ownership maps in FlowCollector nodes (broadcast via Kafka). This heartbeat/time-range approach enables accurate remote-IP attribution, reduces attribution latency (~1 minute vs. 15), scales to ~5M flows/sec (30 c7i.2xlarge instances), and handles cross-region and non-workload IPs (e.g., ELBs) appropriately.
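
The time-range ownership idea reduces to an interval lookup: given an IP and a flow timestamp, find which workload owned that IP at that moment. A minimal sketch (class name and shapes hypothetical) shows why this fixes misattribution when IPs are reused across workloads.

```python
class OwnershipMap:
    """Attribute (ip, timestamp) to a workload via time-range ownership
    intervals, loosely like the heartbeat-derived maps in FlowCollector."""
    def __init__(self):
        self.ranges = {}  # ip -> sorted list of (start, end, workload)

    def add(self, ip, start, end, workload):
        self.ranges.setdefault(ip, []).append((start, end, workload))
        self.ranges[ip].sort()

    def attribute(self, ip, ts):
        """Return the owning workload at time ts, or None for non-workload
        IPs (e.g., an ELB) or gaps between ownership intervals."""
        for start, end, workload in self.ranges.get(ip, []):
            if start <= ts <= end:
                return workload
        return None
```

With intervals rather than a point-in-time snapshot, a flow logged just after an IP was recycled still attributes to the workload that held the IP when the flow occurred.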