The technology behind Uber Engineering

Raising the Bar on ML Model Deployment Safety

Uber describes enhancements to its Michelangelo ML platform that raise deployment safety across the ML lifecycle: explicit data and feature validation, standardized model reports, mandatory backtesting and shadow testing, controlled rollouts with automatic rollback, and continuous monitoring via the Hue observability stack. The platform also includes a safety-scoring system integrated with CI/CD to track adoption; future work covers GenAI-assisted code checks, semantic drift detection for embeddings, and expanded truthfulness/bias monitoring.
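
As a flavor of the gating logic such a platform applies, here is a minimal sketch; the function names, metrics, and thresholds are illustrative, not Michelangelo's actual interfaces:

```python
import numpy as np

def shadow_test_passes(prod_scores, shadow_scores, max_mean_drift=0.02):
    """Compare a candidate model's shadow predictions against production's
    on the same live traffic before any rollout begins."""
    drift = np.abs(np.asarray(prod_scores) - np.asarray(shadow_scores)).mean()
    return drift <= max_mean_drift

def rollout_decision(stage_error_rate, baseline_error_rate, tolerance=0.01):
    """Gate each rollout stage on monitored metrics; a regression triggers
    an automatic rollback instead of further progression."""
    if stage_error_rate > baseline_error_rate + tolerance:
        return "rollback"
    return "advance"
```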

Enabling Deep Model Explainability with Integrated Gradients at Uber

Uber’s Michelangelo team implemented Integrated Gradients (IG) to provide high-fidelity local feature attributions for deep models across TensorFlow and PyTorch. The post covers the IG method and validation, architecture and model-wrapper design, handling categorical embeddings and multi-layer attributions, YAML-driven pipeline and notebook integration, robustness to data drift, and scaling IG computation via Ray for production use.
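
For readers unfamiliar with IG, here is a minimal PyTorch sketch of the underlying estimator (a Riemann-sum approximation of the path integral); `model` is assumed to map a batch of feature vectors to scalar scores, and this is not Michelangelo's wrapper code:

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=64):
    """IG_i ≈ (x_i - x'_i) * mean over alpha of dF/dx_i along the straight
    path from baseline x' to input x."""
    if baseline is None:
        baseline = torch.zeros_like(x)              # common default baseline
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(-1)
    # Points interpolated between baseline and input, shape (steps, features).
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    model(path).sum().backward()                    # one backward pass for all steps
    avg_grads = path.grad.mean(dim=0)               # average gradient along the path
    # Completeness: attributions approximately sum to F(x) - F(baseline).
    return (x - baseline) * avg_grads
```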

Rebuilding Uber’s Apache Pinot™ Query Architecture

Uber describes rebuilding its Apache Pinot query architecture by replacing a Presto-over-Pinot layer (Neutrino) with a simpler design that uses Pinot’s engines (including a new MSE Lite Mode) plus a lightweight Cellar proxy. The changes aim to simplify semantics, improve resource isolation and scalability, support M3QL/time-series use cases, and preserve low-latency, high-QPS OLAP query performance.

Cadence Workflow Joins the Cloud Native Computing Foundation

Uber announces that Cadence, its open-source, fault-tolerant, and highly scalable workflow orchestration engine, has joined the Cloud Native Computing Foundation. The post highlights Cadence’s capabilities (managing distributed state, retries, scaling, failure recovery), its enterprise-readiness improvements (scalability, reliability, multitenancy, portability), the move to CNCF community infrastructure (Slack, GitHub project boards, roadmaps, meetings), and its intended role in durable orchestration including AI workloads.

How Uber Standardized Mobile Analytics for Cross-Platform Insights

Uber standardized mobile analytics by defining cross-platform event types (tap, impression, scroll), encapsulating emission logic in AnalyticsBuilder classes, adding standardized metadata (app-level, event-type, standard surface), sampling 0.1% of sessions to log all events, running a pilot for validation, and migrating legacy events with tooling. The changes improved data consistency, cross-platform parity, developer ergonomics, and data quality, with future plans for componentization to further simplify naming and lifecycle management.
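
To make the builder pattern concrete, here is a language-agnostic sketch in Python (the real builders live in the mobile codebases); the class shape, metadata fields, and sampling scheme are illustrative:

```python
import hashlib

class AnalyticsBuilder:
    """Illustrative stand-in for the AnalyticsBuilder classes: one place that
    stamps standardized metadata onto a typed event before emission."""
    def __init__(self, surface, app_metadata, emit):
        self.surface = surface            # standard surface identifier
        self.app_metadata = app_metadata  # app-level metadata (version, locale, ...)
        self.emit = emit                  # transport callback

    def tap(self, element_id, **event_metadata):
        self._build("tap", element_id, event_metadata)

    def impression(self, element_id, **event_metadata):
        self._build("impression", element_id, event_metadata)

    def _build(self, event_type, element_id, event_metadata):
        self.emit({"type": event_type, "element": element_id,
                   "surface": self.surface, **self.app_metadata, **event_metadata})

def session_fully_logged(session_id, rate=0.001):
    """Deterministic 0.1% per-session sampling decision (hypothetical scheme)."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate
```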

Uber’s Strategy to Upgrading 2M+ Spark Jobs

Uber migrated its large Spark estate (millions of daily applications) from Spark 2.4 to Spark 3.3 by rebasing its internal Spark fork, resolving ecosystem dependencies (Python/Scala/Parquet), applying automated code transforms (Polyglot Piranha), and building a safe shadow-testing framework (Iron Dome) with orchestration via Cadence. The migration yielded broad performance and cost improvements and enabled adoption of Kubernetes and JDK 17.
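
A minimal PySpark sketch of the kind of output diff a shadow-testing framework like Iron Dome performs: the same job runs on both Spark versions, each writing to its own table, and the outputs are compared before the job is promoted. Table names and the pass criterion are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shadow-diff").getOrCreate()

legacy = spark.table("shadow.job_output_spark24")
candidate = spark.table("shadow.job_output_spark33")

# Rows present in one output but not the other (exact-match comparison).
only_in_legacy = legacy.exceptAll(candidate).count()
only_in_candidate = candidate.exceptAll(legacy).count()

if only_in_legacy == 0 and only_in_candidate == 0:
    print("outputs match: safe to promote this job to Spark 3.3")
else:
    print(f"mismatch: {only_in_legacy} legacy-only rows, "
          f"{only_in_candidate} candidate-only rows")
```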

Adding Determinism and Safety to Uber IAM Policy Changes

Uber built an internal Policy Simulator integrated into its Unified Security Console to preview the impact of IAM policy changes before deployment. The simulator fetches recent access logs (from M3/Hive → Pinot), replays them on two local authorization engines (reference and proposed), and compares results to report potential access changes. The system uses Cadence for workflow orchestration, optimizes ingestion for low-latency queries, and aims to extend support to ABAC and automated policy unit tests.
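
The simulator's core replay-and-diff loop can be sketched as follows, assuming two engines that expose a common authorize(request) interface; the names are illustrative, not Uber's internal API:

```python
def simulate(access_logs, reference_engine, proposed_engine):
    """Replay logged requests on both engines and report decision changes."""
    newly_denied, newly_allowed = [], []
    for request in access_logs:
        before = reference_engine.authorize(request)
        after = proposed_engine.authorize(request)
        if before and not after:
            newly_denied.append(request)    # change would break existing access
        elif after and not before:
            newly_allowed.append(request)   # change would widen access
    return {"newly_denied": newly_denied, "newly_allowed": newly_allowed}
```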

Open-Sourcing Starlark Worker: Define Cadence Workflows with Starlark

Uber open-sources Starlark Worker, a Cadence-based workflow execution platform that uses the Starlark scripting language to define and run workflows. The architecture combines Cadence's orchestration with a Go-implemented Starlark interpreter, enabling serverless, multi-tenant workflow execution with domain-specific function extensions. The post covers use cases (including ML workloads via Michelangelo and Ray), deployment models, and the project’s GitHub release.

Building Uber’s Data Lake: Batch Data Replication Using HiveSync

Uber describes HiveSync, an event-driven service that replicates Hive/HDFS data from a primary to a secondary region for disaster recovery and active analytics. The post covers architecture (control vs. data plane), event logging (Hive Metastore hook → MySQL), job lifecycle (FSM), small-job RPC vs. DistCp-on-YARN for large copies, ordering/locking via a DAG manager, sharding for horizontal scale, and Data Reparo/OTRS to ensure cross-region consistency and fast onboarding at massive scale.
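
The small-job vs. large-job routing decision might look roughly like this; the size cutoff and job fields are hypothetical:

```python
SMALL_JOB_BYTES = 1 << 30   # illustrative 1 GiB cutoff, not Uber's threshold

def plan_copy(job):
    """Route a replication job as the post describes: small copies go over
    direct RPC, large ones are handed to DistCp on YARN."""
    if job.total_bytes <= SMALL_JOB_BYTES:
        return {"executor": "rpc", "paths": job.paths}
    return {
        "executor": "distcp-on-yarn",
        "paths": job.paths,
        # A DistCp invocation of roughly this shape does the heavy lifting.
        "command": ["hadoop", "distcp", job.source_uri, job.dest_uri],
    }
```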

Controlling the Rollout of Large-Scale Monorepo Changes

Uber added incremental cross-service rollout orchestration to its continuous deployment system to limit blast radius from monorepo commits that affect many microservices. They implemented a commit-level state machine that aggregates deployment signals across cohorts (using service tiering), flags commit issues, and gates progression. They built a simulator to tune parameters (targeting unblocking within 24 hours) and validated the approach in production. The feature is also used for incremental rollouts of many identical stateless ML-serving services.
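
A toy version of the commit-level state machine, with illustrative states and thresholds:

```python
class CommitRollout:
    """A commit progresses through cohorts (e.g., lowest-tier services first)
    and is blocked when aggregated deployment signals look bad."""
    def __init__(self, cohorts, max_failure_rate=0.05):
        self.cohorts = cohorts
        self.index = 0
        self.state = "rolling"
        self.max_failure_rate = max_failure_rate

    def report(self, deploy_results):
        """Aggregate signals for the current cohort and decide whether the
        commit may progress to the next one."""
        failures = sum(1 for ok in deploy_results if not ok)
        if failures / max(len(deploy_results), 1) > self.max_failure_rate:
            self.state = "blocked"      # flag the commit; stop the rollout
        elif self.index + 1 < len(self.cohorts):
            self.index += 1             # gate passed: widen the blast radius
        else:
            self.state = "complete"
```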

How Uber Serves over 150 Million Reads per Second from Integrated Cache with Stronger Consistency Guarantees

Uber describes improvements to CacheFront (Docstore's integrated cache) to provide stronger consistency while scaling to >150M reads/sec. Key changes include returning the exact set of row keys and monotonic session timestamps from the MySQL-based storage engine so the query layer can synchronously or asynchronously invalidate Redis cache entries (overwriting them with invalidation markers), continued use of Flux (CDC) and Lua-based deduplication, and a Cache Inspector to measure staleness. These changes enabled higher cache hit rates (99.9%) and longer TTLs for some tables.
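
The invalidation-marker idea can be sketched with redis-py; the marker encoding and TTL here are hypothetical:

```python
import redis

r = redis.Redis()
INVALIDATION_TTL = 30   # illustrative; real TTLs are per-table

def invalidate_rows(row_keys, session_timestamp):
    """Overwrite cached rows with invalidation markers, as the post describes,
    so stale values cannot be served while the database write settles."""
    pipe = r.pipeline()
    for key in row_keys:
        pipe.set(key, f"__invalidated__:{session_timestamp}", ex=INVALIDATION_TTL)
    pipe.execute()
```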

Lightweight Office Infrastructure: Transitioning from Backbone to SD-WAN

Uber Engineering describes replacing a centralized PoP/backbone office network with a decentralized, cloud-native SD-WAN (with ZTP and IaC templates), enabling direct internet access (DIA) links, reduced latency, faster global deployments, and tighter automation and observability integration, with plans for AI-driven proactive incident detection and self-healing.

Forecasting Models to Improve Driver Availability at Airports

Uber Engineering describes three airport-specific forecasting systems (Estimated Time to Request, Earnings Per Hour via Deep GMM, and Driver Deficit Forecasting using a Transformer-encoder) built to predict short-horizon marketplace signals. The post details real-time data ingestion and feature pipelines (Apache Flink, Spark, Cassandra), modeling choices for probabilistic and time-series forecasts, production challenges (label quality, delayed features), and integrations that surface signals in the driver app and a driver-summoning service to improve airport availability and rider experience.
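
A deep GMM output head of the kind used for probabilistic earnings forecasts can be sketched in PyTorch; this is the standard mixture-density formulation with illustrative sizes, not Uber's exact model:

```python
import torch
import torch.nn as nn

class GaussianMixtureHead(nn.Module):
    """The network emits mixture weights, means, and scales, and is trained
    by minimizing the mixture negative log-likelihood."""
    def __init__(self, hidden_dim, n_components):
        super().__init__()
        self.logits = nn.Linear(hidden_dim, n_components)
        self.means = nn.Linear(hidden_dim, n_components)
        self.log_scales = nn.Linear(hidden_dim, n_components)

    def nll(self, h, y):
        log_weights = torch.log_softmax(self.logits(h), dim=-1)
        comps = torch.distributions.Normal(self.means(h), self.log_scales(h).exp())
        # log p(y) = logsumexp_k [ log w_k + log N(y | mu_k, sigma_k) ]
        log_probs = comps.log_prob(y.unsqueeze(-1))
        return -torch.logsumexp(log_weights + log_probs, dim=-1).mean()
```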

Locking Down the Fleet: Encryption at Rest and Disk Isolation at Scale

Uber migrated Odin from a shared RAID0 filesystem to per-workload logical volumes using LVM and block-level encryption with LUKS/dm-crypt (managed via cryptsetup). They adopted thick provisioning, built control-plane-driven disk allocation (DDMS) plus a host-agent extension loop, chose not to support in-place shrinking, optimized cryptsetup/dm-crypt behavior to avoid IO stalls, tuned encryption queues for different workloads, and prepared integration via the CSI interface for future Kubernetes adoption.
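
The per-workload provisioning flow reduces to a short sequence of LVM and cryptsetup steps; this sketch uses the standard CLI forms, not Uber's DDMS/host-agent code:

```python
import subprocess

def provision_encrypted_volume(vg, lv_name, size_gib, key_file):
    """Carve a thick-provisioned logical volume with LVM, then put
    LUKS/dm-crypt on it and open the mapping."""
    device = f"/dev/{vg}/{lv_name}"
    # Thick-provisioned logical volume, one per workload.
    subprocess.run(["lvcreate", "-L", f"{size_gib}G", "-n", lv_name, vg],
                   check=True)
    # Initialize LUKS on the volume, then open it as a dm-crypt mapping.
    subprocess.run(["cryptsetup", "luksFormat", device,
                    "--key-file", key_file, "--batch-mode"], check=True)
    subprocess.run(["cryptsetup", "open", device, f"{lv_name}-crypt",
                    "--key-file", key_file], check=True)
    return f"/dev/mapper/{lv_name}-crypt"
```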

uReview: Scalable, Trustworthy GenAI for Code Review at Uber

Uber’s uReview is an AI-powered code review platform that uses a modular, multi-stage GenAI system to generate, filter, validate, and deduplicate code review comments, with Fixer proposing actual code changes. It evaluates LLMs and prompts across a large usage footprint (65k diffs per week) and integrates into Uber’s CI/code-review workflows. The architecture relies on prompt chaining, per-assistant grading, semantic filtering, and data streaming to Apache Hive via Apache Kafka. It is deployed across Uber’s monorepos (Go, Java, Android, iOS, TypeScript, Python) and leverages multiple AI models (Claude-4-Sonnet, GPT-4.1, o4-mini-high, Llama-4, etc.), alongside tooling like Phabricator and GitHub Copilot. The post covers design philosophy (precision over volume), evaluation, and lessons learned for scalable GenAI deployments.
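
The generate → filter → validate → deduplicate structure can be sketched as a simple pipeline; the stage internals here are placeholders, not uReview's implementation:

```python
def review(diff, generate, is_relevant, is_valid, dedup_key):
    """Run candidate comments through successive quality gates so only
    precise, non-duplicate findings reach a reviewer."""
    candidates = generate(diff)                         # LLM proposes comments
    filtered = [c for c in candidates if is_relevant(c)]      # semantic filtering
    validated = [c for c in filtered if is_valid(c, diff)]    # per-assistant grading
    seen, final = set(), []
    for comment in validated:
        key = dedup_key(comment)            # e.g. (file, line, issue type)
        if key not in seen:
            seen.add(key)
            final.append(comment)
    return final
```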

From Restaurants to Retail: Scaling Uber Eats for Everything

Uber describes INCA, a scalable inventory-and-catalog platform that ingests retailer CSVs (SFTP), enriches and validates sparse data (online/offline enrichers, LLMs, ML flows), publishes merged entities via a prioritized merge logic, snapshots/version-controls published catalogs, and indexes data for search and queries. Key components include Protobuf schemas, Starlark CSV mappers, Cadence workflows, and the pipeline split (ingest/storage/publish/index) to meet high throughput and low-latency targets at massive scale.
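
The prioritized merge logic might look roughly like this; source names and ordering are illustrative:

```python
# Highest-priority source first; each attribute is taken from the highest-
# priority source that supplies it.
SOURCE_PRIORITY = ["merchant_feed", "llm_enricher", "offline_enricher", "default"]

def merge_entity(candidates):
    """candidates: {source_name: {attribute: value}} for one catalog item."""
    merged = {}
    for source in reversed(SOURCE_PRIORITY):    # lowest priority first...
        merged.update({k: v for k, v in candidates.get(source, {}).items()
                       if v is not None})       # ...so higher sources overwrite
    return merged
```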

PerfInsights: Detecting Performance Optimization Opportunities in Go Code using Generative AI

PerfInsights is an Uber Engineering project that automatically detects performance antipatterns in Go services by combining production CPU/memory profiling with GenAI-powered static analysis, validated by LLM juries and LLMCheck. It aims to surface high-impact optimizations, reduce false positives, and integrate results into automated downstream tasks and CI/CD for Go back-end services.
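
The profile-guided filtering idea reduces to: rank functions by production cost, then ask an LLM about only the hot ones. A schematic sketch, with a placeholder profile format and llm call rather than PerfInsights' real interfaces:

```python
def find_candidates(profile, source_of, llm, top_n=20, min_cpu_pct=1.0):
    """profile: iterable of {"name": str, "cpu_pct": float} entries derived
    from production profiling; only hot functions are sent for analysis."""
    hot = sorted(profile, key=lambda f: f["cpu_pct"], reverse=True)[:top_n]
    findings = []
    for fn in (f for f in hot if f["cpu_pct"] >= min_cpu_pct):
        prompt = ("Identify Go performance antipatterns (e.g., allocations in "
                  "hot loops, repeated regex compilation) in this function:\n"
                  + source_of(fn["name"]))
        findings.append({"function": fn["name"], "report": llm(prompt)})
    return findings
```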

Unlocking Financial Insights with Finch: Uber’s Conversational AI Data Agent

Uber's engineering blog describes Finch, an AI-powered Slack data agent that converts natural-language queries into SQL to retrieve real-time financial data across Uber's data platforms. Finch uses generative AI, retrieval-augmented generation, and self-querying agents orchestrated via LangChain's LangGraph, with metadata stored in an OpenSearch index. It integrates with data sources like Presto, IBM Planning Analytics, and Oracle EPM, and exports results to Google Sheets. The architecture emphasizes security (RBAC), modular agents, and real-time Slack interactions via the Slack SDK and Slack AI Assistant APIs.
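
The agent loop has roughly this shape; search_metadata, llm, and run_sql are generic stand-ins for the OpenSearch lookup, model call, and Presto execution, not Finch's real interfaces:

```python
def answer_financial_question(question, search_metadata, llm, run_sql):
    # Retrieval-augmented generation: fetch table/column metadata relevant
    # to the question so the model writes SQL against real schemas.
    schema_context = search_metadata(question)
    prompt = ("Write a SQL query answering the question below.\n"
              f"Available tables and columns:\n{schema_context}\n"
              f"Question: {question}\nSQL:")
    sql = llm(prompt)
    # Execute against the warehouse and return rows to the Slack layer.
    return run_sql(sql)
```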

How Uber Processes Early Chargeback Signals

Uber describes how it ingests and processes early chargeback/fraud signals (TC40 and SAFE) via SFTP/APIs/webhooks, normalizes and publishes them to Apache Kafka and Apache Hive, builds near-real-time features via a streaming pipeline, deduplicates using Redis, and runs machine learning risk models and rule-based decisioning to trigger actions (e.g., penny-drop verification). The system processed millions of signals and provides earlier fraud detection (on average 4–5 days before chargebacks).
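
The Redis deduplication step can be sketched with redis-py's atomic SET NX; the key scheme and TTL are hypothetical:

```python
import redis

r = redis.Redis()
DEDUP_TTL_SECONDS = 7 * 24 * 3600   # illustrative retention window

def is_new_signal(signal_id):
    """First-writer-wins dedup: SET NX succeeds only for the first
    occurrence of a signal ID within the TTL window."""
    return bool(r.set(f"chargeback-signal:{signal_id}", 1,
                      nx=True, ex=DEDUP_TTL_SECONDS))
```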

Reinforcement Learning for Modeling Marketplace Balance

Uber models ride-matching as an infinite-horizon MDP and uses reinforcement-learning techniques (a DQN-inspired temporal-difference value-function estimator trained offline) to produce value signals that nudge online matching toward higher-value states. The system incorporates geospatial smoothing via geo-embeddings and contrastive loss, Monte Carlo ground truths for evaluation, and production practices (weekly retraining, anomaly detection, observability, and gated signal integration). Deployments across 400+ cities (and AV fleet use) yielded measurable improvements (driver earnings and fewer rider cancellations); next steps include real-time TD pipelines and exploring direct policy learning.
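
The offline temporal-difference estimator at the heart of this approach fits a value network toward one-step bootstrapped targets, r + gamma * V(s'). A minimal PyTorch sketch (the network shape and discount are illustrative, not Uber's production estimator):

```python
import torch

def td_loss(value_net, states, rewards, next_states, gamma=0.99):
    """One-step TD loss for a DQN-style value estimator trained offline
    on logged transitions (s, r, s')."""
    with torch.no_grad():
        # Bootstrapped target: immediate reward plus discounted next-state value.
        targets = rewards + gamma * value_net(next_states).squeeze(-1)
    predictions = value_net(states).squeeze(-1)
    return torch.nn.functional.mse_loss(predictions, targets)
```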