Yelp engineering blog


Yelp, 4 months ago

Exploring CHAOS: Building a Backend for Server-Driven UI

A technical deep dive into Yelp’s CHAOS server-driven UI backend: how GraphQL requests are routed via an Apollo-federated Python subgraph (using Strawberry), how multiple backend services expose REST APIs that build CHAOS configurations, and how the backend composes views using builder/provider patterns, Python dataclasses, asynchronous data loading (asyncio), error-handling wrappers, view flows, and view placeholders to enable dynamic SDUI content across iOS, Android, and web clients.
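
The combination of dataclasses, asynchronous loading, and error-handling wrappers described above can be illustrated with a minimal sketch. Every name here (`Component`, `View`, `safe`, the two providers, `build_view`) is hypothetical and chosen for illustration; none of it is the actual CHAOS API, and the real backend is likely structured differently.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Optional


@dataclass
class Component:
    # Hypothetical component model, not the real CHAOS schema.
    component_type: str
    props: dict = field(default_factory=dict)


@dataclass
class View:
    view_id: str
    components: List[Component] = field(default_factory=list)


async def safe(provider: Callable[[], Awaitable[Component]]) -> Optional[Component]:
    """Error-handling wrapper: a failing provider yields None instead of failing the view."""
    try:
        return await provider()
    except Exception:
        return None


async def header_provider() -> Component:
    await asyncio.sleep(0)  # stands in for an asynchronous data load
    return Component("header", {"text": "Welcome"})


async def broken_provider() -> Component:
    raise RuntimeError("upstream service unavailable")


async def build_view(view_id: str, providers) -> View:
    # Run all providers concurrently, then keep only the components that loaded.
    results = await asyncio.gather(*(safe(p) for p in providers))
    return View(view_id, [c for c in results if c is not None])


view = asyncio.run(build_view("home", [header_provider, broken_provider]))
```

In this sketch the view is still returned when one provider raises; the failed component is simply left out, which is one plausible reading of what an error-handling wrapper is for.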

Yelp, 5 months ago

Revenue Automation Series: Testing an Integration with Third-Party System

Yelp describes a testing and integration strategy for its Revenue Data Pipeline that introduced a parallel staging pipeline to avoid Redshift connector latency by publishing to AWS Glue and querying via Redshift Spectrum. The team implemented daily and monthly SQL-based integrity checks, schema validation via REST API calls, and two upload methods (REST APIs and SFTP), choosing SFTP for reliability and larger file sizes; they also discuss test-data generation, logging/monitoring, external support, and future automation improvements.

Yelp, 6 months ago

Nrtsearch 1.0.0: Incremental Backups, Lucene 10, and More

Nrtsearch 1.0.0 adds incremental backup support using AWS EBS and S3 storage, upgrades its indexing engine to Lucene 10, and integrates HNSW-based approximate nearest neighbor search. The release also includes scatter-gather query improvements, enhanced memory management, and various performance optimizations.

Yelp, 6 months ago

Journey to Zero Trust Access

Yelp describes its shift from a legacy VPN to a Zero Trust Access model to secure a fully distributed workforce. The post details the architecture, including SAML-based SSO integration and Devbox remote development servers, which provides consistent, authenticated access to internal resources.

Yelp, 8 months ago

Revenue Automation Series: Building Revenue Data Pipeline

Yelp describes building a revenue data pipeline to supply a third‑party revenue recognition SaaS (REVREC). They translated accounting requirements to engineering terms, performed data gap analysis against MySQL/order‑to‑cash systems, evaluated architectures (MySQL+Python batch, Data Warehouse+dbt, event streams, Data Lake+Spark ETL) and chose Data Lake + Spark ETL. Implementation details include an internal spark‑etl framework organizing Spark features, PySpark transformations and UDFs, snapshotting tables to S3, dependency YAML, spark‑submit runs, and JupyterHub for debugging. The post also outlines future improvements around data interfaces and simplified data models.

Yelp, 9 months ago

Search Query Understanding with LLMs: From Ideation to Production

Yelp describes how it integrated LLMs into query understanding for search (segmentation, spell correction, review-highlight phrase generation). The post covers formulation and RAG augmentation, prompt engineering and POCs (offline + A/B), building golden datasets, fine-tuning smaller models for cost/latency, pre-computing head-query outputs into datastores/key-value caches, and serving real-time inference (BERT/T5) to scale to production.
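
The split between pre-computed head-query outputs and real-time inference for everything else can be sketched as a cache lookup with a model fallback. This is an illustration under assumptions: `make_query_understander`, the normalization rule, and `fake_model` are all hypothetical, and the stand-in model is a trivial string replacement rather than a BERT or T5 model.

```python
from typing import Callable, Dict, List


def make_query_understander(
    precomputed: Dict[str, str],
    realtime_model: Callable[[str], str],
) -> Callable[[str], str]:
    """Serve head queries from a precomputed store; send the rest to a model."""

    def understand(query: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        key = " ".join(query.lower().split())
        if key in precomputed:
            return precomputed[key]
        return realtime_model(key)

    return understand


model_calls: List[str] = []


def fake_model(query: str) -> str:
    # Stand-in for real-time inference; records which queries reached it.
    model_calls.append(query)
    return query.replace("piza", "pizza")


understand = make_query_understander({"cofee shop": "coffee shop"}, fake_model)
head_result = understand("  Cofee   Shop ")  # answered from the precomputed store
tail_result = understand("vegan piza")       # falls through to the model
```

Only the tail query reaches the model here, which is the cost and latency benefit the summary attributes to pre-computing head-query outputs.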

Yelp, 9 months ago

Enhancing Neural Network Training at Yelp: Achieving 1,400x Speedup with WideAndDeep

Yelp optimized training of Wide & Deep pCTR models by building ArrowStreamServer (PyArrow) to stream Parquet data from S3 into TensorFlow Datasets and by switching distributed training from TensorFlow’s MirroredStrategy to Horovod. Combined IO and distributed-training changes (plus resource tuning and Spark integration via KerasEstimator/Runner wrappers) produced ~1,400x speedup on large datasets.

Yelp, 10 months ago

Revisiting Compute Scaling

Yelp migrated Kubernetes node autoscaling from an internal ASG-based autoscaler (Clusterman) to AWS Karpenter. The move was driven by Clusterman’s difficulties with setpoint tuning, instance-type inflexibility, interval-based slowness, and operational overhead. Karpenter provided event-driven, faster scaling, better bin-packing, mixed spot/on-demand handling, nodepools, TTLs, and metrics; migration required adjustments (node replacement strategy, PDBs, dashboards) and fixes around ephemeral storage, launch templates, blockDeviceMappings and kubelet config alignment. Karpenter improved spending efficiency by ~25%.

Yelp, 11 months ago

Revenue Automation Series: Modernizing Yelp's Legacy Billing System

Yelp ran a multi-year, cross-functional program to modernize its legacy billing system (removing an invoice-obviation model and enabling one-to-many payment-to-invoice relationships) so it could integrate with third-party revenue automation. The post describes the execution plan — requirement gathering (using Domain Driven Design concepts), target architecture choices, project planning with cross-functional 'tiger' teams, incremental rollouts (single-group A/B-style rollout), thorough user acceptance testing, and layered observability (alerts/logging, integrity checkers, dashboards). The program involved 50+ people over more than two years and reached 100% adoption by July 2024; the focus of the article is on how the initiative was executed rather than implementation details.

Yelp, 1 year ago

Loading data into Redshift with DBT

Yelp replaced Spark-based S3->Redshift batch loads with dbt-driven copies that use AWS Redshift Spectrum to read from the Data Lake (Glue/S3). This simplified schema changes and backfills, added deduplication (via dbt_expectations), improved data consistency, and dramatically reduced runtime (from ~2 hours to ~10 minutes) across multiple datasets.

Yelp, 1 year ago

How we improved our Android navigation performance by ~30%

Yelp’s Core Android team migrated the Consumer app from many activities to a single-activity, fragment-based navigation model. They benchmarked views vs fragments, evaluated third-party libs and Jetpack Navigation, implemented plain fragments with a SingleActivityNavigator abstraction and DI-based fragment retrieval across Gradle modules, improved deeplink handling, and measured an average ~30% navigation performance gain in production.

Yelp, 1 year ago

Migrating in-place from PostgreSQL to MySQL

Yelp migrated its Reservations service from a Postgres database (used only by this service) to a Yelp-standard MySQL instance without downtime by running both DBs in parallel, replacing Postgres-specific features (arrays, triggers, extensions) with application-level logic and new tables, publishing events to AMQP/RabbitMQ from post-commit hooks, grouping transaction changes with UUIDs, and carefully rolling out read/write routing in Django (custom models, querysets, router and middleware). The rollout addressed issues like autoincrement/sequence differences, ProxySQL query-digest/savepoint problems, bulk vs object-level writes, large backfills, and analytics consumers before fully switching off Postgres.

Yelp, 1 year ago

Boosting ML Pipeline Efficiency: Direct Cassandra Ingestion from Spark

Yelp improved its ML Feature Store publishing workflow by enabling direct writes from Spark (the in-house PySpark ETL) to Cassandra using the open-source Spark Cassandra Connector. To protect production Cassandra clusters they disabled Spark batching for Cassandra writes, implemented static rate-limiting and concurrency limits (including distributed locks via ZooKeeper), and tuned executor/core calculations for Spark dynamic resource allocation. The change removed a dependency on Yelp’s Data Pipeline and Avro-publishing path, yielding ~30% ML infrastructure cost savings and ~25% improvement in developer efficiency. Future work discussed includes using Spark Bulk Analytics to bypass Cassandra native transport limits.

Yelp, 1 year ago

dbt Generic Tests in Sessions Validation at Yelp

Yelp describes using dbt generic tests to validate and accelerate development of their Sessions Data Mart. The post covers challenges with manual SQL checks, how dbt tests (and packages like dbt-expectations), Jinja-based macros, and the store_failures feature help run reproducible, parameterized tests across Redshift and S3+Athena, and how tests are categorized and executed (dev/prod/daily) using tags and dbt commands.

Yelp, 1 year ago

Implementing multi-metric scaling: making changes to legacy code safely

Yelp enabled multi-metric horizontal autoscaling in PaaSTA by changing the autoscaling API to accept multiple metric providers, adding stricter validation (paasta validate), using snapshot tests of the generated Prometheus Adapter config to ensure no behavioral regression, and improving Grafana dashboards and alerts. The rollout was staged (supporting both old and new config formats, migrating soaconfigs, then removing the old format) and completed with no downtime.
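
The snapshot-testing idea, supporting both config formats while checking that the generated output does not change, can be sketched as follows. The config keys, `normalize`, `generate_config`, and the output shape are hypothetical stand-ins; they are not PaaSTA's real schema or the real Prometheus Adapter config.

```python
import json
from typing import Dict, List

# Output recorded before the refactor (hypothetical shape).
SNAPSHOT = '{"rules": [{"name": "example_service-cpu", "target": 0.8}]}'


def normalize(config: Dict) -> List[Dict]:
    """Accept both the old single-metric format and the new multi-metric one."""
    if "metrics_providers" in config:
        return config["metrics_providers"]
    return [{"type": config["metrics_provider"], "setpoint": config["setpoint"]}]


def generate_config(service: str, config: Dict) -> str:
    rules = [
        {"name": f"{service}-{m['type']}", "target": m["setpoint"]}
        for m in normalize(config)
    ]
    # sort_keys keeps the serialized output stable so it can be compared verbatim.
    return json.dumps({"rules": rules}, sort_keys=True)


old_style = {"metrics_provider": "cpu", "setpoint": 0.8}
new_style = {"metrics_providers": [{"type": "cpu", "setpoint": 0.8}]}

old_output = generate_config("example_service", old_style)
new_output = generate_config("example_service", new_style)
```

If both formats produce output identical to the recorded snapshot, the migration from one format to the other has not changed behavior for that service.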

Yelp, 1 year ago

Fine-tuning AWS ASGs with Attribute Based Instance Selection

Yelp improved its autoscaling by switching AWS Auto Scaling Groups to attribute-based instance selection (ABS), enabling ASGs to choose eligible EC2 instance types from attribute constraints rather than hardcoded lists. This reduced operational overhead, expanded eligible spot capacity, and lowered costs (up to 37% in one cluster). They evaluated spot allocation strategies (lowest-price, capacity-optimized, price-capacity-optimized), adopted price-capacity-optimized broadly, updated scripts to query instance attributes at runtime (via the AWS CLI), adjusted IaC (Terraform), and smoothed migration from their in-house autoscaler Clusterman to Karpenter.

Yelp, 1 year ago

Moderating Inappropriate Video Content at Yelp

Yelp describes its video-moderation pipeline: uploaded videos are checked via a matching service (similarity hashes) and then scored by a deep-learning, multi-label classifier applied to sampled frames. Videos above thresholds are hidden and sent to human reviewers; false positives can be restored. To handle large video sizes and enable near-real-time moderation, Yelp reduces inference load through pre-emptive blocking of suspicious uploaders and selective frame sampling, and reuses its photo moderation model to minimize development cost.
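
The frame-sampling and thresholding step can be sketched as below. This is an assumption-laden illustration: the label names, thresholds, sampling interval, and `fake_classifier` are invented, and taking the maximum score per label across sampled frames is one possible aggregation rule, not necessarily the one Yelp uses.

```python
from typing import Callable, Dict, List, Sequence


def sample_frames(frames: Sequence, every_n: int) -> List:
    """Selective sampling: keep every n-th frame to reduce inference load."""
    return list(frames[::every_n])


def flagged_labels(
    frames: Sequence,
    classify: Callable[[object], Dict[str, float]],
    thresholds: Dict[str, float],
    every_n: int = 10,
) -> List[str]:
    """Return labels whose highest score over the sampled frames meets its threshold."""
    scores = [classify(frame) for frame in sample_frames(frames, every_n)]
    flagged = []
    for label, threshold in thresholds.items():
        top = max((s.get(label, 0.0) for s in scores), default=0.0)
        if top >= threshold:
            flagged.append(label)
    return flagged


def fake_classifier(frame: int) -> Dict[str, float]:
    # Stand-in for the multi-label model: only frame 20 looks problematic.
    if frame == 20:
        return {"label_a": 0.95, "label_b": 0.10}
    return {"label_a": 0.05, "label_b": 0.10}


frames = list(range(30))
thresholds = {"label_a": 0.90, "label_b": 0.80}

flagged = flagged_labels(frames, fake_classifier, thresholds, every_n=10)
hide_and_review = bool(flagged)  # hidden and queued for human review if anything is flagged
missed = flagged_labels(frames, fake_classifier, thresholds, every_n=7)
```

The second call shows the tradeoff of sampling: with a different interval the one problematic frame is never scored, so nothing is flagged.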

Yelp, 1 year ago

Phone Number Masking for Yelp Services Projects

Yelp describes an in-house phone masking system for its Services Marketplace that uses telephony API integration and a "masking session" data model to route calls and SMS through proxy numbers. The engineering focus is on minimizing the pool of purchased proxy numbers via recycling and reuse strategies (culminating in separate customer/business proxy pools) to make costs and resource needs scale efficiently while preserving privacy and conversation continuity.
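
One common way to keep a proxy-number pool small is to route on the pair (caller, proxy number), so the same proxy number can serve different callers at once. The sketch below illustrates that general idea only; `ProxyPool`, its methods, and the phone numbers are hypothetical, and the summary does not confirm that Yelp's masking sessions work this way.

```python
from typing import Dict, Iterable, Optional, Tuple


class ProxyPool:
    """Hypothetical pool in which a proxy number is unique per caller, not globally."""

    def __init__(self, numbers: Iterable[str]) -> None:
        self.numbers = list(numbers)
        # (caller, proxy number) -> the real party behind the proxy
        self.routes: Dict[Tuple[str, str], str] = {}

    def assign(self, caller: str, callee: str) -> str:
        # Keep conversation continuity: reuse the proxy already given to this pair.
        for (known_caller, proxy), target in self.routes.items():
            if known_caller == caller and target == callee:
                return proxy
        # Otherwise take the first proxy this caller is not already using.
        for proxy in self.numbers:
            if (caller, proxy) not in self.routes:
                self.routes[(caller, proxy)] = callee
                return proxy
        raise RuntimeError("no free proxy number for this caller")

    def route(self, caller: str, proxy: str) -> Optional[str]:
        return self.routes.get((caller, proxy))


pool = ProxyPool(["+15550001", "+15550002"])
a_to_x = pool.assign("customer-a", "business-x")
a_to_y = pool.assign("customer-a", "business-y")
b_to_x = pool.assign("customer-b", "business-x")  # reuses the first proxy number
```

Here two purchased numbers cover three conversations, because `customer-a` and `customer-b` can share a proxy number without ambiguity.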

Yelp, 1 year ago

CHAOS: Yelp's Unified Framework for Server-Driven UI

Yelp built CHAOS, a unified server-driven UI framework that delivers versioned UI configurations to web (React), iOS, and Android clients via a GraphQL API. Due to compatibility and fragment/versioning issues with GraphQL types, components and actions are modeled as versioned JSON/REST objects served by Python microservice backends; the GraphQL layer (Apollo Server) dispatches queries and passes platform/app-version metadata. The post explains design tradeoffs (GraphQL vs REST), component/action versioning, YAML-based availability configuration, client libraries, use cases, and future plans (automated previews, no-code editing, machine-learning-based optimization).

Yelp, 1 year ago

Keeping track of engineering-wide goals and migrations

This post describes Yelp's EE Metrics platform — an internal system for collecting engineering metrics, running audits, surfacing team health reports, and tracking org-wide required migrations. It explains the architecture (backend + frontend + events pipeline), how audits and scores are computed and presented, the governance process for designating required migrations, and examples of how the system helped teams (e.g., surfacing test/experiment issues). The goal is to increase visibility into technical debt and prioritize cross-team engineering work.