Booking engineering blog

Booking, 4 months ago

Tactical Coding Assistants

The author argues that current LLM coding assistants excel at tactical programming (quick, direct implementations) but can generate complexity and technical debt if left unchecked. Developers should remain the strategic architects — constraining prompts and reviewing generated code. The post demonstrates a "no leash" extension to a Python report generator and a refactor using a Strategy/Command-like abstraction, and it references LLMs (Gemini 2.5 Pro, Claude 3.7), prompt engineering, REST scaffolding, and unit tests.
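
As a rough illustration of the kind of refactor the summary mentions, the sketch below puts each output format of a report generator behind a small strategy interface, so that a new format becomes a new class rather than another branch in one function. All names here (`ReportFormat`, `CsvFormat`, `MarkdownFormat`, `generate_report`) are hypothetical and are not taken from the post.

```python
from typing import Protocol


class ReportFormat(Protocol):
    """Strategy interface: turns a list of row dicts into report text."""

    def render(self, rows: list[dict[str, object]]) -> str: ...


class CsvFormat:
    """Hypothetical strategy: comma-separated output with a header row."""

    def render(self, rows: list[dict[str, object]]) -> str:
        if not rows:
            return ""
        header = ",".join(rows[0].keys())
        lines = [",".join(str(v) for v in row.values()) for row in rows]
        return "\n".join([header, *lines])


class MarkdownFormat:
    """Hypothetical strategy: a Markdown table."""

    def render(self, rows: list[dict[str, object]]) -> str:
        if not rows:
            return ""
        cols = list(rows[0].keys())
        head = "| " + " | ".join(cols) + " |"
        sep = "| " + " | ".join("---" for _ in cols) + " |"
        body = ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
        return "\n".join([head, sep, *body])


def generate_report(rows: list[dict[str, object]], fmt: ReportFormat) -> str:
    # The generator no longer knows about concrete formats; it only
    # delegates to whichever strategy the caller passes in.
    return fmt.render(rows)
```

A caller would pick the strategy at the edge, for example `generate_report(rows, CsvFormat())`, which keeps the strategic decision (which formats exist) with the developer rather than in an ever-growing conditional.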

Booking, 4 months ago

Unlocking the Power of Customization: How Our Enrichment System Transforms Recommendation Data…

Booking.com's engineering post explains a new Enrichment System for their Recommendation Platform that uses GraphQL/field-mask-driven, on-demand enrichments to decouple enrichment from prediction. The system improves isolation, reusability and performance (up to 100k enrichments/sec, 99.99% availability) and is observable via dashboards; example enrichments include pricing, wishlist counts, recommended images and tracking actions.
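
The idea of field-mask-driven, on-demand enrichment can be sketched as follows: the caller lists the fields it wants, and only the matching enrichers run. This is a minimal sketch under stated assumptions; the registry, the field names, and the placeholder return values are hypothetical and do not reflect the actual system's API.

```python
from typing import Callable

# Hypothetical registry: one enricher per field name. Real enrichers would
# call pricing, wishlist, or image services; these return fixed placeholders.
ENRICHERS: dict[str, Callable[[str], object]] = {
    "price": lambda item_id: {"amount": 100, "currency": "EUR"},
    "wishlist_count": lambda item_id: 3,
    "image": lambda item_id: f"https://example.invalid/{item_id}.jpg",
}


def enrich(item_ids: list[str], field_mask: list[str]) -> list[dict[str, object]]:
    """Attach only the requested fields to each recommended item."""
    unknown = set(field_mask) - set(ENRICHERS)
    if unknown:
        raise KeyError(f"unknown enrichment fields: {sorted(unknown)}")
    return [
        {"id": item_id, **{field: ENRICHERS[field](item_id) for field in field_mask}}
        for item_id in item_ids
    ]
```

Because enrichers that are not named in the mask never run, the prediction step stays independent of which enrichments a given client needs.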

Booking, 6 months ago

Anomaly Detection in Time Series Using Statistical Analysis

Booking describes Granomaly, a small service that computes prediction ranges for time-series business metrics using simple statistical techniques (z-score, standard deviation, percentiles, sliding windows) and writes upper/lower bounds back to Graphite. Grafana consumes those ranges for alerting. The article covers outlier-exclusion via z-score normalization, offset/correction/adjustment-factor metrics to improve alerts, a simulation page for fast tuning, and operational lessons about interpreting anomalies and breaking metrics down for diagnosis.
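
The statistical core described above (drop outliers by z-score, then derive upper and lower bounds from a sliding window) might look roughly like the sketch below. It is not Granomaly's code: the function names, the default `z_cutoff` and `band` values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def prediction_range(window, z_cutoff=3.0, band=2.0):
    """Return (lower, upper) bounds computed from a window of recent values.

    Values whose z-score exceeds `z_cutoff` are excluded first, so a single
    spike does not widen the range. Both parameters are hypothetical defaults.
    """
    mu, sigma = mean(window), pstdev(window)
    if sigma > 0:
        kept = [x for x in window if abs(x - mu) / sigma <= z_cutoff]
    else:
        kept = list(window)  # all values identical, nothing to exclude
    mu, sigma = mean(kept), pstdev(kept)
    return mu - band * sigma, mu + band * sigma


def rolling_ranges(series, size):
    """Bounds for each point, computed from the `size` values before it."""
    return [prediction_range(series[i - size:i]) for i in range(size, len(series))]
```

One design consequence worth noting: with the population standard deviation, no value in a window of n points can have a z-score above sqrt(n - 1), so a cutoff of 3 only excludes anything once the window holds more than 10 values.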

Booking, 8 months ago

Fitting Scrum for Software Development — Part II

A practical guide on adapting Scrum for software development: break PBIs by acceptance criteria (not implementation activities), use feature flags to keep stories independently shippable, improve board visibility with colored JIRA cards, and adopt engineering practices like trunk-based development and continuous delivery. Also covers ticket naming and balancing features, bugs, and technical improvements.

Booking, 9 months ago

Use Compression, Luke: Cut 20% of the Cloud Cost with a Single Code Change!

Booking.com SREs reduced cross-AZ data transfer and cut ~20% of cloud cost by enabling and testing gRPC compression in Grafana Mimir on AWS EKS. They evaluated zstd (better compression, but at a higher memory/CPU cost) and s2 (good compression with lower memory/CPU impact), measured the effect on network traffic, latency, CPU, and memory, and contributed compression support upstream (s2 is included in Mimir 2.15.0).

Booking, 9 months ago

Quick Steps for a Scrum Team to Improve the Process

Practical guidance for Scrum teams: keep stand-ups concise by "walking the board," prioritize and visibly flag blockers (use JIRA flags), keep tickets focused on value (not activity items), automate CI/CD-related tasks, and emphasize testing and a clear Definition of Done to speed delivery and reduce waste.

Booking, 11 months ago

Modernizing a Legacy Endpoint and Why It’s Worth It: a Step-by-Step Guide

Booking.com migrated a 14-year-old Update Management API out of a Perl monolith into a new Java service via a phased, incremental approach: UML visualization, stakeholder mapping, horizontal/vertical decomposition, deprecating obsolete fields (validated with A/B/multivariate experiments), and platform-specific modularization (iOS/Android). They addressed unexpected dependencies (deeplinking, marketing DB updates), added unit/integration tests and alerting/monitoring, and measured significant wins: ~30% latency improvement, 77% smaller payloads, ~1M fewer DB fetches/day, and 50% CPU reduction. The post highlights lessons on incremental migration, collaboration, and revisiting legacy code.

Booking, 11 months ago

Hexagonal Architecture: A Practical Guide

A practical guide to Hexagonal Architecture (ports-and-adapters) and Domain-Driven Design with a Java + Spring example (a bar system). The article explains domain-centric structure, inbound/outbound ports and adapters, shows code snippets, and recommends using ArchUnit and dependency injection to enforce boundaries and improve testability (integrated into CI/CD).
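
The ports-and-adapters split can be sketched in a few lines. The article's own example is in Java with Spring; the version below is a Python analogue written for this listing, and every name in it (`Order`, `OrderRepository`, `PlaceOrder`, `InMemoryOrderRepository`) is hypothetical rather than taken from the article.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Order:
    """Domain object for a hypothetical bar system."""
    drink: str
    quantity: int


class OrderRepository(Protocol):
    """Outbound port: the domain states what it needs, not how it is stored."""

    def save(self, order: Order) -> None: ...


class PlaceOrder:
    """Use case behind an inbound port; depends only on the outbound port."""

    def __init__(self, repository: OrderRepository) -> None:
        self._repository = repository  # injected, so adapters are swappable

    def __call__(self, drink: str, quantity: int) -> Order:
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        order = Order(drink, quantity)
        self._repository.save(order)
        return order


class InMemoryOrderRepository:
    """Outbound adapter used for tests; a database adapter would replace it."""

    def __init__(self) -> None:
        self.orders: list[Order] = []

    def save(self, order: Order) -> None:
        self.orders.append(order)
```

Because `PlaceOrder` receives its repository through the constructor, a test can pass the in-memory adapter while production wiring passes a persistent one, which is the testability benefit the summary refers to.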

Booking, 11 months ago

Building the Future of Content: Inside Booking.com’s Intelligent Content Enrichment Platform

Booking.com's Content Intelligence Platform consumes photo/text streams (Kafka) and uses Apache Flink to run hundreds of ML models (image captioning, scores, tags, embeddings) in realtime and backfill modes. Results are persisted to relational and vector databases and a data lake and exposed via an HTTP API so product teams can fetch model outputs with low latency.

Booking, 11 months ago

Self-Serve Platform for Scalable ML Recommendations

Booking.com describes its self-serve Recommendation Platform (RecP): a standardized recommendation pipeline and component registry that let teams compose prediction graphs, reuse features and models, apply real-time filtering/enrichment, and scale experiment-driven ML recommendation strategies across multiple use cases.

Booking, 1 year ago

The Engineering Behind Booking.com’s Ranking Platform | A System Overview

Booking.com's Ranking platform is a production-scale ML-driven service that personalizes search results. Data from OLTP tables and Kafka streams flows into a data warehouse and feature store (batch and realtime features); models are trained and served via a dedicated ML platform cluster. The Accommodations Ranking service runs across multiple Kubernetes clusters (Java service behind Nginx, Dropwizard endpoints) and addresses strict p999 latency, fan-out from sharded availability workers, payload chunking, and inference optimization (quantization, pruning, hardware acceleration), with fallbacks and multi-stage ranking to maintain availability.

Booking, 1 year ago

Leverage graph technology for real-time Fraud Detection and Prevention

Booking.com describes using graph technology (JanusGraph with a Cassandra backend) to represent linked identifiers for real-time fraud detection. The system builds local graphs per request (using breadth-first search), computes graph features (counts, hops_to_fraud), and returns them to fraud-prediction models and experts, achieving p99 latencies around 300 ms.
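
To make the per-request local graph concrete, here is a minimal sketch of a bounded breadth-first search that returns a node count and a `hops_to_fraud` feature. It runs on an in-memory edge list rather than JanusGraph, and the function name, the `max_hops` limit, and the `node_count` feature name are assumptions; only `hops_to_fraud` is named in the summary.

```python
from collections import deque


def local_graph_features(edges, start, fraud_nodes, max_hops=3):
    """Explore identifiers linked to `start` and compute simple graph features."""
    # Build an undirected adjacency map from (identifier, identifier) pairs.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search, so the first time a node is seen is its hop count.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)

    fraud_hops = [d for n, d in distance.items() if n in fraud_nodes]
    return {
        "node_count": len(distance),
        "hops_to_fraud": min(fraud_hops) if fraud_hops else None,
    }
```

Capping the search depth keeps the local graph small, which is one plausible way to keep per-request latency bounded.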

Booking, 1 year ago

Events: The 4th pillar of Booking.com’s Observability platform

Booking.com describes its proprietary 'Events' observability system — a key-value event model produced by an in-house events library, routed via a host-level event-proxy into Kafka, and consumed by multiple consumers (tracing -> Honeycomb, APM metrics -> Graphite, failed-events -> Elasticsearch). The post explains why Events are useful, the architecture (Kubernetes and bare-metal hosts, Kafka pipelines), challenges of a proprietary format, and why Booking.com is evaluating a shift toward OpenTelemetry.

Booking, 1 year ago

BazelDay Amsterdam 2024 at Booking.com

Recap of BazelDay Amsterdam 2024 hosted by Booking.com and EngFlow, covering talks on Bazel adoption at Booking.com and Salesforce, reproducible cloud-based developer environments for Bazel (EngFlow), and JetBrains' IntelliJ Bazel plugin updates.

Booking, 1 year ago

Lessons in adopting Airflow

Booking’s AdTech team migrated on-premise Perl/MySQL workflows to Apache Airflow running on Google Cloud Composer. The post shares practical lessons: creating a local dockerized Airflow image to match Composer for fast developer iteration, tuning Celery concurrency and scheduler settings to avoid queued/stalled tasks and CPU spikes, delegating heavy computation to Dataproc (managed Spark) via built-in Dataproc operators, and using GCP IAM service-account impersonation + Application Default Credentials for secure access.

Booking, 1 year ago

How (Not) to Implement DORA Metrics

A satirical how-not-to guide to implementing DORA metrics. The author explains the DORA metrics, lists common anti-patterns teams use to avoid real improvement (waiting for tooling, pushing large releases, blaming humans, over-relying on slow end-to-end tests, aiming for 100% reliability), and argues that doing the opposite (data-driven retros, smaller releases, CI/CD, balanced test pyramid, SLOs) leads to better software delivery performance.
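
For readers who want to see what the data-driven side can look like, the sketch below computes three of the four DORA metrics (deployment frequency, lead time for changes, change failure rate) from a list of deployment records. Time to restore service is left out because it needs incident data. The record shape and function name are hypothetical, and the post itself does not prescribe this calculation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False  # True if the change caused a failure in production


def dora_summary(deployments: list[Deployment], period_days: int) -> dict:
    """Summarise delivery performance over a period of `period_days` days."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": sum(d.failed for d in deployments) / len(deployments),
    }
```

Using the median rather than the mean for lead time is a choice made here so that one unusually slow change does not dominate the figure.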

Booking, 1 year ago

Unlocking observability: Structured logging in Spring Boot

Tutorial showing how to add structured JSON logging to a Spring Boot (Java) application using SLF4J + Logback and MDC for contextual fields, then ingest logs with Logstash into Elasticsearch and analyze/visualize them in Kibana (ELK stack). The article includes a companion demo project (Jamboree) and notes threading/MDC cleanup concerns and running the stack via docker compose.

Booking, 1 year ago

DORA Metrics At Work

A Booking.com fintech team doubled its delivery performance (DORA metrics) in a year without adding headcount by focusing on CI/CD improvements (automating deployments, trunk deployments, removing manual canaries), raising test coverage and test automation, working in small batches, improving code review turnaround, and migrating key UI pages to microfrontends. The gains came from process, culture, and quality improvements rather than new resources.

Booking, 1 year ago

Measuring mobile apps performance in production

Booking.com's engineering team describes how it measures mobile app performance in production: it defines precise metrics (app startup time, TTI, TTFR, and rendering Freeze Time), explains why third-party tools did not meet its needs, and recounts rewriting its monitoring libraries from Objective-C/Java to Swift/Kotlin and open-sourcing PerformanceSuite for iOS and Android on GitHub.

Booking, 2 years ago

Measuring Technical Debt to Avoid the Boiling Frog Syndrome

The article defines technical debt, explains its origins and risks (the "boiling frog" effect), and argues for making debt visible using measurable health and improvement indicators. It lists practical metrics (e.g., WTFs per minute, code smells, automated test coverage, documentation coverage, effort on deprecated components, unplanned work, defects, vulnerabilities, estimated effort to pay debt) and mentions tools/practices to support measurement (SonarQube, JaCoCo, OWASP dependency-check, CI/CD integration, Testing on the Toilet). The author emphasizes team culture, early corrective actions, and tracking progress when reducing debt.