Grab engineering blog

Grab · 2 days ago

How We Built a Custom Vision LLM to Improve Document Processing at Grab

Grab developed a specialized Vision LLM pipeline to improve OCR and key information extraction for Southeast Asian documents. They evaluated Qwen2-VL 2B, generated synthetic OCR data, built an auto-labeling and preprocessing platform (Documint), experimented with LoRA and full fine-tuning, and then built a custom ~1B model by combining a strong vision encoder with a compact decoder. The custom model achieves accuracy close to that of the 2B model with much lower latency and improved performance on Thai and Vietnamese scripts.

Grab · 1 week ago

Machine-learning predictive autoscaling for Flink

Grab built a machine-learning driven predictive autoscaler for Flink that forecasts Kafka source throughput and maps it to required TaskManager CPU using a time-series forecasting model plus a regression-based resource predictor. By vertically scaling CPU ahead of demand, the system avoids reactive restart spikes and scaling spirals, improves stability, and has reduced cloud CPU cost by roughly 35% or more. It has been rolled out to a majority of applicable pipelines and will expand to memory tuning and improved model workflows.
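
The two-stage idea in this summary (forecast source throughput, then map the forecast to CPU) can be sketched roughly as follows. This is an illustration under stated assumptions, not Grab's implementation: a seasonal-naive forecast and a one-variable least-squares fit stand in for whatever models the team actually uses, and every name and number here (`fit_line`, `forecast_next`, `required_cpu`, the 20% headroom, the sample data) is hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def forecast_next(history, season):
    """Seasonal-naive forecast: reuse the value observed one season ago."""
    return history[-season]


def required_cpu(throughput, slope, intercept, headroom=1.2):
    """Map forecast throughput to CPU cores, padded by a safety margin."""
    return (slope * throughput + intercept) * headroom


# Hypothetical training data: (records/s, CPU cores actually used).
observed_throughput = [1000, 2000, 3000, 4000]
observed_cpu = [1.0, 2.0, 3.0, 4.0]
slope, intercept = fit_line(observed_throughput, observed_cpu)

# Hypothetical throughput samples with a season of 4 samples.
history = [1000, 3000, 4000, 2000, 1100, 3100, 4100, 2100]
predicted = forecast_next(history, season=4)
cpu_target = required_cpu(predicted, slope, intercept)
```

The point of the split is that the CPU target is computed from the forecast rather than from current load, so a resize can be requested before the traffic arrives.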

Grab · 2 weeks ago

Modernising Grab’s model serving platform with NVIDIA Triton Inference Server

Grab migrated its Catwalk model-serving platform to NVIDIA Triton to reduce technical debt and improve inference performance and cost. They built a Triton manager (server manager + proxy) to provide a backward-compatible, drop-in replacement enabling zero-downtime rollouts. Benchmarks showed large throughput and latency gains (e.g., transformer p90 dropped notably), and early rollouts delivered substantial cost savings and stability improvements.

Grab · 1 month ago

Highly concurrent in-memory counter in GoLang

Grab describes designing a highly concurrent in-memory counter in Go to batch updates to the database, comparing Mutex-based locking vs sync.Map atomic operations. The sync.Map approach improved concurrency (~30% faster in benchmarks) and materially reduced DB QPS and CPU usage in production. The article includes implementation details, code patterns (LoadOrStore, CompareAndSwap, LoadAndDelete), and benchmark results.

Grab · 1 month ago

User foundation models for Grab

Grab built a bespoke foundation model that jointly learns from tabular user attributes and time-series clickstream data using a transformer with modality-specific adapters. They pre-train the model with masked reconstruction and next-action prediction, handle massive ID vocabularies via hierarchical classification, and extract long-term and short-term embeddings for many downstream tasks. The post also details production challenges and solutions—terabyte-scale training, CPU/GPU workload separation, and distributed batch inference with Ray to generate millions of daily embeddings.

Grab · 1 month ago

Powering Partner Gateway metrics with Apache Pinot

Grab describes powering Partner Gateway time-series metrics using Apache Pinot. The article walks through the ingestion pipeline (Datadog -> Kafka -> protobuf -> Flink -> Pinot), challenges with large-scale aggregation queries and timeouts, and optimizations (partitioning by metric, rounded time columns, and using Pinot's star-tree index) that dramatically reduced query latency at the cost of increased storage (with AWS gp3 EBS cost analysis).
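
One way to read the "rounded time columns" optimization is that each timestamp is floored to a fixed bucket so that aggregation groups on a small set of distinct values. The sketch below illustrates that reading only: the 5-minute bucket size, the row layout, and the function names are assumptions, not details from the article, and it does not use any Pinot API.

```python
from collections import defaultdict

FIVE_MINUTES_MS = 5 * 60 * 1000  # assumed bucket size


def round_down(ts_ms, bucket_ms=FIVE_MINUTES_MS):
    """Floor an epoch-millisecond timestamp to the start of its bucket."""
    return ts_ms - ts_ms % bucket_ms


def aggregate(rows, bucket_ms=FIVE_MINUTES_MS):
    """Sum values per (metric, rounded timestamp)."""
    totals = defaultdict(float)
    for metric, ts_ms, value in rows:
        totals[(metric, round_down(ts_ms, bucket_ms))] += value
    return dict(totals)


# Hypothetical rows: (metric name, epoch milliseconds, value).
rows = [
    ("requests", 600_000, 3.0),
    ("requests", 725_000, 2.0),
    ("requests", 900_000, 5.0),
    ("errors", 610_000, 1.0),
]
buckets = aggregate(rows)
```

Grouping on the rounded column means a query touches one pre-aggregated value per bucket instead of every raw data point, which is also the kind of dimension a pre-aggregating index can exploit.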

Grab · 1 month ago

Taming the monorepo beast: Our journey to a leaner, faster GitLab repo

Grab’s infra team pruned and migrated a decade-old Go monorepo by keeping tagged releases and recent history and building a custom, resumable migration script with robust Git LFS handling. The change reduced commits by ~99.9%, cut storage by ~59%, and improved Gitaly replication latency from minutes to milliseconds, yielding much faster clones, a snappier GitLab UI, and better CI scalability.

Grab · 2 months ago

Data mesh at Grab part I: Building trust through certification

Grab describes its Signals Marketplace data mesh and data certification program to improve trust and reusability of data products by decentralising ownership (BDOs/TDOs), enforcing data contracts and SLOs, automating Data Production Incidents for contract breaches, and using tooling and dashboards (Genchi, Hubble/DataHub). Outcomes include 75% of queries hitting certified assets, accelerated deprecation of redundant tables, and improved cross-domain reuse.

Grab · 3 months ago

The evolution of Grab's machine learning feature store

Grab redesigned its feature store from a feature-centric Amphawa to a feature-table centric platform: data scientists build feature tables as Parquet in a data lake (S3) using SQL/Python on Spark, register and deploy collections, and then atomically ingest (reverse ETL) into an online serving layer implemented on Amazon Aurora for Postgres. The new design focuses on atomic updates, read/write isolation, decentralised deployments, a thin SDK around the DB driver, and operational choices (serverless writer, provisioned readers, AZ-distributed storage) to meet scalability and availability requirements.

Grab · 3 months ago

Grab's service mesh evolution: From Consul to Istio

Grab migrated its service mesh from Consul (with a fallback called Catcher) to Istio to address single-point-of-failure risks and to gain robust multi-cluster, multi-cloud support, advanced traffic management, and improved observability. The team designed a resilient Istio deployment using multiple dedicated Kubernetes control planes (active-active pairs), executed staged cross-cloud migrations (GCP to AWS) for HTTP and gRPC traffic, and emphasized monitoring, gradual traffic shifting and rollback capabilities.

Grab · 4 months ago

DispatchGym: Grab’s reinforcement learning research framework

DispatchGym is Grab’s Python-based reinforcement learning research framework for dispatch systems. It combines a modular, directionally accurate simulator with an RL optimization interface (Gymnasium API), uses Numba for simulation speed-ups, and runs experiments via a CLI that spawns Spark jobs. It was used to prototype contextual bandit approaches (linear, neural-linear, Gaussian-process) and sampling strategies (epsilon-greedy, Thompson sampling, SquareCB, FastCB) to tune dispatch hyperparameters.
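
Of the sampling strategies listed, epsilon-greedy is the simplest to show. The sketch below is a plain (non-contextual) multi-armed version, so it leaves out the context features that the contextual bandit models in the post would use. The class name and its parameters are hypothetical and are not part of DispatchGym.

```python
import random


class EpsilonGreedy:
    """Pick a random arm with probability epsilon, else the best-known arm."""

    def __init__(self, n_arms, epsilon, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # Exploit: arm with the highest running mean reward.
        return max(range(len(self.means)), key=self.means.__getitem__)

    def update(self, arm, reward):
        """Incrementally update the running mean reward of the chosen arm."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

Each arm could stand for one candidate hyperparameter setting, with the reward taken from a simulator run. With `epsilon=0` the policy is purely greedy, which makes it easy to check in isolation.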

Grab · 4 months ago

Counter Service: How we rewrote it in Rust

Grab rewrote a high-QPS Counter Service from Golang to Rust as a fresh, black-box rewrite. They evaluated Rust's ecosystem (gRPC/tonic, actix-web, redis clients, scylla driver, nom, etc.), handled gaps by reimplementing needed parsers, and overcame Rust-specific pitfalls (borrow checker, async vs blocking calls). The Rust service delivered similar P99 latency but used ~20% of the CPU resources (≈5x resource savings, ~80% cost reduction). The team concluded the rewrite was worthwhile for this service and recommends Rust for new microservices if teams invest in learning Rust's async and lifetime concepts.

Grab · 4 months ago

The complete stream processing journey on FlinkSQL

Grab describes replacing per-user Zeppelin notebooks with a shared FlinkSQL gateway and custom control plane to provide an interactive SQL-based stream processing platform. The system uses a three-layer architecture (compute, integration, query) to submit SQL queries against Kafka (cataloged via Hive Metastore), manages sessions via REST APIs, exposes headless endpoints for programmatic queries, and offers a low-code UI and connectors to productionise SQL-expressed Flink pipelines. The change reduced cold-start times and streamlined deployment of streaming jobs.

Grab · 5 months ago

Effortless enterprise authentication at Grab: Dex in action

Grab implemented a centralised identity solution using OpenID Connect and the CNCF project Dex to unify authentication and authorisation across internal and external apps. They integrated Dex with their Concedo R2PM system, implemented token exchange (minting + trust relationships) for service-to-service identity, and added an IdP failover "kill switch" to improve resilience and standardise sign-on.

Grab · 5 months ago

From failure to success: The birth of GrabGPT, Grab’s internal ChatGPT

The author describes building an internal ChatGPT-like tool (GrabGPT) after an initial attempt to create a documentation-driven support chatbot failed due to LLM token limits and embedding search shortcomings. The post explains that they used an existing chatbot-ui framework, added Google login for authentication, and deployed the tool using Grab’s model-serving platform (Catwalk). GrabGPT is model-agnostic (supports multiple LLM providers), runs on a private route to keep data inside the company, and logs interactions for auditability. The write-up focuses on engineering trade-offs around token limits, retrieval approaches, deployment, and access control rather than implementation-level code or architecture diagrams.

Grab · 6 months ago

Streamlining RiskOps with the SOP agent framework

The post describes an SOP-driven LLM agent framework that automates account takeover (ATO) investigations by converting human workflows into machine-executable SOPs. SOPs are authored in an indented natural-language format with @function_name markers to indicate external calls and integration points. A dynamic execution engine, comprising an SOP planner and a Worker Agent, drives a DFS-inspired structured execution: the planner generates steps and API calls while the Worker Agent executes JSON-formatted SOP steps, runs SQL queries, invokes APIs, and stores results. The framework iteratively evaluates step-specific criteria and synthesises a final report summarising findings and decisions.

Grab · 6 months ago

Introducing the SOP-driven LLM agent frameworks

The post describes an SOP-driven LLM agent framework that represents standard operating procedures as an indented tree of nodes for actions and decision points, supported by an SOP editor for non-technical users. A planner module uses a depth-first search algorithm with backtracking to traverse the SOP tree while an LLM-powered worker agent executes steps, applying context compression and API tooling limits to reduce hallucination. The system supports dynamic branching, multilingual user agents, a state stack for pausing/resuming execution, and a plugin system for integrating external APIs and tooling. It also describes a GRAG retrieval pipeline, workflow chaining, and security measures including granular access control and execution logging for explainability.
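
The depth-first traversal with backtracking described in this summary can be sketched with an explicit stack. This is a minimal illustration, not the framework's code: the `Node` type, the `condition` guard, and the example SOP are all hypothetical, and the LLM worker is reduced to recording each step's name.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    name: str
    # Hypothetical guard: the branch is entered only if this returns True.
    condition: Callable[[dict], bool] = lambda ctx: True
    children: list["Node"] = field(default_factory=list)


def execute(root, ctx):
    """Depth-first walk over the SOP tree; returns the steps that ran."""
    executed = []
    stack = [root]  # explicit state stack
    while stack:
        node = stack.pop()
        if not node.condition(ctx):
            continue  # backtrack: skip this branch, resume with the next sibling
        executed.append(node.name)  # a real worker agent would act here
        stack.extend(reversed(node.children))  # keep children in written order
    return executed


# Hypothetical SOP: the escalation branch only applies to high-risk cases.
tree = Node("root", children=[
    Node("verify_identity", children=[Node("check_device")]),
    Node("escalate", condition=lambda ctx: ctx["risk"] > 0.8,
         children=[Node("freeze_account")]),
    Node("write_report"),
])
```

Because the pending work lives in an ordinary list rather than on the call stack, the traversal state could in principle be saved and restored, which is one plausible way to support the pause and resume behaviour the summary mentions.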

Grab · 6 months ago

Evaluating performance impact of removing Redis-cache from a Scylla-backed service

Grab investigated periodic 15-minute Scylla QPS spikes caused by a Redis caching pattern (15-minute truncation + 5-minute TTL). They progressively disabled Redis for configurations, relying on Scylla's native cache and adding a daytime major compaction job. Result: 15-minute spikes disappeared, Scylla-bound QPS doubled while Scylla cache hit rate rose (≈30%→≈70%), latency returned to acceptable levels after compaction scheduling, CPU/memory remained stable, and they removed an expensive Redis component (~25% of cost) without degrading service performance.

Grab · 7 months ago

Facilitating Docs-as-Code implementation for users unfamiliar with Markdown

Grab implemented a WYSIWYG TechDocs editor in their Backstage portal (Helix TechDocs) to make Docs-as-Code accessible to non-engineers. They built the editor using the Lexical framework, integrated it with GitLab via OAuth2 and merge-request workflows, added support for MkDocs-based sites and diagram tools (Kroki, Mermaid, draw.io, Excalidraw), and rolled features out in phases (basic editor, independent doc creation, advanced editor features), including live preview and conflict-mitigation UX.

Grab · 7 months ago

Improving Hugo stability and addressing oncall challenges through automation

Grab’s engineering team enhanced stability and reduced on-call toil for Hugo (their data-ingestion platform) by building an automation stack: Signal collection (failure callbacks, SLA alerts, data-quality tests), a Diagnosis module that uses signals instead of heavy log parsing, an RCA table, an Auto-resolution framework (retry/backoff handlers), a Data Health API (integrated with Kinabalu), and a Data Health Workbench dashboard. Implementation notes cover Airflow callbacks, Genchi for data-quality checks, parallel diagnosis constrained by Kafka partitions, avoidance of parsing Spark/Airflow logs, and planned Flink integration. Outcomes include improved visibility, faster resolution, and reduced on-call workload.
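
A retry handler with exponential backoff, the kind of handler the auto-resolution framework is said to use, might look roughly like this. The function name, defaults, and the injectable `sleep` parameter are assumptions made for illustration; the article's actual handlers are not shown in this summary.

```python
import time


def retry_with_backoff(task, max_attempts=4, base_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Run task(); on failure wait base_delay, then base_delay * factor, etc.

    Re-raises the last exception once max_attempts is exhausted.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(delay)
            delay *= factor
```

Passing `sleep` as a parameter keeps the handler testable without real waiting. A production version would likely retry only on errors diagnosed as transient rather than on every exception.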