Klarna engineering blog

Klarna, 5 months ago

How I stopped worrying and learned to love Cloud Inventory

Klarna built a Cloud Inventory system: a graph-database-backed catalog (using JupiterOne) populated by ETL/sync jobs from many internal sources to unify cloud assets across thousands of AWS accounts. They express compliance/security/operational expectations as "target specifications" (query-driven reports stored as .md files), which enable automated controls, faster rollouts, cost savings (e.g., snapshot cleanup), and better ownership. The article outlines architecture, lessons learned, and next steps (expand sources, use AI, add positive evidence).

Klarna, 6 months ago

Learnings from a Klarna Engineer on feature development

An engineer at Klarna shares practical learnings from building an Account Recovery feature, covering documentation, open communication (Slack), collaborative testing (Bug Bashes), product-led engineering and demos, pragmatic prototyping (in-memory databases), code hygiene (JS -> TypeScript migration), teaching/knowledge-sharing, time management, and developer experience improvements.

Klarna, 7 months ago

How micro should your microservices be?

Klarna describes moving from a monolith to many microservices (one per payment method), running into a distributed‑monolith problem, trying a monorepo, and ultimately adopting a DDD‑guided modular monolith with configurable flows. The modular approach reduced duplication, improved observability and time‑to‑market, and scaled to handle over a million orders per day (including the Black Friday peak). The article argues that a modular monolith can combine the simplicity of a monolith with modularity, and can later be split into microservices if needed.

Klarna, 7 months ago

Guardians of consistency

Klarna engineers describe how they migrated from Mnesia to Postgres and implemented Mnesia-like serializable transaction guarantees on Postgres by combining SELECT FOR SHARE and non-blocking advisory locks while running under read committed. They debug false-positive serialization errors in Postgres, explain their locking strategy, and validate correctness with hermitage-based tests and extensive property-based testing in CI.

Klarna, 8 months ago

The fellowship of the forgotten

Klarna describes migrating its large Erlang-based KRED application from Mnesia (with mnesia_eleveldb/leveldb) to Postgres with zero downtime. The article covers the decision process, the kdb compatibility layer, typing the schema, bidirectional replication design (Mnesia↔Postgres), replication validation, the final cutover, and the move from prepared transactions to logical-decoding-based transaction log messages (requiring Aurora Postgres upgrade).

Klarna, 1 year ago

Automating the Klarna Card Ownership Fees System using AWS Step Functions

Klarna engineers automated their monthly Klarna Card ownership fee process by implementing an AWS Step Functions state machine (deployed via AWS CDK/CloudFormation) that orchestrates Glue PySpark jobs, SQS, Lambda decision logic, and DynamoDB, and is triggered on a schedule via EventBridge/Scheduler. The automation reduced manual work, improved reliability and monitoring (Datadog + OpsGenie), and followed IAM least-privilege practices.

Klarna, 1 year ago

Peak Season 2023: How Klarna achieved consistent success

Klarna’s Peak Season 2023 post describes how the company prepared for and executed a high-traffic season through centralized readiness tooling, capacity prediction (Kapacity), rigorous performance testing (FLAS and Baseline approaches, load/spike/overload tests), runbooks and observability, DDoS fire drills, and cross-team organizational processes. The result was fewer incidents, optimal cloud costs, and an improved engineering experience.

Klarna, 1 year ago

Stop Misusing ROC Curve and GINI: Navigate Imbalanced Datasets with Confidence

The article explains why ROC_AUC and GINI can be misleading on imbalanced binary classification problems and argues that the Precision-Recall curve (PR_AUC) is often a more informative metric for evaluating performance on the minority class. It defines confusion-matrix terms, contrasts ROC and PR curves (TPR/FPR vs precision/recall), shows a concrete example comparing logistic regression and gradient boosting on a 2% default-rate dataset, and concludes with recommendations to choose metrics that fit the data and business problem.
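
The contrast between the two metrics can be sketched in a few lines of plain Python. This is not the article's example (which compares logistic regression and gradient boosting); the scores below are synthetic and hypothetical, chosen only to mimic a 2% positive rate, and PR_AUC is approximated here by average precision. On this data ROC_AUC comes out above 0.95 while average precision stays below 0.2, because the few positives are outranked by a small fraction of a very large negative class.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Hypothetical data: 980 negatives spread evenly over [0, 1), 20 positives near 0.95.
neg_scores = [i / 980 for i in range(980)]
pos_scores = [0.9501 + j * 1e-6 for j in range(20)]
labels = [0] * 980 + [1] * 20
scores = neg_scores + pos_scores

print(f"ROC AUC: {roc_auc(labels, scores):.3f}")
print(f"Average precision: {average_precision(labels, scores):.3f}")
```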

Klarna, 2 years ago

Overcoming the Hurdle of Unformatted Input: What I Learned From Building a ChatGPT Add-On for…

An engineer describes building a ChatGPT add-on for Google Workspace (Sheets), covering integration with the OpenAI API through Google Apps Script, challenges with token and execution-time limits, and handling raw CSV data by adding preprocessing and output-formatting modules. The add-on was released internally at Klarna, and the author plans to integrate LangChain (Python) next.

Klarna, 2 years ago

Introducing native E2E testing: Learnings from the Senior Engineering Program for Women

An engineer recounts delivering native end-to-end (E2E) testing for Klarna's mini apps (iOS/Android) as part of the Senior Engineering Program for Women. The post covers technical changes (TypeScript/tsconfig adjustments, Appium/Mocha/Webdriver.io adoption, dependency handling with Diglett, ESLint patterns, test collocation), outcomes for pipelines and developer experience, and learnings about career growth and the program's impact.

Klarna, 3 years ago

How we aligned 200 teams to monitor services with SLOs (2/2)

Klarna describes how its engineering team rolled out an internal SLO platform (OSLO) to ~200 teams, drove adoption through documentation, domain meetings, and compliance workflows, integrated AWS API Gateway and Datadog Burn Rate alerting, tuned suppression rules and thresholds, and achieved broad organizational buy-in to enable company-wide burn-rate alerts.

Klarna, 3 years ago

How we aligned 200 teams to monitor services with SLOs (1/2)

Klarna’s engineering team describes how they designed and drove adoption of an SLO-based monitoring platform (initially "SLOPs", later "OSLO") across ~200 teams. They prototyped automation using existing metrics, partnered with Datadog (SLO beta and Burn Rate alerting), used AWS ELB metrics polled from CloudWatch for pragmatic onboarding, made decisive choices (availability = non-5xx, p99 latency, 30-day window), and focused on stakeholder engagement and simplicity to achieve broad adoption.
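
As a rough illustration of the definitions above, availability can be computed as the share of non-5xx responses, and a burn rate as the observed error rate divided by the error budget. This is a minimal sketch using the generic burn-rate definition, not Klarna's or Datadog's implementation; the 99.9% target and the request counts are assumptions made for the example.

```python
def availability(total_requests, responses_5xx):
    """Availability as the fraction of responses that are not 5xx."""
    return 1.0 - responses_5xx / total_requests


def burn_rate(total_requests, responses_5xx, slo_target):
    """How fast the error budget is being consumed.

    A value of 1.0 means the budget would be used up exactly at the end of
    the SLO window (30 days in the summary above); higher values exhaust it sooner.
    """
    error_budget = 1.0 - slo_target
    error_rate = responses_5xx / total_requests
    return error_rate / error_budget


# Hypothetical numbers: 1,000,000 requests, 2,000 of them 5xx, 99.9% target.
print(f"availability: {availability(1_000_000, 2_000):.4f}")
print(f"burn rate: {burn_rate(1_000_000, 2_000, slo_target=0.999):.2f}")
```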

Klarna, 3 years ago

The secret to calculating features at decision time — in retrospect

Klarna describes implementing a Point-in-Time (retrospective) feature computation flow to recreate historical feature values used by their credit decision engine. They use Apache Spark (PySpark) in batch, run jobs on AWS Glue (versus EMR) to read AVRO events from S3, and architect a multi-layer Glue job pipeline that reduces data size before applying expensive Python UDF aggregations. The post covers scale and performance issues (terabytes of events, JVM vs Python worker overhead, shuffles), benefits (fast runtimes, code reuse, bootstrap of live state), and future plans (Glue 3.0, migrate to pandas UDFs).

Klarna, 3 years ago

Automating Resilience (2/2)

Klarna describes the implementation of a DB-backed "Reliable Tasks" system that schedules downstream calls inside DB transactions, marks and executes tasks in two phases across instances, handles retries and failure recovery, prioritizes tasks during peak load, and optimizes DB and thread-pool configurations (including DB trigger table changes and use of AWS Performance Insights). The article also covers observability tooling (metrics/alarming, API/UI) used to manage failed tasks.

Klarna, 3 years ago

Automating Resilience (1/2)

Klarna describes why their team built a custom SQL-backed "Reliable Tasks" system to automate retries and resilience for failed downstream calls. After evaluating generic job schedulers (Quartz, JS7, Spring @Scheduled + ShedLock), they implemented a DB-lock-based solution that provides exponential delayed retries, eventual consistency, idempotency, multi-instance coordination, metrics and alarms, tooling for inspecting and retrying failed tasks, and auditability. A follow-up article details the implementation.
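
One common way to schedule exponentially delayed retries is to double the wait after each failed attempt, up to a cap. The sketch below illustrates that general idea only; the function names, the 30-second base, the factor of 2, and the one-hour cap are all hypothetical and are not taken from the Reliable Tasks system.

```python
from datetime import datetime, timedelta


def retry_delay_seconds(attempt, base_seconds=30, factor=2, max_seconds=3600):
    """Delay before retry number `attempt` (1-based), capped at max_seconds."""
    return min(base_seconds * factor ** (attempt - 1), max_seconds)


def next_run_at(failed_at, attempt):
    """Timestamp at which a task that just failed should be retried."""
    return failed_at + timedelta(seconds=retry_delay_seconds(attempt))


# Print the schedule for the first five retries of a hypothetical task.
failed_at = datetime(2024, 1, 1, 12, 0, 0)
for attempt in range(1, 6):
    print(attempt, retry_delay_seconds(attempt), next_run_at(failed_at, attempt))
```

Storing the computed timestamp alongside the task row is what would let any instance pick the task up once it is due, which fits the multi-instance coordination the summary mentions.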

Klarna, 3 years ago

Why we strayed from our middleware stack for a micro-services framework called Steve

An engineer at Klarna describes building "Steve," a TypeScript/Node.js micro-services framework that replaces an Express-style middleware stack with dependency injection (built on Inversify). The post explains design goals (explicit relationships, loose coupling, strong typing, DI), the framework's main component types (Controller, Service, Client), connectors for request-scoped injections, testing benefits, and plans to open-source the project.

Klarna, 3 years ago

Get Closer to Machine Perception of the Web with the Klarna Product Page Dataset

The post introduces the Klarna Product Page Dataset, a benchmark of 51,000 real e-commerce product pages from over 8,000 merchants labelled with elements such as image, price, name, add-to-cart and go-to-cart buttons, plus a subject node for holistic context. It provides MHTML snapshots and an open-source WebTraversalLibrary for convenient rendering and snapshot manipulation, lowering the entry barrier for machine perception research on web pages. The dataset and tooling aim to accelerate experiments with graph neural networks, reinforcement learning agents and other ML models on complex, real-world DOM structures.

Klarna, 3 years ago

How to improve classification of e-commerce pages, incorporating multiple modalities

Klarna research evaluated multimodal classification of e‑commerce pages by combining screenshot images (EfficientNet B4 / CNN) and textual content (BERT). Models were pretrained and fine‑tuned on ~50k pages; the team compared single‑modality baselines to multimodal fusion approaches (weighted averaging of logits vs probabilities, and connecting nets via a linear layer). Multimodal fusion improved accuracy, and averaging logits proved more robust to errors in the mixing weight (alpha).
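
The two weighted-averaging strategies differ only in where the softmax is applied. The following is a minimal sketch of that difference, not the research team's code: the function names and the example logits are hypothetical, and assigning alpha to the image model (and 1 - alpha to the text model) is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(image_logits, text_logits, alpha):
    """Mix the raw logits first, then normalize once."""
    mixed = [alpha * zi + (1 - alpha) * zt
             for zi, zt in zip(image_logits, text_logits)]
    return softmax(mixed)


def fuse_probabilities(image_logits, text_logits, alpha):
    """Normalize each model's output first, then mix the probabilities."""
    p_image = softmax(image_logits)
    p_text = softmax(text_logits)
    return [alpha * pi + (1 - alpha) * pt for pi, pt in zip(p_image, p_text)]


# Hypothetical three-class outputs from an image model and a text model.
image_logits = [2.0, 0.5, -1.0]
text_logits = [0.1, 1.5, 0.3]
for alpha in (0.2, 0.5, 0.8):
    print(alpha,
          [round(p, 3) for p in fuse_logits(image_logits, text_logits, alpha)],
          [round(p, 3) for p in fuse_probabilities(image_logits, text_logits, alpha)])
```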

Klarna, 3 years ago

The hunt for the cluster-killer Erlang bug

Klarna’s post-mortem describes a production outage where a Kafka node failure triggered brod (the Erlang Kafka client) to buffer messages in a way that captured large process state. That led to exponential message-size growth when the producer attempted to send crash information, causing the BEAM VM's message-sizing code to lock up schedulers and request huge allocations. The result was loss of metrics, inability to open shells, OOM kills and cluster restarts. Fixes included a brod change (extracting the partition before creating the fun) and upstream Erlang/OTP discussion about preemptiveness and copying semantics.

Klarna, 4 years ago

How we publish the Klarna Point of Sale app, Part 5/5

Part 5 describes how Klarna distributes its native Point of Sale app to the Apple App Store and Google Play. It explains how the team uses Jenkins input steps for a go/no-go decision and release notes, invokes App Center's Distribute action via API (curl/jq/Groovy) to create a release, and then manually promotes the release in the stores (with Google Play Tracks noted as an automation option).