
Incident.io

Insights from engineers and leaders in the incident management space about incident response best practices, tooling, hiring, and much more.

46 posts

Using Claude to power up your onboarding 🚀

A new Incident.io engineer describes how they used Claude (an LLM) to accelerate onboarding: using CLAUDE.md files to document and explain architecture, asking Claude to diagram event-driven flows, searching past tickets and crafting Grafana queries during debugging, and building a notes workflow to capture learnings and update documentation.


Ready, steady, goa: our API setup

Incident.io describes switching from manually written Go HTTP handlers to a Goa-driven, design-first API setup. Using Goa's DSL, they generate transport code, validation, typed service interfaces, OpenAPI specs, and a pipeline that emits specialized clients (TypeScript for the frontend, mobile, and tests). The result is less boilerplate, a single source of truth across teams, and faster feature development.
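
For readers who haven't seen Goa, a design-first definition looks roughly like the sketch below; the service and type names are illustrative, not incident.io's actual design. Running Goa's code generator (goa gen) against a design package like this produces the HTTP transport, request validation, typed service interfaces, and OpenAPI spec the post describes.

```go
package design

import . "goa.design/goa/v3/dsl"

// Illustrative Goa design: one service with a single "list" method.
var _ = API("incidents_api", func() {
	Title("Incidents API")
})

var Incident = Type("Incident", func() {
	Attribute("id", String, "Unique incident ID")
	Attribute("name", String, "Human-readable incident name")
	Required("id", "name")
})

var _ = Service("incidents", func() {
	Method("list", func() {
		Payload(func() {
			Attribute("page_size", Int, "Maximum number of incidents to return")
		})
		Result(ArrayOf(Incident))
		HTTP(func() {
			GET("/incidents")
			Param("page_size")
			Response(StatusOK)
		})
	})
})
```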


Breaking through the Senior Engineer ceiling

A career-advice post for Senior engineers aiming for Staff/Principal levels that defines progression as a combination of experience, expertise, and wisdom. It outlines day-to-day behaviours and mindsets Senior+ engineers should demonstrate: technical leadership, multiplying others' impact, clear communication, composure under pressure (including incidents), company-first prioritisation, positivity and grit, and the ability to navigate ambiguity independently.


The quest for the five minute deploy

Incident.io migrated from a hosted CI provider to self-hosted Buildkite agents and reworked their test and caching strategies to cut build and deploy times. Key changes included provisioning large build servers, splitting test execution into a build-then-run flow for Ginkgo test binaries, mounting per-host Go build caches, adding Node module and Docker caches via Buildkite plugins and GCS, and iterating on resource sizing and bin-packing. The changes significantly reduced branch build, production deploy, and hotfix times (hotfixes now ship in under five minutes), though some operational challenges (IOPS limits, flaky tests, cache issues, zombie containers) remained.


Top engineering voices to follow in 2025

A curated, editorial list of notable engineering writers and speakers to follow in 2025, with short profiles explaining each person’s focus (leadership, platform engineering, SRE/observability, cloud, systems fundamentals, and developer experience). The post highlights why each voice is worth following and links to their blogs, newsletters, and social accounts.


How we're shipping faster with Claude Code and Git Worktrees

Incident.io describes how their team adopted Anthropic's Claude Code (a terminal-based AI coding assistant) alongside Git worktrees to run multiple isolated AI coding sessions in parallel. They share examples, including building a richer JavaScript editor, improving API client generation via Makefile edits, writing a custom worktree-manager bash function, and experimenting with voice-driven development, plus a vision for CI-driven ephemeral preview environments. The post focuses on workflow changes, tooling improvements, and how AI accelerates feature development.


Pager fatigue: Making the invisible work visible

Incident.io describes designing and iterating a Fatigue Score to quantify pager-related disruption for on-call engineers. The article covers motivation, interview-driven variable selection, data collection and calibration against perceived fatigue, daily/weekly dashboarding in Slack, and how EMs and teams use the score to surface invisible overnight work and manage workload and wellbeing.
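
The post doesn't publish an exact formula, but the shape of such a score is easy to sketch. The Go snippet below is a hypothetical illustration that simply weights pages by when they land and whether they woke someone up; the real score uses variables chosen from engineer interviews and calibrated against perceived fatigue.

```go
package fatigue

import "time"

// Page is a single pager notification received by an on-call engineer.
type Page struct {
	ReceivedAt time.Time
	WokeUp     bool // hypothetical flag: did the engineer report being woken?
}

// Score is a toy fatigue score: every page costs something, out-of-hours
// pages cost more, and overnight wake-ups cost the most.
func Score(pages []Page) float64 {
	var total float64
	for _, p := range pages {
		hour := p.ReceivedAt.Hour()
		switch {
		case p.WokeUp:
			total += 5 // overnight wake-ups dominate the score
		case hour < 8 || hour >= 18:
			total += 2 // out-of-hours interruptions
		default:
			total += 1 // in-hours pages still carry some cost
		}
	}
	return total
}
```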


Our simple-to-use incident post-mortem template

A practical, simple-to-use template and guidance for writing incident post-mortems—covering how to document incidents, determine root causes, and track follow-up actions to improve reliability and team processes.


Navigating the role of an incident commander

A practical guide to the incident commander role: how to rapidly assess incidents, assign clear roles, maintain transparent communication (including dedicated business communicators), use runbooks and integrations, implement intelligent notifications to reduce alert fatigue, track MTTR, and run training/simulations and post-incident reviews to improve response.


A seven-step framework for running incident debriefs

A practical seven‑step, blameless framework for incident debriefs covering: set a blame‑free tone; reconstruct a factual timeline; assess user impact; evaluate monitoring/alerts and response; extract actionable lessons; assign owners and due dates; finish positively and document follow‑up. Emphasizes improving observability, clear action items, and building a learning culture.


Debugging deadlocks in Postgres

A practical guide to diagnosing and fixing Postgres deadlocks. It explains locks and deadlocks, recommends designing transactions to acquire locks in a consistent order, and describes alternatives (explicit/advisory locks, smaller lock scope, retries). In a case study, an intermittent deadlock during a bulk upsert was traced via Postgres logs and trace IDs to unordered rows (from a Go map) causing different per-row lock acquisition orders; sorting the rows before the upsert resolved the issue.
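
The root cause generalises well beyond this case: Go map iteration order is randomised, so two concurrent bulk upserts fed from maps can lock the same rows in different orders. A minimal sketch of the remedy, with hypothetical table and column names:

```go
package store

import (
	"context"
	"database/sql"
	"sort"
)

// upsertCounts writes a map of ID -> count in a single transaction.
// Because Go map iteration order is randomised, two concurrent calls could
// otherwise acquire row locks in different orders and deadlock; sorting the
// IDs first guarantees a consistent lock acquisition order.
func upsertCounts(ctx context.Context, db *sql.DB, counts map[string]int) error {
	ids := make([]string, 0, len(counts))
	for id := range counts {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	for _, id := range ids {
		if _, err := tx.ExecContext(ctx,
			`INSERT INTO counters (id, count) VALUES ($1, $2)
			 ON CONFLICT (id) DO UPDATE SET count = EXCLUDED.count`,
			id, counts[id],
		); err != nil {
			return err
		}
	}
	return tx.Commit()
}
```

The same ordering rule applies to any transaction that touches multiple rows: pick a deterministic order (primary key is the usual choice) and concurrent writers will queue rather than deadlock.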


How data habits help build a data culture

The article describes practical strategies incident.io uses to build a data-driven culture by codifying small, repeatable "data habits": embedding dashboards into stakeholders' workflows (e.g., Salesforce embeds), pushing well-crafted alerts into Slack using reverse-ETL (Hightouch), running weekly insight presentations, maintaining shared data channels, and using office dashboards. The emphasis is on people and processes that make data accessible, relevant, and actionable rather than on technology alone.


The flight plan that brought UK airspace to its knees

A detailed postmortem of the Aug 28, 2023 NATS FPRSA-R failure: a valid ADEXP flight plan with duplicated waypoint codes triggered an edge-case bug that caused both primary and hot-standby converters to enter maintenance mode. The incident halted automated flight-plan processing for several hours, forced manual workarounds, reduced airspace capacity, and exposed weaknesses in escalation, comms, and system mapping. The report highlights technical and social resilience lessons and recommendations for improved testing, escalation procedures, and incident training.


How we page ourselves if incident.io goes down

Incident.io explains how they designed and tested a "dead man’s switch" backup paging system so engineers still get paged if their primary alerting platform degrades. The post covers the alert flow, implementation details (dual-send with a 1-minute delay and ack propagation), stress testing via game days (100x load), and ongoing smoke/testing practices to validate the fallback.
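
The mechanics of a delayed dual-send are simple to outline. The sketch below is illustrative rather than incident.io's implementation (the Pager interface is hypothetical): the backup page is scheduled as soon as the alert fires and cancelled only if an acknowledgement arrives from the primary platform within the delay.

```go
package paging

import (
	"context"
	"time"
)

// Pager is a hypothetical interface over a paging provider.
type Pager interface {
	Page(ctx context.Context, alert string) error
}

// DualSend pages the primary provider and schedules the backup provider to
// fire after delay (one minute in the post). If the primary acknowledges the
// alert in time, the backup page is cancelled; otherwise the backup fires and
// the engineer still gets paged.
func DualSend(ctx context.Context, primary, backup Pager, alert string, delay time.Duration, acked <-chan struct{}) error {
	if err := primary.Page(ctx, alert); err != nil {
		// Primary looks degraded: page the backup immediately.
		return backup.Page(ctx, alert)
	}

	select {
	case <-acked:
		return nil // primary delivered and was acknowledged: cancel the backup
	case <-time.After(delay):
		return backup.Page(ctx, alert) // no ack in time: fall back
	case <-ctx.Done():
		return ctx.Err()
	}
}
```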


Organizing ownership: How we assign errors in our monolith

Incident.io describes how they encode team ownership inside their monolith (module files per package), enforce it in CI, generate a CODEOWNERS mapping, and use middleware that inspects stack traces to tag errors with the owning team so monitoring/alerting routes pages to the right on-call team — reducing pager noise while keeping a monolith manageable.
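
A stripped-down version of the idea, with a hypothetical package-to-team mapping: walk the captured stack frames, find the first frame that falls under an owned package, and tag the error with that team before it reaches monitoring.

```go
package ownership

import (
	"runtime"
	"strings"
)

// owners maps package path prefixes to owning teams. In the post this
// mapping is generated from per-package module files and enforced in CI;
// the entries here are hypothetical.
var owners = map[string]string{
	"github.com/example/app/alerts":    "team-alerts",
	"github.com/example/app/workflows": "team-workflows",
}

// TeamFor inspects the program counters of a captured stack trace and
// returns the first team whose packages appear in it, so the error can be
// tagged and the page routed to the right on-call rotation.
func TeamFor(pcs []uintptr) (string, bool) {
	frames := runtime.CallersFrames(pcs)
	for {
		frame, more := frames.Next()
		for prefix, team := range owners {
			if strings.HasPrefix(frame.Function, prefix) {
				return team, true
			}
		}
		if !more {
			return "", false
		}
	}
}
```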


How we handle sensitive data in BigQuery

The post describes incident.io's approach to protecting sensitive data in BigQuery: default masking of all new columns via BigQuery policy tags, an automated Python workflow that applies/removes tags driven by YAML whitelists and PRs, split dbt pipelines for customer vs internal use (unmasked vs masked), and integrations to update a Notion data catalogue and send alerts. The article includes implementation details and code samples for the tagging workflow and Notion API calls.
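
The post's own tooling is written in Python; purely to show the shape of the allowlist-driven decision, here is a hypothetical Go sketch that loads a YAML whitelist and reports whether a column should keep its masking policy tag (applying or removing the tag itself would go through the BigQuery API, as in the post).

```go
package masking

import (
	"os"

	"gopkg.in/yaml.v3"
)

// Allowlist mirrors the idea of a PR-reviewed YAML whitelist: only columns
// listed here have their masking policy tag removed. The file layout is
// hypothetical, not incident.io's actual schema.
type Allowlist struct {
	Tables map[string][]string `yaml:"tables"` // table name -> unmasked columns
}

// LoadAllowlist parses the reviewed YAML file.
func LoadAllowlist(path string) (*Allowlist, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var a Allowlist
	if err := yaml.Unmarshal(raw, &a); err != nil {
		return nil, err
	}
	return &a, nil
}

// ShouldMask returns true when a column is not explicitly whitelisted,
// matching the "mask every new column by default" posture in the post.
func (a *Allowlist) ShouldMask(table, column string) bool {
	for _, c := range a.Tables[table] {
		if c == column {
			return false
		}
	}
	return true
}
```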


How we model our data warehouse

A technical walkthrough of Incident.io’s data warehouse architecture and dbt-based modeling approach. The post details their staging → intermediate → marts layer pattern (including naming conventions), responsibilities of each layer, how they separate customer-facing (insights) and internal (dim/fct) marts, and BI-tool integration principles (pre-joined marts, avoid surfacing intermediate/staging, don’t over-model). It closes with operational advice for scaling these practices (usage logs to drive modeling decisions).


Observability as a superpower

A practical guide to using tracing (OpenTelemetry) and spans at incident.io: what traces and spans represent, naming and attribute conventions, code-level instrumentation examples (Go), and how traces integrate with logs, Sentry and tooling (Google Cloud Trace, Grafana Tempo). The author details how traces speed up incident diagnosis, querying strategies (TraceQL), tips for local and production setups, and cost/attribute considerations.
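
To ground the conventions the post covers, here is a minimal Go instrumentation sketch using the OpenTelemetry API; the span name, attribute keys, and helper functions are illustrative rather than incident.io's actual conventions.

```go
package handler

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// escalateIncident wraps its work in a span so the whole escalation shows up
// as one node in the trace, with attributes that make it queryable later
// (for example via TraceQL in Grafana Tempo).
func escalateIncident(ctx context.Context, incidentID string) error {
	ctx, span := otel.Tracer("app/escalations").Start(ctx, "EscalateIncident")
	defer span.End()

	span.SetAttributes(
		attribute.String("incident.id", incidentID),
	)

	if err := notifyOnCall(ctx, incidentID); err != nil {
		// Recording the error attaches it to the span and flags the trace,
		// which is what makes traces so useful during incident diagnosis.
		span.RecordError(err)
		span.SetStatus(codes.Error, "failed to notify on-call")
		return err
	}
	return nil
}

func notifyOnCall(ctx context.Context, incidentID string) error {
	// ... downstream call that would create its own child span ...
	return nil
}
```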


Choosing the right Postgres indexes

A practical guide to Postgres indexing: what indexes are and how they speed up queries, examples using EXPLAIN/EXPLAIN ANALYZE, when to add indexes (high row discard, ORDER BY, uniqueness), the available index types (B-tree by default, plus GIN/GiST/BRIN and others), composite and partial index considerations, tradeoffs (disk and write overhead), and how to detect missing indexes using observability tools (Grafana, Query Insights) with a short remediation workflow.
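
As a quick illustration of the EXPLAIN workflow the guide walks through, here is a hedged Go sketch that prints a query plan so you can check whether the planner chose an index; the table, column, and suggested index names are hypothetical.

```go
package dbtools

import (
	"context"
	"database/sql"
	"fmt"
)

// explainQuery prints the Postgres plan for a query. If the plan shows a
// sequential scan that discards most rows, that is the signal for adding an
// index, e.g.:
//
//	CREATE INDEX idx_incidents_org_created ON incidents (organisation_id, created_at);
func explainQuery(ctx context.Context, db *sql.DB) error {
	rows, err := db.QueryContext(ctx, `
		EXPLAIN ANALYZE
		SELECT * FROM incidents
		WHERE organisation_id = 'org_123'
		ORDER BY created_at DESC
		LIMIT 50`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			return err
		}
		fmt.Println(line) // each row is one line of the query plan
	}
	return rows.Err()
}
```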


What is a SEV1 incident? Understanding critical impact and how to respond

An explainer covering what SEV1 (Severity 1) incidents are, how they compare to other severity levels, the criteria for identifying SEV1 events, and the immediate response process (roles, communication channels, and mitigation steps). It also covers prevention (monitoring, automated testing, load balancing, capacity planning), regular incident drills, and blameless post-incident reviews with actionable learnings and documentation to improve future readiness.