Slack

A deep dive into the technology behind building Slack. Written by engineers, for engineers passionate about solving complex technical problems.

Advancing Our Chef Infrastructure: Safety Without Disruption

Slack improved Chef/EC2 provisioning safety by splitting a single production Chef environment into multiple AZ-mapped environments, rolling out changes through a canary environment (prod-1) followed by a release train across prod-2 through prod-6, and moving from cron-scheduled Chef runs to an event-driven model. They built Chef Librarian to publish promotions to S3, and Chef Summoner, which runs on each node and schedules Chef runs with splay, backed by a 12-hour fallback cron for compliance and recovery. These changes reduce blast radius during deployments and lay the groundwork for a planned replacement platform (Shipyard).
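
The post stays at the architecture level, but a minimal sketch of the event-driven pattern it describes could look like the following. The bucket and key names, the polling loop, and the splay window are assumptions for illustration; the real Chef Librarian and Chef Summoner interfaces aren't published.

```python
import random
import subprocess
import time

import boto3

# Hypothetical bucket/key names; the post does not specify them.
PROMOTIONS_BUCKET = "chef-promotions"
PROMOTIONS_KEY = "environments/prod-3/current"
MAX_SPLAY_SECONDS = 900  # spread runs out to avoid a thundering herd


def current_promotion(s3) -> str:
    """Read the promotion marker that the Librarian-style publisher writes to S3."""
    obj = s3.get_object(Bucket=PROMOTIONS_BUCKET, Key=PROMOTIONS_KEY)
    return obj["Body"].read().decode().strip()


def run_chef_with_splay() -> None:
    """Delay by a random splay, then converge the node with chef-client."""
    time.sleep(random.uniform(0, MAX_SPLAY_SECONDS))
    subprocess.run(["chef-client", "--once"], check=True)


def summoner_loop(poll_interval: int = 60) -> None:
    """Trigger a Chef run whenever a new promotion appears (fallback cron not shown)."""
    s3 = boto3.client("s3")
    last_seen = None
    while True:
        promotion = current_promotion(s3)
        if promotion != last_seen:
            run_chef_with_splay()
            last_seen = promotion
        time.sleep(poll_interval)
```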

Deploy Safety: Reducing customer impact from change

Slack’s Deploy Safety program is a year-and-a-half initiative to reduce customer-impact hours from change-triggered incidents by improving detection, instrumentation, and remediation (especially automatic rollbacks), and by standardizing and centralizing deployment processes. The team defined North Star goals and a program metric (hours of customer impact), invested in a broad but prioritized set of projects (metric-based deploys, orchestration, rollback tooling), tracked trailing results, and documented lessons about measurement, training, and executive alignment. The program achieved substantial reductions in customer-impact hours and continues to expand automation and orchestration.
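
As a concrete, heavily simplified illustration of the metric-based deploy and automatic-rollback ideas, a gate like the sketch below watches a health metric for a window after a deploy and rolls back if it degrades. The threshold, window, and hooks (get_error_rate, rollback) are hypothetical, not Slack's tooling.

```python
import time

ERROR_RATE_THRESHOLD = 0.02   # roll back if >2% of requests fail post-deploy (assumed value)
WATCH_WINDOW_SECONDS = 600    # how long to watch before promoting (assumed value)


def watch_deploy(get_error_rate, rollback, deploy_id: str) -> bool:
    """Gate a deploy on a health metric and roll back automatically if it degrades."""
    deadline = time.time() + WATCH_WINDOW_SECONDS
    while time.time() < deadline:
        if get_error_rate(deploy_id) > ERROR_RATE_THRESHOLD:
            rollback(deploy_id)        # remediation without waiting on a human
            return False
        time.sleep(30)
    return True                        # metric stayed healthy; promote the deploy
```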

Building Slack’s Anomaly Event Response

Slack’s Anomaly Event Response (AER) uses a multi-tiered architecture: a real-time detection engine with adaptive, organization-specific thresholds; a decision framework that validates anomalies against customer configuration and prevents response loops; and an orchestrator that enqueues asynchronous jobs to terminate sessions and generate detailed audit logs. The system autonomously identifies suspicious behaviors (e.g., Tor access, data-exfiltration patterns, unexpected API volumes) and compresses detection-to-response time from hours or days to minutes, while offering configurable notification and logging for transparency.
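
A toy sketch of that tiering follows; the configuration fields, anomaly kinds, and job names are invented for illustration and are not Slack's internal interfaces.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class Anomaly:
    org_id: str
    user_id: str
    kind: str          # e.g. "tor_access", "api_volume_spike" (illustrative labels)
    session_id: str


# Hypothetical per-org configuration; the real checks live in Slack's decision framework.
ORG_CONFIG = {"acme": {"auto_terminate": True, "suppressed_kinds": {"api_volume_spike"}}}
RECENTLY_ACTIONED: set[tuple[str, str]] = set()  # (org_id, session_id) pairs already handled


def should_respond(anomaly: Anomaly) -> bool:
    """Validate the anomaly against customer configuration and avoid response loops."""
    config = ORG_CONFIG.get(anomaly.org_id, {})
    if not config.get("auto_terminate"):
        return False
    if anomaly.kind in config.get("suppressed_kinds", set()):
        return False
    # Don't act twice on the same session (loop prevention).
    return (anomaly.org_id, anomaly.session_id) not in RECENTLY_ACTIONED


def orchestrate(anomaly: Anomaly, jobs: Queue) -> None:
    """Enqueue asynchronous response jobs: terminate the session, write an audit log."""
    RECENTLY_ACTIONED.add((anomaly.org_id, anomaly.session_id))
    jobs.put(("terminate_session", anomaly.session_id))
    jobs.put(("write_audit_log", anomaly))


jobs: Queue = Queue()
event = Anomaly(org_id="acme", user_id="U123", kind="tor_access", session_id="S456")
if should_respond(event):
    orchestrate(event, jobs)
```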

Optimizing Our E2E Pipeline

Slack’s DevXP team optimized their E2E CI/CD pipeline by using git diff to conditionally skip frontend builds and by reusing prebuilt frontend assets stored in AWS S3 and served via an internal CDN. The changes produced ~60% fewer builds, ~50% faster builds, terabytes of S3 storage saved, and reduced test flakiness.
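
A rough sketch of the git diff gating idea; the path prefixes, build script, and S3 location are made up, and the real pipeline's rules are more involved.

```python
import subprocess

# Hypothetical frontend path prefixes used to decide whether a build is needed.
FRONTEND_PREFIXES = ("client/", "webapp/static/")


def changed_files(base: str = "origin/main") -> list[str]:
    """List files changed relative to the merge base with the target branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        check=True, capture_output=True, text=True,
    )
    return [line for line in out.stdout.splitlines() if line]


def needs_frontend_build() -> bool:
    """Build only when frontend paths changed; otherwise reuse prebuilt assets."""
    return any(path.startswith(FRONTEND_PREFIXES) for path in changed_files())


if __name__ == "__main__":
    if needs_frontend_build():
        subprocess.run(["./scripts/build_frontend.sh"], check=True)   # hypothetical script
    else:
        # Pull previously built assets from S3 (served through an internal CDN in the post).
        subprocess.run(
            ["aws", "s3", "cp", "s3://frontend-assets/latest.tar.gz", "."], check=True
        )
```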

How we built enterprise search to be secure and private

Slack’s enterprise search leverages a federated, real-time Retrieval Augmented Generation (RAG) architecture built atop their Slack AI platform, ensuring no external data is stored and permissions are always up to date. It integrates with external systems via OAuth to fetch only user-authorized, read-scoped content at runtime under a least-privilege model. The design reuses Slack’s escrow VPC hosting for LLM inference and existing compliance infrastructure to maintain enterprise-grade security and privacy.
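
A minimal sketch of the runtime, permission-scoped fetch pattern; the connector endpoint, response shape, and llm object are placeholders, not Slack's actual APIs.

```python
import requests


def fetch_user_authorized_docs(connector_url: str, user_oauth_token: str, query: str) -> list[dict]:
    """Fetch only content the user is already authorized to read, at query time.

    Nothing is persisted: results live only for the duration of the request, so
    permissions stay with the external system of record (least privilege, read scopes).
    """
    response = requests.get(
        connector_url,                                                # hypothetical search endpoint
        params={"q": query, "limit": 10},
        headers={"Authorization": f"Bearer {user_oauth_token}"},      # the user's read-scoped token
        timeout=5,
    )
    response.raise_for_status()
    return response.json().get("results", [])


def answer_with_rag(llm, connector_url: str, token: str, question: str) -> str:
    """Ground the LLM on just-fetched, permission-checked snippets (RAG)."""
    docs = fetch_user_authorized_docs(connector_url, token, question)
    context = "\n\n".join(d.get("snippet", "") for d in docs)
    return llm.generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```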

Automated Accessibility Testing at Slack

Slack integrated automated accessibility checks into their desktop end-to-end tests by using Axe (axe-core) with Playwright. They initially explored embedding checks in React Testing Library + Jest but moved to Playwright due to framework complexities. They built helpers to run Axe, filter duplicates and non-critical violations, save artifacts (screenshots and reports) via the Playwright HTML reporter, and wire results into triage workflows (Jira + Slack). They run scheduled Buildkite regression jobs and plan future AI-assisted improvements.
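
The post describes the approach rather than a full snippet; the sketch below shows the same idea with Playwright for Python and an injected axe-core bundle. The axe-core path, the impact filter, and the target URL are assumptions.

```python
from playwright.sync_api import sync_playwright

# Severities we fail on; lower-impact findings are filed but don't block the build.
BLOCKING_IMPACTS = {"critical", "serious"}


def run_axe(page) -> list[dict]:
    """Inject axe-core into the page and return its violations."""
    page.add_script_tag(path="node_modules/axe-core/axe.min.js")  # bundle path is an assumption
    results = page.evaluate("() => axe.run()")
    return results["violations"]


def blocking_violations(violations: list[dict]) -> list[dict]:
    """Drop duplicates and non-critical findings, mirroring the post's filtering step."""
    seen, kept = set(), []
    for v in violations:
        if v.get("impact") in BLOCKING_IMPACTS and v["id"] not in seen:
            seen.add(v["id"])
            kept.append(v)
    return kept


with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://app.slack.com/")          # stand-in URL for the app under test
    failures = blocking_violations(run_axe(page))
    assert not failures, f"Accessibility violations: {[v['id'] for v in failures]}"
    browser.close()
```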

Migration Automation: Easing the Jenkins → GHA shift with help from AI

A Slack intern built a conversion tool to migrate ~242 Jenkins pipelines to GitHub Actions. The pipeline runs the GitHub Actions Importer, then a corrections tool written in Python that applies regex/string edits and LLM-driven rewrites to fix importer gaps (notably replacing actions, mirroring rate-limited actions, and adding helpful comments). The tool reduced projected migration time by roughly 1,300 hours (about 80% savings) and relied on prompt engineering to get accurate AI-driven action replacements.
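
A stripped-down version of the regex half of such a corrections pass; the pattern and the mirrored-action name are invented, and the LLM-driven rewrite step is not shown.

```python
import re
from pathlib import Path

# Example corrections applied after the GitHub Actions Importer runs.
# These rules are illustrative; the real tool's rules are more extensive.
REGEX_FIXES = [
    # Swap an importer-chosen action for an internally mirrored one (rate-limit avoidance).
    (re.compile(r"uses:\s*actions/checkout@v\d+"), "uses: slack-mirror/checkout@v4"),
]


def apply_corrections(workflow_path: Path) -> None:
    """Apply regex/string edits to one generated workflow and leave a reviewer note."""
    text = workflow_path.read_text()
    for pattern, replacement in REGEX_FIXES:
        text = pattern.sub(replacement, text)
    # The post mentions adding helpful comments for reviewers; this is one simple form.
    text = "# Auto-converted from Jenkins; review before merging.\n" + text
    workflow_path.write_text(text)


for workflow in Path(".github/workflows").glob("*.yml"):
    apply_corrections(workflow)
```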

Break Stuff on Purpose

Slack describes a Kibana/Elasticsearch outage caused by co-locating storage and the application and by stale backups and runbooks. The team implemented scheduled backups, fixed runbooks and S3 retention, intentionally broke a dev cluster to validate recovery, migrated to Kubernetes, and automated backup/restore into a CLI; the post recommends recurring chaos testing of systems to surface latent failures.
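
The backup/restore CLI itself isn't shown in the post; a bare-bones equivalent against Elasticsearch's snapshot API might look like this, assuming an S3-backed snapshot repository named s3_backups is already registered and the cluster address is local.

```python
import argparse
import datetime

import requests

ES_URL = "http://localhost:9200"     # assumption: cluster address
REPO = "s3_backups"                  # assumption: a registered S3 snapshot repository


def snapshot() -> str:
    """Take a named snapshot into the S3-backed repository."""
    name = datetime.datetime.now(datetime.timezone.utc).strftime("snap-%Y%m%d-%H%M%S")
    r = requests.put(
        f"{ES_URL}/_snapshot/{REPO}/{name}",
        params={"wait_for_completion": "true"},
        timeout=600,
    )
    r.raise_for_status()
    return name


def restore(name: str) -> None:
    """Restore a snapshot; useful for recovery and for 'break it on purpose' drills."""
    r = requests.post(f"{ES_URL}/_snapshot/{REPO}/{name}/_restore", timeout=600)
    r.raise_for_status()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Minimal backup/restore CLI sketch")
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("snapshot")
    sub.add_parser("restore").add_argument("name")
    args = parser.parse_args()
    if args.cmd == "snapshot":
        print(snapshot())
    else:
        restore(args.name)
```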

Slack Audit Logs and Anomalies

This post explains Slack Audit Logs and the anomaly events included in them: where to access logs (UI and API), connectors (Splunk, AWS AppFabric, Datadog), how anomalies are generated and interpreted, strategies for investigating and correlating anomalies (user_agent, ip_address, session_fingerprint, file_downloaded), how to allowlist CIDR ranges and ASNs, and operational guidance for aggregating anomalies and taking remediation actions (e.g., signing out a user).
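
As a sketch of the aggregation idea, the snippet below pulls recent anomaly entries from the Audit Logs API and groups them by source IP. The action name, response field names, and threshold are assumptions to verify against the Audit Logs API documentation.

```python
from collections import Counter

import requests

AUDIT_URL = "https://api.slack.com/audit/v1/logs"


def fetch_anomalies(token: str, limit: int = 200) -> list[dict]:
    """Pull recent anomaly events from the Audit Logs API (requires an org-level token)."""
    resp = requests.get(
        AUDIT_URL,
        headers={"Authorization": f"Bearer {token}"},
        params={"action": "anomaly", "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("entries", [])


def suspicious_ips(entries: list[dict], threshold: int = 5) -> list[str]:
    """Aggregate anomalies by source IP to spot addresses worth investigating or allowlisting."""
    counts = Counter(e.get("context", {}).get("ip_address") for e in entries)
    return [ip for ip, n in counts.items() if ip and n >= threshold]
```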

Astra Dynamic Chunks: How We Saved by Redesigning a Key Part of Astra

Slack redesigned Astra’s chunk allocation from fixed-size to dynamic chunks to reduce wasted cache space and costs. They changed cache-node behavior and the Cluster Manager: persisting cache-node assignments and metadata in Zookeeper, advertising node capacity instead of fixed slots, and using a first-fit bin-packing algorithm to assign chunks to nodes. The rollout used replica hosting and feature flags. Results included up to 50% fewer cache nodes in some clusters and ~20% lower cache-node costs.
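
First-fit bin packing itself is simple; below is a minimal illustration of assigning variably sized chunks to nodes that advertise capacity rather than fixed slots. All names and sizes are invented, and the real Cluster Manager also persists assignments in Zookeeper.

```python
from dataclasses import dataclass, field


@dataclass
class CacheNode:
    name: str
    capacity_gb: float                      # advertised capacity, replacing fixed slots
    chunks: list[str] = field(default_factory=list)
    used_gb: float = 0.0

    def fits(self, size_gb: float) -> bool:
        return self.used_gb + size_gb <= self.capacity_gb


def first_fit(chunks: dict[str, float], nodes: list[CacheNode]) -> dict[str, str]:
    """Assign each chunk to the first node with enough free capacity."""
    assignments = {}
    for chunk_id, size_gb in chunks.items():
        for node in nodes:
            if node.fits(size_gb):
                node.chunks.append(chunk_id)
                node.used_gb += size_gb
                assignments[chunk_id] = node.name
                break
        else:
            raise RuntimeError(f"No capacity for chunk {chunk_id}")
    return assignments


nodes = [CacheNode("cache-1", 300.0), CacheNode("cache-2", 300.0)]
chunks = {"chunk-a": 120.0, "chunk-b": 45.0, "chunk-c": 200.0, "chunk-d": 80.0}
print(first_fit(chunks, nodes))
```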

There’s No Such Thing as a Free Lunch!

Slack describes their "Incident Lunch" exercise, a low-cost, repeatable incident-response training in which participants role-play an incident (Incident Commander, Slack incident channel, periodic status posts), use "chaos cards" to introduce unpredictability, and practice getting lunch under constraints. The post covers setup (a GitHub Pages slide deck, facilitator roles, logistics), running the exercise, lessons learned, and how it scales across teams and remote offices.