Advancing Our Chef Infrastructure: Safety Without Disruption
Slack improved the safety of its Chef/EC2 provisioning in three ways: it split a single production Chef environment into multiple environments mapped to availability zones, it rolled changes out to a canary (prod-1) first and then to prod-2 through prod-6 on a release train, and it replaced cron-scheduled Chef runs with an event-driven model. The team built Chef Librarian to publish promotions to S3, and Chef Summoner, which runs on each node and schedules Chef runs with splay. A 12-hour fallback cron remains for compliance and recovery. These changes reduce the blast radius of deployments and prepare for a planned replacement platform (Shipyard).
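
The node-side decision described above can be sketched roughly as follows. This is a minimal illustration, not Slack's implementation: the `Signal` record, the `decide_run` and `splay_seconds` functions, and the choice to derive splay from a hash of the node name are all assumptions made for the example. Only the 12-hour fallback and the prod-N environment names come from the summary.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

# 12-hour fallback window, as stated in the summary.
FALLBACK_SECONDS = 12 * 60 * 60


@dataclass(frozen=True)
class Signal:
    """Hypothetical promotion record; the real payload format is not described."""
    environment: str
    version: str


def splay_seconds(node_name: str, max_splay: int) -> int:
    """Return a stable per-node delay in [0, max_splay).

    Hashing the node name is an assumption for this sketch; it spreads
    nodes across the window so they do not all run at the same moment.
    """
    if max_splay <= 0:
        raise ValueError("max_splay must be positive")
    digest = hashlib.sha256(node_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % max_splay


def decide_run(
    signal: Optional[Signal],
    node_env: str,
    last_version: Optional[str],
    seconds_since_last_run: float,
) -> Optional[str]:
    """Return why a Chef run should be scheduled, or None if it should not."""
    # Event-driven path: a new version was promoted to this node's environment.
    if (
        signal is not None
        and signal.environment == node_env
        and signal.version != last_version
    ):
        return "promotion"
    # Fallback path: no run for 12 hours, so run anyway.
    if seconds_since_last_run >= FALLBACK_SECONDS:
        return "fallback"
    return None
```

Under these assumptions, a node in prod-2 that last applied `v41` and sees a signal for `v42` in prod-2 would schedule a run after its own splay delay, while a signal for prod-3 would be ignored until the fallback window elapses.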