Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While generally these cron scripts executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues lead us to design and build a better way to execute cron scripts reliably at scale.
Running cron scripts at Slack started in the way you might expect. There was one node with a copy of all the scripts to run and one crontab file with the schedules for all the scripts. The node was responsible for executing the scripts locally on their specified schedule. Over time, the number of scripts grew, and the amount of data each script processed also grew. For a while, we could keep moving to bigger nodes with more CPU and more RAM; that kept things running most of the time. But the setup still wasn’t that reliable — with one box running, any issues with provisioning, rotation, or configuration would bring the service to a halt, taking some key Slack functionality with it. After continuously adding more and more patches to the system, we decided it was time to build something new: a reliable and scalable cron execution service. This article will detail some key components and considerations of this new system.