Did you know that ground stations transmit signals to satellites 22,236 miles above the equator in geostationary orbits, and that those signals are then beamed down to the entire North American continent? Satellite radios today serve hundreds of channels across 9,540,000 square miles. Unless you’re working at a secret military facility, deep underground, you can enjoy satellite radio everywhere.
Just like the satellites, Slack sends millions of messages every day across millions of channels in real time all around the world. Traffic on a typical work day shows that most users are online between 9am and 5pm local time, with peaks at 11am and 2pm and a small dip in between for the lunch hour. Though working hours are similar across regions, the two peaks in the graph below make it evident that prime time is not the same everywhere: it’s post-noon in some regions and pre-noon in others. Each colored line in the graph below represents a region.
Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to keep on top of important information. Poor notification completeness erodes the trust of all Slack users.
This blog post discusses the strategies that Slack uses to manage the lifecycle (development, support, and eventual retirement) of infrastructure projects, through the lens of our migration across three successive internal “platform” offerings.
Hakana: Taking Hack Seriously
We started migrating to a different language called Hack in 2016. Hack was created by Facebook after they had struggled to scale their operations with PHP. It offered more type-safety than PHP, and it came with an interpreter (called HHVM) that could run PHP code faster than PHP’s own interpreter.
Mobile Developer Experience at Slack
The mobile developer experience team empowers developers to ship code with confidence while enjoying a pleasant and productive engineering experience.
BuildRock: A Build Platform at Slack
Our build platform is an essential piece of delivering code to production efficiently and safely at Slack. Over time it has undergone a lot of changes, and in 2021 the Build team started looking at the long-term vision.
Some questions the Build team wanted to answer were:
- When should we invest in modernizing our build platform?
- How do we deal with our build platform tech debt issues?
- Can we move faster and safer while building and deploying code?
- Can we make these investments without impacting our existing production builds?
- What do we do with existing build methodologies?
In this article we will explore how the Build team at Slack is investing in a build platform that solves some existing issues and handles scale for the future.
Slowing Down to Speed Up - Circuit Breakers for Slack's CI/CD
How Slack increased developer productivity and prevented cascading internal failures by implementing orchestration-level circuit breakers.
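For readers new to the pattern, here is a minimal sketch of a generic circuit breaker in Python. It is not Slack's implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration. The idea is that after enough consecutive failures the breaker "opens" and callers stop sending work downstream until a cool-down has passed.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: opens after N consecutive failures (hypothetical API)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if a request may proceed."""
        if self.opened_at is None:
            return True
        # Once the cool-down has elapsed, let a trial request through (half-open).
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        """A successful call closes the breaker and resets the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure and (re)open the breaker once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before dispatching a job, then report the outcome with `record_success()` or `record_failure()`. A failure during the trial period restarts the cool-down, which is what keeps a struggling downstream system from being flooded again.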
AutoTransform: Efficient Codebase Modification
How Slack is bringing automation to bear to solve the problem of maintaining, modifying, and upgrading codebases.
Building Background Effects for Clips
Last September, Slack released Clips, allowing users to capture video, audio, and screen recordings in messages to help distributed teams connect and share their work. We’ve continued iterating on Clips since its release, adding thumbnail selection, background blur, and most recently, background image replacement.
This blog post provides a deep dive into our implementation of background effects (background blur and background image replacement) for browsers and the desktop client. We’ve used a variety of web technologies, including WebGL and WebAssembly, to make background effects as performant as possible on our desktop platforms.
Scaling Slack’s Mobile Codebases: Modernization
In the first two posts about the Duplo initiative, we described why we decided to revamp our mobile codebases, the initial phase to clean up tech debt, and our efforts to modularize our iOS and Android codebases (post 1, post 2). In this final post, we will discuss the last theme of the Duplo initiative, Modernization, and look at the overall results and impact on developers.
Slack’s Incident on 2-22-22
Slack experienced a major incident on February 22 this year, during which time many users were unable to connect to Slack, including the author — which certainly made my role as Incident Commander more challenging!
Handling Flaky Tests at Scale: Auto Detection & Suppression
This post describes the path we have taken to minimize the number of flaky tests through an approach of automated test failure detection and suppression. This is not a new problem that we are trying to solve; many companies have published articles on systems created for handling flaky tests. This article outlines how test flakiness is an increasing problem at scale and how we got it under control at Slack.
Balancing Safety and Velocity in CI/CD at Slack
A story of evolving socio-technical workflows that increased developer velocity and redefined confident testing and deploy workflows at Slack.
The Case of the Recursive Resolvers
On September 30th 2021, Slack had an outage that impacted less than 1% of our online user base, and lasted for 24 hours. This outage was the result of our attempt to enable DNSSEC — an extension intended to secure the DNS protocol, required for FedRAMP Moderate — but which ultimately led to a series of unfortunate events.
The internet relies very heavily on the Domain Name System (DNS) protocol. DNS is like a phone book for the entire internet. Web sites are accessed through domain names, but web browsers interact using IP addresses. DNS translates domain names to IP addresses, so that browsers can load the sites you need. Refer to ‘What is DNS?’ by Cloudflare to read more about how DNS works and all the necessary steps to do a domain name lookup.
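As a small illustration of the name-to-address translation described above, the sketch below asks the operating system's resolver for the addresses behind a hostname, using Python's standard `socket.getaddrinfo`. The helper name `resolve` is an assumption for this example. Note that the system resolver may answer from a local hosts file or cache rather than by querying DNS, which is the case for `localhost`.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the unique IP addresses the system resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "localhost" resolves without network access on most systems.
    print(resolve("localhost"))
```

Passing a public domain name instead of `localhost` would trigger a real lookup through whichever recursive resolver the machine is configured to use.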
How Two Interns Are Helping Secure Millions of Lines of Code
At Slack, proactively securing our systems is a top priority. One way we achieve this is by automating the detection of security issues with static code analysis tools, which inspect programs without executing them. They’re often used with security-based rules to automate identification of vulnerabilities and insecure programming practices, which frees up more bandwidth for security engineers. For us, expanding our static code analysis program became critical as we looked to grow into the public sector, where there are rising demands to show our feature work is secure and to meet security certification requirements. We view static code analysis as guardrails: it keeps the worst kinds of security vulnerabilities out of our codebases. As a result, static code analysis has been top of mind for the security team at Slack for the past three quarters and remains one of the major focuses for next quarter.
Our codebase is largely written in Hack. While Hack comes from work that Facebook performed to develop a typed version of PHP, it is a separate language and there are no static analysis tools broadly available for it. Given that over 5 million lines of code at Slack are written in Hack, how can we ensure it remains secure at scale?
Infrastructure Observability for Changing the Spend Curve
A deep dive on how we crafted an order of magnitude change in our spend (10x reduction compared to baseline growth) over the last two years with iterative understanding and changes in Slack’s Continuous Integration (CI) infrastructure.