Company: Slack
Slack is a cloud-based instant messaging application developed by Slack Technologies and now owned by Salesforce. The name Slack is an acronym for “Searchable Log of All Conversation and Knowledge”.
Slack Audit Logs and Anomalies
What are Slack Audit Logs? Like many Software as a Service (SaaS) offerings, Slack provides audit logs to Enterprise Grid customers that record when entities take an action on the platform. For…
Astra Dynamic Chunks: How We Saved by Redesigning a Key Part of Astra
Introduction Slack handles a lot of log data. In fact, we consume over 6 million log messages per second. That equates to over 10 GB of data per second! And it’s all stored using Astra, our in-house,…
We’re All Just Looking for Connection
We’ve been working to bring components of Quip’s technology into Slack with the canvas feature, while also maintaining the stand-alone Quip product. Quip’s backend, which powers both Quip and canvas, is written in Python. This is the story of a tricky bug we encountered last July and the lessons we learned along the way about being careful with TCP state. We hope that showing you how we tackled our bug helps you avoid — or find — similar bugs in the future!
Advancing Our Chef Infrastructure
At Slack, we manage tens of thousands of EC2 instances that host a variety of services, including our Vitess databases, Kubernetes workers, and various components of the Slack application. The majority of these instances run on some version of Ubuntu, while a portion operates on Amazon Linux. With such a vast infrastructure, the critical question arises: how do we efficiently provision these instances and deploy changes across them? The solution lies in a combination of internally-developed services, with Chef playing a central role. In this blog post, I’ll discuss the evolution of our Chef infrastructure over the years and the challenges we encountered along the way.
Unified Grid: How We Re-Architected Slack for Our Largest Customers
All software is built atop a core set of assumptions. As new code is added and new use-cases emerge, software can become unmoored from those assumptions. When this happens, a fundamental tension arises between revisiting those foundational assumptions—which usually entails a lot of work—or trying to support new behavior atop the existing architecture. The latter approach is usually advised, to save time and reduce risk.
However, there are times when it’s worth revising the core architecture of a large software application. Recently at Slack we did just that, taking a step back to change how our backend and clients (the desktop and mobile applications) work on a foundational level.
Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack
Slack Data Engineering recently migrated their data workload from EMR 5 to EMR 6, using Spark 3 as the processing engine. The migration aimed to improve performance, enhance security, and achieve cost savings. They faced challenges related to supporting the same Hive catalog, provisioning different EMR clusters, controlling costs, and supporting different versions of job libraries. They used various tools and techniques like the Hive Schema Tool, Bazel, and the Airflow Spark operator to address these challenges. The migration allowed them to leverage the benefits of Spark 3 and improve their data processing capabilities. They also performed post-migration data validation to ensure an exact data match between the tables and made use of Trino and their in-house Python framework for detailed analysis. They continuously monitored the runtime of their pipelines and made necessary adjustments.
Proactive Measures Against Password Breaches and Cookie Hijacking
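Slack takes proactive measures and uses innovative automation to protect users from potential breaches. When a Slack cookie is invalidated, the session associated with it is marked for termination, and once that completes the user is signed out of their workspace. This is good for protecting user accounts from unauthorized access, but we also know that no one wants to lose access to Slack during a critical conversation or while presenting in a meeting. So at run time, our automation checks each compromised cookie and evaluates whether the associated user's geographic location implies that they are within their usual working hours. If so, invalidation of that cookie is scheduled for a time window outside working hours, while cookies belonging to users who are currently outside working hours are invalidated immediately. This lets us provide a positive user experience based on each user's time zone while working out the most efficient and timely invalidation time to secure the stolen cookies.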
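The scheduling rule described in this excerpt can be sketched roughly as follows. This is a minimal illustration and not Slack's implementation: the 09:00 to 17:00 window, the function name `schedule_invalidation`, and the use of a fixed UTC offset in place of a geolocation lookup are all assumptions made for the example.

```python
from datetime import datetime, time, timedelta, timezone

# Assumed working-hours window in the user's local time (not from the source).
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def schedule_invalidation(now_utc: datetime, utc_offset_hours: float) -> datetime:
    """Return the UTC time at which a compromised cookie should be invalidated.

    utc_offset_hours is a hypothetical stand-in for the time zone inferred
    from the user's geographic location.
    """
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    local_now = now_utc.astimezone(user_tz)

    if WORK_START <= local_now.time() < WORK_END:
        # Inside working hours: defer until the window closes in local time.
        local_end = local_now.replace(
            hour=WORK_END.hour, minute=WORK_END.minute, second=0, microsecond=0
        )
        return local_end.astimezone(timezone.utc)

    # Outside working hours: invalidate immediately.
    return now_utc
```

For a user at UTC-5 whose cookie is flagged at 15:00 UTC (10:00 local), the function defers invalidation to 22:00 UTC (17:00 local); for a user at UTC-8 at the same instant (07:00 local), it returns the current time, meaning immediate invalidation.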
Catching Compromised Cookies
Slack uses cookies to track session states for users on slack.com and the Slack Desktop app. The ever-present cookie banners have made cookies mainstream, but as a quick refresher, cookies are a…
Balancing Old Tricks with New Feats: AI-Powered Conversion From Enzyme to React Testing Library at Slack
In the world of frontend development, one thing remains certain: change is the only constant. New frameworks emerge, and libraries can become obsolete without warning. Keeping up with the ever…
The Scary Thing About Automating Deploys
Most of Slack runs on a monolithic service simply called “The Webapp”. It’s big – hundreds of developers create hundreds of changes every week.
Deploying at this scale is a unique challenge. When people talk about continuous deployment, they’re often thinking about deploying to systems as soon as changes are ready. They talk about microservices and 2-pizza teams (~8 people). But what does continuous deployment mean when you’re looking at 150 changes on a normal day? That’s a lot of pizzas…
Building Custom Animations in the Workflow Builder
Slack's Workflow Builder has introduced improvements to its drag-and-drop feature. The development team implemented custom animations, including a tilt effect, to enhance the user experience during step dragging. They also created dynamic placeholders to indicate valid drop locations for steps. Spacing issues caused by hint boxes were solved by hiding them while dragging. The team utilized the onBeforeCapture responder to handle the state updates properly. Through these enhancements, Slack aims to provide a pleasant and productive experience for users, showcasing their dedication to craftsmanship.
The Query Strikes Again
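The datastores team protected the database from excessive queries by implementing rate limiting and the circuit breaker pattern. They also adopted an exponential backoff algorithm so that failing jobs retry at growing intervals and eventually stop retrying. In addition, optimizing the queries of the forget user job and reducing its load played an important role in relieving pressure on the primary database. These measures help ensure the stability and reliability of Slack's database infrastructure, provide a smooth user experience, and mitigate the impact of similar incidents.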
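As a rough illustration of two of the techniques named here, the sketch below pairs a capped exponential backoff schedule with a simple circuit breaker. It is a generic textbook version, not the datastores team's code: `backoff_delays`, `CircuitBreaker`, and all parameter names and values are hypothetical.

```python
import time
from typing import Callable, List, Optional


def backoff_delays(base: float, factor: float, max_retries: int) -> List[float]:
    """Delays (in seconds) before each retry; after max_retries the job gives up."""
    return [base * factor ** attempt for attempt in range(max_retries)]


class CircuitBreaker:
    """Stops sending queries after repeated failures, then allows a trial later."""

    def __init__(
        self,
        failure_threshold: int,
        reset_after: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a query may be attempted right now."""
        if self.opened_at is None:
            return True
        # Open circuit: permit a trial request only once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        """A successful query closes the circuit and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure and open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each query, report the outcome with `record_success()` or `record_failure()`, and sleep for the next value from `backoff_delays(...)` between retries, abandoning the job once the list is exhausted.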
Executing Cron Scripts Reliably At Scale
Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While generally these cron scripts executed as expected, over time the reliability of their execution has occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues led us to design and build a better way to execute cron scripts reliably at scale.
Running cron scripts at Slack started in the way you might expect. There was one node with a copy of all the scripts to run and one crontab file with the schedules for all the scripts. The node was responsible for executing the scripts locally on their specified schedule. Over time, the number of scripts grew, and the amount of data each script processed also grew. For a while, we could keep moving to bigger nodes with more CPU and more RAM; that kept things running most of the time. But the setup still wasn’t that reliable — with one box running, any issues with provisioning, rotation, or configuration would bring the service to a halt, taking some key Slack functionality with it. After continuously adding more and more patches to the system, we decided it was time to build something new: a reliable and scalable cron execution service. This article will detail some key components and considerations of this new system.
Traffic 101: Packets Mostly Flow
Slack handles billions of inbound network requests per day, all of which traverse our edge network and ingress load balancing tiers. In this blog post, we’ll talk about how a request flows, from a Slack user’s perspective, across the vast ether of the network to reach AWS and then Slack’s internal services.
Real-time Messaging
Did you know that ground stations transmit signals to satellites 22,236 miles above the equator in geostationary orbits, and that those signals are then beamed down to the entire North American subcontinent? Satellite radios today serve hundreds of channels across 9,540,000 square miles. Unless you’re working at a secret military facility, deep underground, you can enjoy satellite radio everywhere.
Just like the satellites, Slack sends millions of messages every day across millions of channels in real time all around the world. If we look at the traffic on a typical work day, it shows that most users are online between 9am and 5pm local time, with peaks at 11am and 2pm and a small dip in between for lunch hour. Though the working hours are similar across regions, looking at the two peaks in the graph below, it is evident that prime time is not the same: It’s post-noon in some regions and pre-noon in other regions. Each colored line in the below graph represents a region.
Tracing Notifications
Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to keep on top of important information. Poor notification completeness erodes the trust of all Slack users.