Company: Slack-related resources

From Custom to Open: Scalable Network Probing and HTTP/3 Readiness with Prometheus

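The team faced an HTTP/3 monitoring blind spot: traditional tools could not probe the new QUIC-based endpoints. Sebastian, an intern, led development of a QUIC extension for the Prometheus Blackbox Exporter, using the quic-go library to add HTTP/3 client support, and open-sourced the result. The approach unifies monitoring data across HTTP versions, makes alerting more reliable, and gives the industry a general-purpose solution for HTTP/3 migration. Planned future work includes SNI routing validation and path visualization.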

Slack Engineering

Streamlining Security Investigations with Agents

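Slack's Security Engineering team uses AI agents to streamline security incident investigations. A complex investigation is broken into multiple model calls, each with a clearly defined task and structured output, which gives finer control. The design has three kinds of agents: a Director that advances the investigation, Experts that produce findings, and a Critic that assesses their quality. The system follows a knowledge pyramid strategy in which low-cost models process the underlying data and high-cost models synthesize the key findings. A real-time dashboard supports monitoring and debugging, ensuring efficient collaboration. During investigations the agents have shown an ability to make spontaneous discoveries, significantly improving the efficiency of security defense.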
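
The Director, Expert, and Critic roles described above could be wired together roughly as follows. This is a minimal sketch, not Slack's implementation: the `Finding` and `Review` types, the `run_round` and `direct` functions, and the use of plain Python callables in place of real model calls are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    """Structured output of one Expert call (hypothetical shape)."""
    expert: str
    summary: str


@dataclass
class Review:
    """Structured output of the Critic for one finding (hypothetical shape)."""
    finding: Finding
    credible: bool


def run_round(question: str,
              experts: Dict[str, Callable[[str], str]],
              critic: Callable[[Finding], bool]) -> List[Review]:
    """Ask every Expert the same question, then have the Critic grade each finding."""
    findings = [Finding(name, ask(question)) for name, ask in experts.items()]
    return [Review(f, critic(f)) for f in findings]


def direct(reviews: List[Review]) -> str:
    """Director step: conclude if any finding passed review, otherwise continue."""
    return "conclude" if any(r.credible for r in reviews) else "continue"
```

Keeping each step's input and output as a small typed record is one way to get the "clearly defined task and structured output" property: every call can be inspected and tested in isolation.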

Slack Engineering

Build better software to build software better

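To address builds that took 60 minutes, the team combined the high-performance build tool Bazel with classic software engineering principles to optimize the build process. By defining a clear dependency graph and applying caching and parallelization strategies, the team cut build time substantially, to 10-30 minutes. The key was decoupling the frontend, backend, and build code and designing composable build units, which raised cache hit rates and parallel efficiency. The optimized build is not only faster but also improves overall system maintainability and developer productivity.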
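
To see why decoupled build units raise cache hit rates, consider a toy content-based cache key. This is only an illustration of the general idea, not how Bazel computes its digests: the `cache_key` function and the target names are hypothetical. A target's key depends on its own sources and on the keys of its dependencies, so a change invalidates only the targets that actually depend on it.

```python
import hashlib
from typing import Dict, List, Optional


def cache_key(target: str,
              sources: Dict[str, str],
              deps: Dict[str, List[str]],
              memo: Optional[Dict[str, str]] = None) -> str:
    """Hash a target's own source content together with its dependencies' keys."""
    memo = {} if memo is None else memo
    if target not in memo:
        digest = hashlib.sha256(sources[target].encode())
        # Sort dependencies so the key does not depend on declaration order.
        for dep in sorted(deps.get(target, [])):
            digest.update(cache_key(dep, sources, deps, memo).encode())
        memo[target] = digest.hexdigest()
    return memo[target]


# Hypothetical graph: "app" depends on independent "frontend" and "backend" units.
deps = {"app": ["frontend", "backend"]}
before = {"app": "app-v1", "frontend": "fe-v1", "backend": "be-v1"}
after = dict(before, frontend="fe-v2")  # only the frontend changed

backend_reused = cache_key("backend", before, deps) == cache_key("backend", after, deps)
app_rebuilt = cache_key("app", before, deps) != cache_key("app", after, deps)
```

Here `backend_reused` and `app_rebuilt` are both true: the backend result can come from cache while only the frontend and the final app are rebuilt. If the frontend and backend were a single coupled target, any change would invalidate both.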

Slack Engineering

Advancing Our Chef Infrastructure: Safety Without Disruption

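Slack split its production environment into multiple buckets (such as prod-1 through prod-6) to reduce deployment risk and ensure that new nodes do not immediately pick up a bad configuration. It introduced the Chef Summoner service, which triggers Chef runs based on signals rather than a fixed schedule, improving safety and efficiency. A scheduled job is retained as a fallback in case Summoner fails. In the future, Slack will roll out Shipyard, a new EC2 ecosystem that supports service-level deployments and automatic rollback.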
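
The combination of signal-triggered runs and a scheduled fallback can be sketched as a single decision function. This is an assumption-laden illustration rather than Chef Summoner's actual logic: `should_run_chef`, its parameters, and the 12-hour fallback interval are all hypothetical.

```python
from datetime import datetime, timedelta

# Assumed fallback interval; the source does not state the real value.
FALLBACK_INTERVAL = timedelta(hours=12)


def should_run_chef(signal_pending: bool,
                    last_run: datetime,
                    now: datetime,
                    fallback_interval: timedelta = FALLBACK_INTERVAL) -> bool:
    """Run when a signal is pending, or when the fallback interval has elapsed."""
    if signal_pending:
        return True
    # Fallback path: covers the case where the signalling service has failed.
    return now - last_run >= fallback_interval
```

The fallback branch only matters when no signal has arrived for a long time, which is the failure mode the retained scheduled job is meant to cover.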

Slack Engineering

Deploy Safety: Reducing customer impact from change

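In 2023 Slack launched its Deploy Safety program, which used automated detection and rollback mechanisms to reduce customer impact hours by 90%. For incidents caused by code deployments, the team set a goal of automatic remediation within 10 minutes and optimized the frontend and backend deployment processes. Following a strategy of "broad experimentation, then focus on high value," the team initially invested in many projects and ultimately confirmed that automated rollback was highly effective. Key lessons include tolerating lagging metrics, training frequently to build the team's fluency with the tools, and keeping the core metric consistent. Going forward, Slack will continue to extend automated deployment coverage and explore new technologies such as AI-based anomaly detection.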

Slack Engineering

Building Slack’s Anomaly Event Response

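Slack's Anomaly Event Response (AER) system uses real-time monitoring and advanced analytics to automatically identify and terminate suspicious user sessions, cutting security detection and response time from hours to minutes. AER detects several kinds of threats, such as access from Tor nodes and data scraping, and lets users customize its configuration. The system has a multi-tier architecture that combines a detection engine, a decision framework, and a response orchestrator to deliver efficient protection and help enterprises respond to potential threats in real time.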
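
The three tiers named above (detection, decision, response) can be pictured as a small pipeline. The sketch below is hypothetical throughout: the `Event` fields, the anomaly names, the scraping threshold, and the configuration format are assumptions for illustration, not AER's real interfaces.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Event:
    """One observed session event (hypothetical shape)."""
    session_id: str
    source_is_tor: bool
    requests_per_minute: int


def detect(event: Event, scrape_threshold: int = 1000) -> Optional[str]:
    """Detection engine: label the event with an anomaly type, if any."""
    if event.source_is_tor:
        return "tor_access"
    if event.requests_per_minute > scrape_threshold:
        return "data_scraping"
    return None


def decide(anomaly: str, config: Dict[str, str]) -> str:
    """Decision framework: look up the customer-configured action for an anomaly."""
    return config.get(anomaly, "ignore")


def respond(events: List[Event], config: Dict[str, str]) -> List[str]:
    """Response orchestrator: return the session IDs that should be terminated."""
    terminated = []
    for event in events:
        anomaly = detect(event)
        if anomaly is not None and decide(anomaly, config) == "terminate":
            terminated.append(event.session_id)
    return terminated
```

Separating the customer configuration from detection means an organization could, for example, terminate sessions on Tor access while ignoring scraping signals, without touching the detectors.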

Slack Engineering

Optimizing Our E2E Pipeline

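The Slack team found that frontend builds were running frequently even when no code had changed, wasting significant time and storage. By detecting code changes (using git diff) and reusing prebuilt assets (with the help of S3 and an internal CDN), the team reduced build frequency by 60% and cut the time of a single build from 5 minutes to 2 minutes. This saves several terabytes of storage and hundreds of compute hours each month, and unexpectedly improved test stability as well. The takeaway: digging into process redundancy with existing tools can yield significant efficiency gains.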
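
The change check can be approximated with `git diff --name-only`, which lists the files that differ between two commits. The sketch below assumes a hypothetical repository layout (the `FRONTEND_PREFIXES` values) and hypothetical helper names; the source does not describe the real rules Slack uses.

```python
import subprocess
from typing import Iterable, List, Tuple

# Hypothetical paths that affect the frontend bundle (assumption).
FRONTEND_PREFIXES: Tuple[str, ...] = ("frontend/", "package.json", "yarn.lock")


def parse_name_only(diff_output: str) -> List[str]:
    """Turn `git diff --name-only` output into a list of non-empty paths."""
    return [line.strip() for line in diff_output.splitlines() if line.strip()]


def changed_paths(base: str, head: str) -> List[str]:
    """List files that differ between two commits in the current repository."""
    result = subprocess.run(
        ["git", "diff", "--name-only", base, head],
        check=True, capture_output=True, text=True,
    )
    return parse_name_only(result.stdout)


def needs_frontend_build(paths: Iterable[str],
                         prefixes: Tuple[str, ...] = FRONTEND_PREFIXES) -> bool:
    """Rebuild only if at least one changed path falls under a frontend prefix."""
    return any(path.startswith(prefixes) for path in paths)
```

When `needs_frontend_build(changed_paths(base, head))` is false, a pipeline could skip the build and fetch the previously built assets instead.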

Slack Engineering

Automated Accessibility Testing at Slack

At Slack, customer love is our first priority and accessibility is a core tenet of customer trust. We have our own Slack Accessibility Standards that product teams follow to guarantee their features…

Slack Engineering

Slack Audit Logs and Anomalies

What are Slack Audit Logs? Like many Software as a Service (SaaS) offerings, Slack provides audit logs to Enterprise Grid customers that record when entities take an action on the platform. For…

Slack Engineering

Astra Dynamic Chunks: How We Saved by Redesigning a Key Part of Astra

Introduction Slack handles a lot of log data. In fact, we consume over 6 million log messages per second. That equates to over 10 GB of data per second! And it’s all stored using Astra, our in-house,…

Slack Engineering

We’re All Just Looking for Connection

We’ve been working to bring components of Quip’s technology into Slack with the canvas feature, while also maintaining the stand-alone Quip product. Quip’s backend, which powers both Quip and canvas, is written in Python. This is the story of a tricky bug we encountered last July and the lessons we learned along the way about being careful with TCP state. We hope that showing you how we tackled our bug helps you avoid — or find — similar bugs in the future!

Slack Engineering

Advancing Our Chef Infrastructure

At Slack, we manage tens of thousands of EC2 instances that host a variety of services, including our Vitess databases, Kubernetes workers, and various components of the Slack application. The majority of these instances run on some version of Ubuntu, while a portion operates on Amazon Linux. With such a vast infrastructure, the critical question arises: how do we efficiently provision these instances and deploy changes across them? The solution lies in a combination of internally-developed services, with Chef playing a central role. In this blog post, I’ll discuss the evolution of our Chef infrastructure over the years and the challenges we encountered along the way.

Slack Engineering

Unified Grid: How We Re-Architected Slack for Our Largest Customers

All software is built atop a core set of assumptions. As new code is added and new use-cases emerge, software can become unmoored from those assumptions. When this happens, a fundamental tension arises between revisiting those foundational assumptions—which usually entails a lot of work—or trying to support new behavior atop the existing architecture. The latter approach is usually advised, to save time and reduce risk.

However, there are times when it’s worth revising the core architecture of a large software application. Recently at Slack we did just that, taking a step back to change how our backend and clients (the desktop and mobile applications) work on a foundational level.

Slack Engineering

Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack

Slack Data Engineering recently migrated their data workload from EMR 5 to EMR 6, using Spark 3 as the processing engine. The migration aimed to improve performance, enhance security, and achieve cost savings. They faced challenges related to supporting the same Hive catalog, provisioning different EMR clusters, controlling costs, and supporting different versions of job libraries. They used various tools and techniques like the Hive Schema Tool, Bazel, and the Airflow Spark operator to address these challenges. The migration allowed them to leverage the benefits of Spark 3 and improve their data processing capabilities. They also performed post-migration data validation to ensure an exact data match between the tables and made use of Trino and their in-house Python framework for detailed analysis. They continuously monitored the runtime of their pipelines and made necessary adjustments.

Slack Engineering

Proactive Measures Against Password Breaches and Cookie Hijacking

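Slack takes proactive measures and uses innovative automation to protect users from potential breaches. When a Slack cookie is invalidated, the associated session is marked for termination, and once that completes the user is signed out of their workspace. This is good for protecting accounts from unauthorized access, but we also know that nobody wants to lose access to Slack during a critical conversation or while presenting in a meeting. So at run time, our automation checks each compromised cookie and evaluates whether the associated user's geographic location implies they are within their usual working hours. If so, invalidation of that cookie is scheduled for a time window outside working hours, while cookies belonging to users who are currently outside working hours are invalidated immediately. This lets us provide a positive user experience based on each user's time zone while working out the most efficient and timely invalidation time to protect against stolen cookies.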
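
The scheduling rule described above can be sketched as a function that returns when a compromised cookie should be invalidated. This is a simplified illustration under stated assumptions: working hours are taken to be 09:00 to 17:00 on weekdays, the user's time zone is passed in directly rather than inferred from geographic location, and the name `invalidation_time` is hypothetical.

```python
from datetime import datetime, time, timezone, tzinfo

# Assumed working hours; the source does not give the real window.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def invalidation_time(now_utc: datetime, user_tz: tzinfo) -> datetime:
    """Return the UTC time at which a compromised cookie should be invalidated."""
    local = now_utc.astimezone(user_tz)
    in_working_hours = (
        local.weekday() < 5 and WORK_START <= local.time() < WORK_END
    )
    if not in_working_hours:
        # The user is likely offline, so invalidate immediately.
        return now_utc
    # Otherwise defer until the end of the user's working day.
    end_local = local.replace(
        hour=WORK_END.hour, minute=WORK_END.minute, second=0, microsecond=0
    )
    return end_local.astimezone(timezone.utc)
```

For a user at UTC-5, a cookie flagged at 15:00 UTC on a Wednesday (10:00 local) would be deferred to 22:00 UTC, while the same cookie for a user at UTC+9 (midnight local) would be invalidated right away. A production version would also need to bound the delay, since deferring invalidation leaves a stolen cookie usable for longer.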

Slack Engineering

Catching Compromised Cookies

Slack uses cookies to track session states for users on slack.com and the Slack Desktop app. The ever-present cookie banners have made cookies mainstream, but as a quick refresher, cookies are a…

Slack Engineering