BPFAgent: eBPF Monitoring at DoorDash
As DoorDash experienced rapid growth over the last few years, we began to see the limits of our traditional methods of monitoring. Metrics, logs, and traces provide vital information about our service ecosystem. But these signals rely almost entirely on application-level instrumentation, which can leave gaps or produce conflicting semantics across different systems. We decided to look for solutions that could provide a more complete and unified picture of our networking topology.
One of these solutions has been monitoring with eBPF, which allows developers to write programs that are injected directly into the kernel and can trace kernel operations. These programs, designed to provide lightweight access to most components of the kernel, are sandboxed and validated for safety by the kernel before execution. DoorDash was particularly interested in tracing network traffic via hooks called kprobes (kernel dynamic tracing) and tracepoints. With these hooks, we can intercept and understand TCP and UDP connections across our multiple Kubernetes clusters.
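To make the kernel-to-userspace handoff concrete, here is a minimal sketch in Go of how a user-space agent might decode a connection event emitted by a probe. The `connEvent` layout, its field names, and the helper functions are all hypothetical and are not BPFAgent's actual schema. The sketch assumes a probe (for example, a kprobe on a kernel function such as `tcp_connect`) writes fixed-size little-endian records to a buffer that the agent reads. Here the bytes are built by hand so the example runs without kernel access.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"net"
)

// connEvent is a hypothetical fixed-size record that a network probe
// might emit for each connection. The layout is an assumption made for
// illustration only.
type connEvent struct {
	PID      uint32
	SrcAddr  [4]byte // IPv4 source address
	DstAddr  [4]byte // IPv4 destination address
	SrcPort  uint16
	DstPort  uint16
	Protocol uint8   // 6 = TCP, 17 = UDP (IANA protocol numbers)
	_        [3]byte // padding so the record is 20 bytes
}

// decodeConnEvent parses one raw little-endian record into a connEvent.
func decodeConnEvent(raw []byte) (connEvent, error) {
	var ev connEvent
	if err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev); err != nil {
		return connEvent{}, fmt.Errorf("decode connection event: %w", err)
	}
	return ev, nil
}

// String renders the event as a single human-readable line.
func (e connEvent) String() string {
	proto := "other"
	switch e.Protocol {
	case 6:
		proto = "tcp"
	case 17:
		proto = "udp"
	}
	return fmt.Sprintf("pid=%d %s %s:%d -> %s:%d",
		e.PID, proto,
		net.IP(e.SrcAddr[:]), e.SrcPort,
		net.IP(e.DstAddr[:]), e.DstPort)
}

// encodeSample builds the bytes a probe might have produced. It stands in
// for a real kernel buffer so the sketch is runnable anywhere.
func encodeSample() []byte {
	var buf bytes.Buffer
	ev := connEvent{
		PID:      4242,
		SrcAddr:  [4]byte{10, 0, 0, 5},
		DstAddr:  [4]byte{10, 0, 1, 9},
		SrcPort:  51234,
		DstPort:  443,
		Protocol: 6,
	}
	// Writing to a bytes.Buffer cannot fail for a fixed-size struct.
	_ = binary.Write(&buf, binary.LittleEndian, ev)
	return buf.Bytes()
}

func main() {
	ev, err := decodeConnEvent(encodeSample())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ev)
}
```

In a real agent the raw bytes would come from the kernel rather than from `encodeSample`, and the record layout would have to match the struct defined in the probe's own source exactly, including padding.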
By building at the kernel level, we can monitor network traffic at the infrastructure level, which gives us new insights into DoorDash's backend ecosystem that are independent of the service workflow.
To run these eBPF probes, we developed a Go application called BPFAgent, which we run as a DaemonSet in all of our Kubernetes clusters. Here we look at how we built BPFAgent, how we build and maintain its probes, and how various DoorDash teams have used the data it collects.