Noisy Neighbor Detection with eBPF
The sched_wakeup and sched_wakeup_new hooks are invoked when a process changes state from 'sleeping' to 'runnable.' They let us identify when a process is ready to run and is waiting for CPU time. During this event, we generate a timestamp and store it in an eBPF hash map using the process ID as the key.
struct { __uint(type, BPF_MAP_TYPE_HASH); __uint(max_entries, MAX_TASK_ENTRIES);
__uint(key_size, sizeof(u32));
__uint(value_size, sizeof(u64));
} runq_enqueued SEC(".maps");
SEC("tp_btf/sched_wakeup")
int tp_sched_wakeup(u64 *ctx){
struct task_struct *task = (void *)ctx[0];
u32 pid = task->pid; u64 ts = bpf_ktime_get_ns(); bpf_map_update_elem(&runq_enqueued, &pid, &ts, BPF_NOEXIST);
return 0;
}
Conversely, the sched_switch hook is triggered when the CPU switches between processes. This hook provides pointers to the process currently utilizing the CPU and the process about to take over. We use the upcoming task's process ID (PID) to fetch the timestamp from the eBPF map. This timestamp represents when the process entered the queue, which we had previously stored. We then calculate the run queue latency by simply subtracting the timestamps.
SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx){
struct task_struct *prev = (struct task_struct *)ctx[1];
struct task_struct *next = (struct task_struct *)ctx[2]; u32 prev_pid = prev->pid; u32 next_pid = next->pid; u64 *tsp = bpf_map_lookup_elem(&runq_enqueued, &next_pid);
if (tsp == NULL) {
return 0; } u64 now = bpf_ktime_get_ns(); u64 runq_lat = now - *tsp; bpf_map_delete_elem(&runq_enqueued, &next_pid);
....
One of the advantages of eBPF is its ability to provide pointers to the actual kernel data structures representing processes or threads, also known as tasks in kernel terminology. This feature enables access to a wealth of information stored about a process. We required the process's cgroup ID to associate it with a container for our specific use case. However, the cgroup information in the process struct is safeguarded by an RCU (Read Copy Update) lock.
To safely access this RCU-protected information, we can leverage kfuncs in eBPF. kfuncs are kernel functions that can be called from eBPF programs. There are kfuncs available to lock and unlock RCU read-side critical sections. These functions ensure that our eBPF program remains safe and efficient while retrieving the cgroup ID from the task struct.
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;
u64 get_task_cgroup_id(struct task_struct *task)
{
struct css_set *cgroups;
u64 cgroup_id; bpf_rcu_read_lock(); cgroups = task->cgroups; cgroup_id = cgroups->dfl_cgrp->kn->id; bpf_rcu_read_unlock();
return cgroup_id;
}
Once the data is ready, we must package it and send it to userspace. For this purpose, we chose the eBPF ring buffer. It is efficient, high-performing, and user-friendly. It can handle variable-length data records and allows data reading without necessitating extra memory copying or syscalls. However, the sheer number of data points was causing the userspace program to use too much CPU, so we implemented a rate limiter in eBPF to sample the data.
struct { __uint(type, BPF_MAP_TYPE_RINGBUF); __uint(max_entries, RINGBUF_SIZE_BYTES);
} events SEC(".maps");
struct {
__uint(type, BPF_MAP_TYPE_PERCPU_HASH); __uint(max_entries, MAX_TASK_ENTRIES);
__uint(key_size, sizeof(u64));
__uint(value_size, sizeof(u64));
} cgroup_id_to_last_event_ts SEC(".maps");
struct runq_event {
u64 prev_cgroup_id; u64 cgroup_id; u64 runq_lat; u64 ts;};
SEC("tp_btf/sched_switch")
int tp_sched_switch(u64 *ctx){ u64 prev_cgroup_id = get_task_cgroup_id(prev); u64 cgroup_id = get_task_cgroup_id(next); u64 *last_ts = bpf_map_lookup_elem(&cgroup_id_to_last_event_ts, &cgroup_id);
u64 last_ts_val = last_ts == NULL ? 0 : *last_ts;
if (now - last_ts_val < RATE_LIMIT_NS) {
return 0;
}
struct runq_event *event;
event = bpf_ringbuf_reserve(&events, sizeof(*event), 0);
if (event) {
event->prev_cgroup_id = prev_cgroup_id; event->cgroup_id = cgroup_id; event->runq_lat = runq_lat; event->ts = now;
bpf_ringbuf_submit(event, 0);
bpf_map_update_elem(&cgroup_id_to_last_event_ts, &cgroup_id, &now, BPF_ANY); }
return 0;
}
Our userspace application, developed in Go, processes events from the ring buffer to emit metrics to our metrics backend, Atlas. Each event includes a run queue latency sample with a cgroup ID, which we associate with containers running on the host. We categorize it as a system service if no such association is found. When a cgroup ID is associated with a container, we emit a percentile timer Atlas metric (runq.latency) for that container. We also increment a counter metric (sched.switch.out) to monitor preemptions occurring for the container's processes. Access to the prev_cgroup_id of the preempted process allows us to tag the metric with the cause of the preemption, whether it's due to a process within the same container (or cgroup), a process in another container, or a system service.
It's important to highlight that both the runq.latency metric and the sched.switch.out metrics are needed to determine if a container is affected by noisy neighbors, which is the goal we aim to achieve — relying solely on the runq.latency metric can lead to misconceptions. For example, if a container is at or over its cgroup CPU limit, the scheduler will throttle it, resulting in an apparent spike in run queue latency due to delays in the queue. If we were only to consider this metric, we might incorrectly attribute the performance degradation to noisy neighbors when it's actually because the container is hitting its CPU quota. However, simultaneous spikes in both metrics, mainly when the cause is a different container or system process, clearly indicate a noisy neighbor issue.
Below is the runq.latency metric for a server running a single container with ample CPU capacity. The 99th percentile averages 83.4µs (microseconds), serving as our baseline. Although there are some spikes reaching 400µs, the latency remains within acceptable parameters.