To Thread or Not to Thread: An In-Depth Look at Ruby’s Execution Models

Deploying Ruby applications with threaded servers has become widely regarded as standard practice in recent years. According to the 2022 Ruby on Rails community survey, in which over 2,600 members of the global Rails community answered a series of questions about their experience with Rails, threaded web servers such as Puma are by far the most popular deployment target. Similarly, when it comes to job processors, the thread-based Sidekiq seems to represent the majority of deployments.

In this post, I’ll explore the mechanics and reasoning behind this practice, and share knowledge and advice to help you make well-informed decisions about whether or not to use threads in your applications (and if so, how many).

Why Are Threads the Popular Default?

While there are certainly many factors behind threaded servers’ rise in popularity, their main selling point is that they increase an application’s throughput without increasing its memory usage too much. So to fully understand the trade-offs between threads and processes, it’s important to first understand memory usage.

Memory Usage of a Web Application

Conceptually, the memory usage of a web application can be divided into two parts.

Static memory and processing memory are the two key components of memory usage in a web application.

The static memory is all the data you need to run your application. It consists of the Ruby VM itself, all the VM bytecode generated while loading the application, and probably some static Ruby objects such as I18n data. This part is a fixed cost: whether your server runs 1 or 10 threads, it stays stable and can be considered read-only.

The request processing memory is the amount of memory needed to process a request. There you'll find database query results, the output of rendered templates, etc. This memory is constantly being freed by the garbage collector and reused, and the amount needed is directly proportional to the number of threads your application runs.

Based on this simplified model, we express the memory usage of a web application as:

processes * (static_memory + (threads * processing_memory))

So if you have only 512MiB available, with an application using 200MiB of static memory and needing 150MiB of processing memory per thread, two single-threaded processes require 700MiB of memory, while a single process with two threads will use only 500MiB and fit in a Heroku dyno.
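
As a quick sanity check, here’s that arithmetic as a small Ruby sketch (the constants simply restate the hypothetical numbers above):

# Naive model: processes * (static_memory + threads * processing_memory)
STATIC_MEMORY     = 200 # MiB, hypothetical
PROCESSING_MEMORY = 150 # MiB per thread, hypothetical

def naive_memory_usage(processes:, threads:)
  processes * (STATIC_MEMORY + threads * PROCESSING_MEMORY)
end

naive_memory_usage(processes: 2, threads: 1) # => 700 (MiB)
naive_memory_usage(processes: 1, threads: 2) # => 500 (MiB)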

A single process with two threads uses less memory than two single-threaded processes.

However, this model, like most models, is a simplified depiction of reality. Let’s bring it closer to reality by adding another layer of complexity: Copy on Write (CoW).

Enter Copy on Write

CoW is a common resource management technique: instead of being duplicated up front, a resource is shared until one of its users needs to alter it, at which point the copy actually happens. If the alteration never happens, neither does the copy.

In old UNIX systems of the ’70s and ’80s, forking a process involved copying its entire addressable memory over to the new process’s address space, effectively doubling the memory usage. But since the mid ’90s that’s no longer true: most, if not all, fork implementations are now sophisticated enough to trick the processes into thinking they have their own private memory regions, while in reality they’re sharing them with other processes.

When the child process is forked, its page tables are initialized to point to the parent’s memory pages. Later on, if either the parent or the child tries to write to one of these pages, the operating system is notified and will actually copy the page before it’s modified.

This means that if neither the child nor the parent writes to these shared pages after the fork happens, forked processes are essentially free.
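
To see CoW in action, here’s a minimal, Linux-only sketch (the helper and numbers are purely illustrative, not from the original post): it allocates some data in the parent, forks, and watches the child’s private memory grow only once the child starts writing to the inherited objects.

# Sums the private (unshared) memory of the current process, in kB.
def private_kb
  File.read("/proc/self/smaps_rollup")
      .scan(/^Private_(?:Clean|Dirty):\s+(\d+) kB/)
      .flatten.sum(&:to_i)
end

data = Array.new(1_000_000) { |i| "item-#{i}" } # allocated before the fork

pid = fork do
  before = private_kb
  data.each(&:upcase!) # writing to the inherited strings dirties their pages...
  puts "private memory: #{before} kB -> #{private_kb} kB" # ...so private memory grows
end
Process.wait(pid)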

Copy on Write allows for sharing resources by forking the parent process.

So in a perfect world, our memory usage formula would now be:

static_memory + (processes * threads * processing_memory)

Meaning that threads would have no advantage at all over processes.

But of course we’re not in a perfect world. Some shared pages will likely be written to at some point; the question is how many. To answer this, we’ll need to know how to accurately measure the memory usage of an application.

Beware of Deceiving Memory Metrics

Because of CoW and other memory sharing techniques, there are now many different ways to measure the memory usage of an application or process. Depending on the context, some metrics can be more or less relevant.

Why RSS Isn’t the Metric You’re Looking For

The memory metric that’s most often shown by various administration tools, such as ps, is Resident Set Size (RSS). While RSS has its uses, it's really misleading when dealing with forking servers. If you fork a 100MiB process and never write in any memory region, RSS will report both processes as using 100MiB. This is inaccurate because 100MiB is being shared between the two processes—the same memory is being reported twice.

A slightly better metric is Proportional Set Size (PSS). In PSS, shared memory region sizes are divided by the number of processes sharing them. So our 100MiB process that was forked once should actually have a PSS of 50MiB. If you’re trying to figure out whether you’re nearing memory exhaustion, this is already a much more useful metric to look at because if you add up all the PSS numbers you get how much memory is actually being used—but we can go even deeper.

On Linux, you can get a detailed breakdown of a process’s memory usage through cat /proc/$PID/smaps_rollup, which sums fields such as Rss, Shared_Clean, Shared_Dirty, Private_Clean, and Private_Dirty across all of the process’s memory mappings. The numbers discussed below come from running it against a Unicorn worker and its parent process on one of our apps in production.
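
If you want to compute the same breakdown yourself, here’s a small helper along those lines (a sketch, not the exact tooling used for the numbers below):

# Returns the RSS, shared, and private totals (in kB) for a given PID.
def smaps_breakdown(pid)
  totals = Hash.new(0)
  File.read("/proc/#{pid}/smaps_rollup").scan(/^(\w+):\s+(\d+) kB/) do |field, kb|
    totals[field] = kb.to_i
  end
  {
    rss:     totals["Rss"],
    shared:  totals["Shared_Clean"] + totals["Shared_Dirty"],
    private: totals["Private_Clean"] + totals["Private_Dirty"],
  }
end

p smaps_breakdown(Process.pid)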

Let’s unpack what each element here means. First, the Shared and Private fields: as their names suggest, shared memory is the sum of the memory regions that are in use by multiple processes, whereas private memory is allocated for a specific process and isn’t shared by other processes. In this example, we see that out of the 771,912 kB of addressable memory, only 437,928 kB (56.7%) is really owned by the Unicorn worker; the rest is inherited from the parent process.

As for Clean and Dirty: clean memory is memory that has been allocated but never written to (things like the Ruby binary and various native shared libraries), while dirty memory has been written to by at least one process. Dirty memory can still be shared, as long as it was only written to by the parent process before it forked its children.

Measuring and Improving Copy on Write Efficiency

We’ve established that shared memory is key to maximizing the efficiency of processes, so the important question here is how much of the static memory is actually shared. To approximate this, we compare the worker’s shared memory with the parent process’s RSS, which is 508,544 kB in this app:

# worker_shared_mem / master_rss
>> (18288 + 315648) / 508544.0 * 100
=> 65.66

Here we see that about two-thirds of the static memory is shared:

By comparing the worker shared memory with the parent process RSS, we can see that two thirds of this app’s static memory is shared.

If we were looking at RSS, we’d think each extra worker costs ~750MiB, but in reality it’s closer to ~427MiB, whereas an extra thread would cost ~257MiB. That’s still noticeably more, but far less than what the initial naive model would have predicted.

There are a number of ways an application owner can improve CoW efficiency, the general idea being to load as much as possible as part of the boot process, before the server forks. This topic is broad enough to deserve a post of its own, but here are a few quick pointers.

The first thing to do is configure the server to fully load the application. Unicorn, Puma, and Sidekiq Enterprise all have a preload_app option for that purpose. Once that’s done, a common pattern that degrades CoW performance is memoized class variables, for example:
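
For example, something along these lines (a made-up class, but a very common shape):

require "yaml"

class Unit
  def self.conversion_table
    # Lazily computed on first use: with a preloaded server this allocation
    # happens after the fork, in every worker, so the memory can't be shared.
    @conversion_table ||= YAML.load_file("config/units.yml")
  end
end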

Such delayed evaluation both prevents that memory from being shared and slows down the first request that calls the method. The simple solution is to use a constant instead, but when that’s not possible, the next best thing is to leverage the Rails eager_load_namespaces feature, as shown here:
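
Here’s a sketch of what that can look like, reusing the hypothetical Unit class from above (Rails calls eager_load! on every registered namespace during boot, before the server forks):

class Unit
  def self.eager_load!
    conversion_table # force the memoized value to be computed at boot time
  end
end

# config/application.rb
config.eager_load_namespaces << Unit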

Now, locating these lazily loaded constants is the tricky part. The Ruby heap-profiler gem is a useful tool for this. You can use it to dump the entire heap right after fork, and then, after processing a few requests, see how much the process has grown and where these extra objects were allocated.
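
The general approach looks something like this (a rough sketch using the standard objspace API rather than heap-profiler’s own interface):

require "objspace"

ObjectSpace.trace_object_allocations_start

# Right after the worker is forked:
ObjectSpace.dump_all(output: File.open("heap-after-fork.json", "w"))

# After the worker has served a few requests:
ObjectSpace.dump_all(output: File.open("heap-after-requests.json", "w"))

# Diffing the two dumps shows which objects appeared after the fork,
# each annotated with the file and line where it was allocated.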

The Case for Process-Based Servers

So, while there are increased memory costs involved in using process-based servers, using more accurate memory metrics and optimizations like CoW to share memory between processes can alleviate some of this. But why use process-based servers such as Unicorn or Resque at all, given the increased memory cost? There are actually advantages to process-based servers that shouldn’t be overlooked, so let’s go through those. 

Clean Timeout Mechanism

When running large applications, you may run into bugs that cause some requests to take much longer than desirable. There could be many reasons for that: the requests might be specifically crafted by a malicious actor trying to DoS your service, or they might be processing an unexpectedly large amount of data. When this occurs, being able to cleanly interrupt the request is paramount for resiliency. Process-based servers can kill the worker process and fork a fresh one to replace it, ensuring the request is cleanly interrupted.
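
With Unicorn, for example, this hard timeout is a single setting in the server configuration (the value here is only an illustration):

# config/unicorn.rb
timeout 15 # if a worker spends more than 15 seconds on a request, the master kills and reforks it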

Threads, however, can’t be interrupted cleanly. Since they directly share mutable resources with other threads, killing a single thread may leave resources such as mutexes or database connections in a broken state, causing the other threads to run into various unrecoverable errors.

The Black Box of Global VM Lock Latency

Improved latency is another major advantage of processes over threads in Ruby (and other languages with similar constraints, such as Python). A typical web application process does two types of work: CPU and I/O. So two Ruby processes might look like this:

CPU, I/O, and GC in two processes of a Ruby application.

But in a Ruby process, because of the infamous Global VM Lock (GVL), only one thread at a time can execute Ruby code, and when the garbage collector (GC) triggers, all threads are paused. So if we were to use two threads, the picture may instead look like this:

The Global VM Lock (GVL) increases latency in Ruby threads.

So every time two threads need to execute Ruby code at the same time, the service latency increases. How much this happens varies considerably from one application to another, and even from one request to another. If you think about it, to fully saturate a process with N threads, an application only needs to spend 1/N of its time executing Ruby code, or put another way, less than (N - 1)/N of its time waiting on I/O: 50 percent I/O for two threads, 75 percent I/O for four threads, etc. And that’s only the saturation limit; given that a request’s use of I/O and CPU is very much unpredictable, an application doing 75 percent I/O with two threads will still frequently wait on the GVL.
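
Put as a tiny sketch, the saturation threshold from the paragraph above looks like this:

# With N threads, the GVL is saturated once each request spends at least
# 1/N of its time on CPU, i.e. at most (N - 1)/N of its time waiting on I/O.
def max_io_ratio_before_gvl_saturation(threads)
  (threads - 1).fdiv(threads)
end

max_io_ratio_before_gvl_saturation(2) # => 0.5  (50% I/O)
max_io_ratio_before_gvl_saturation(4) # => 0.75 (75% I/O)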

The common wisdom in the Ruby community is that Ruby applications are relatively I/O heavy, but from my experience it’s not quite true, especially once you consider that GC pauses do acquire the GVL too, and Ruby applications tend to spend quite a lot of time in GC.

Web applications are often specifically crafted to avoid long I/O operations in the web request cycle. Any potentially slow or unreliable I/O operation like calling a third-party API or sending an email notification is generally deferred to a background job queue, so the remaining I/O in web requests are mostly reasonably fast database and cache queries. A corollary is that the job processing side of applications tends to be much more I/O intensive than the web side. So job processors like Sidekiq can more frequently benefit from a higher thread count. But even for web servers, using threads can be seen as a perfectly acceptable tradeoff between throughput per dollar and latency. 

The main problem is that as of today there isn’t really a good way to measure how much the service latency is impacted by the GVL, so service owners are left in the dark. Since Ruby doesn’t provide any way to instrument the GVL, all we’re left with are proxy metrics, like gradually increasing or decreasing the number of threads and measuring the impact on the latency metrics, but that’s far from enough.

That’s why I recently put together a feature request and a proof-of-concept implementation for Ruby 3.2 to provide a GVL instrumentation API. It’s a really low-level and hard-to-use API, but if it’s accepted, I plan to publish a gem exposing simple metrics that show exactly how much time is spent waiting for the GVL, and I hope application performance monitoring services will include it.

Ractors and Fibers—Not a Silver Bullet Solution

In the last few years, the Ruby community has been experimenting heavily with other concurrency constructs that could potentially replace threads: Ractors and Fibers.

Ractors can execute Ruby code in parallel: rather than sharing one single GVL, each Ractor has its own lock, so they could theoretically be game changing. However, Ractors can’t share any global mutable state, so even sharing a database connection pool or a logger between Ractors isn’t possible. That’s a major architectural challenge that would require most libraries to be heavily refactored, and the result would likely not be as usable. I hope to be proven wrong, but I don’t expect Ractors to be used as units of execution for sizable web applications any time soon.
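
As a minimal illustration of that restriction (the constant here is just a stand-in for something like a logger or a connection pool):

LOGGER = Object.new # any mutable, non-shareable object

ractor = Ractor.new do
  # Touching a non-shareable object owned by the main Ractor raises
  # Ractor::IsolationError instead of silently sharing it.
  LOGGER.inspect
end

ractor.take # => raises Ractor::RemoteError, caused by Ractor::IsolationError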

As for Fibers, they’re essentially lighter threads that are cooperatively scheduled, so everything said in the previous sections about threads and the GVL applies to them as well. They’re very well suited to I/O-intensive applications that mostly move byte streams around and don’t spend much time executing Ruby code, but an application that doesn’t benefit from more than a handful of threads won’t benefit from fibers either.

YJIT May Change the Status Quo

While it’s not yet the case, the advent of YJIT may significantly increase the need to run threaded servers in the future. Since just-in-time (JIT) compilers speed up code execution at the expense of memory that can’t be shared, JITing Ruby will decrease CoW performance, but it will also make applications proportionally more I/O intensive.

Right now, YJIT only offers modest speed gains, but if in the future it manages to provide even a 2x speedup, it would certainly allow application owners to increase their number of web threads accordingly to compensate for the increased memory cost.

Tips to Remember

Ultimately, choosing between process-based and thread-based servers involves many trade-offs, so it’s unreasonable to recommend either without first looking at an application’s metrics.

But in the abstract, here are a few quick takeaways to keep in mind: 

  • Always enable application preloading to benefit from CoW as much as possible. 

  • Unless your application fits on the smallest offering of your hosting provider, use a smaller number of larger containers rather than a larger number of smaller containers. For instance, a single box with 4 CPUs and 2GiB of RAM is more efficient than four boxes with 1 CPU and 512MiB of RAM each.

  • If latency is more important to you than keeping costs low, or if you have enough free memory for it, use Unicorn to benefit from the reliable request timeout. 

  • Note: Unicorn must be protected from slow client attacks by a reverse proxy that buffers requests. If that’s a problem, Puma can be configured to run with a single thread per worker.

  • If using threads, start with only two threads unless you’re confident your application is indeed spending more than half its time waiting on I/O operations (see the Puma sketch after this list). This doesn’t apply to job processors, since they tend to be much more I/O intensive and are much less latency sensitive, so they can easily benefit from higher thread counts.
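
Putting a few of these tips together, a Puma configuration along those lines might look like this (the values are illustrative, not a recommendation for any particular app):

# config/puma.rb
workers ENV.fetch("WEB_CONCURRENCY", "4").to_i # one process per available core
threads 2, 2                                   # a fixed pool of two threads per worker
preload_app!                                   # load the app before forking to benefit from CoW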

Looking Ahead: Future Improvements to the Ruby Ecosystem

We’re exploring a number of avenues to improve the situation for both process and thread-based servers.

First, there’s the GVL instrumentation API mentioned previously that should hopefully allow application owners to make more informed trade-offs between throughput and latency. We could even try to use it to automatically apply backpressure by dynamically adjusting concurrency when GVL contention is over some threshold.

Additionally, threaded web servers could theoretically implement a reliable request timeout mechanism. When a request takes longer than expected, they could stop forwarding requests to the impacted worker and wait for all other requests to either complete or timeout before killing the worker and reforking it. That’s something Matthew Draper explored a few years ago and that seems doable.

Then, the CoW performance of Ruby itself could likely be improved further. Several patches have been merged for this purpose over the years, but we can probably do more. Notably, we suspect that Ruby’s inline caches cause most of the VM bytecode to become unshared once it’s executed. I think we could also take some inspiration from what the Instagram engineering team did to improve Python’s CoW performance. For instance, they introduced a gc.freeze() method that instructs the GC that all existing memory regions will become shared. Python uses this information to make more intelligent decisions around memory usage, like not using any free slots in these shared regions, since it’s more efficient to allocate a new page than to dirty an old one.

Jean Boussier is a Rails Core team member, Ruby committer, and Senior Staff Engineer on Shopify's Ruby and Rails infrastructure team. You can find him on GitHub as @byroot or on Twitter at @_byroot.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by design.
