Go's concurrency model makes it easy to develop scalable servers and data pipelines. Many of the patterns we use in developing concurrent code mirror structures in real-world systems. In this talk, Sameer presents a simulation of a small real world system and shows how variations in the design impact the system's performance.
In 2009, I fell in love. I fell in love with Go's concurrency model at a tutorial taught by Rob Pike. You see, I had been programming distributed systems in C++, and so I thought of concurrency in terms of callbacks and locks and thread pools. Go introduced me to a new way of thinking about my programs. For example, I had a couple of libraries for reading and writing data in replicated Bigtables. I was able to convert 700 lines of twisty C++ to just 100 lines of Go. The Go code made the structure of my program easier to understand, so I was able to generalize the algorithm and combine multiple C++ libraries into a single Go library. After that experience, I began contributing to the Go libraries inside Google, and in 2011 I joined the team to work on Go full time. But while my work has focused on scaling Go to larger systems and teams, my love of the language centers around its concurrency model. Today, I want to talk about how studying real-world systems can help us become better Go programmers.
Go was my first practical introduction to CSP, the Communicating Sequential Processes model. For me, the real charm in this model is how it mirrors the real world. I live in New York City, a place that’s constantly striving to scale with increasing population and tourism. I see concurrency and communication in the myriad systems that keep the city running.
I ride the New York subway to work, and I’m reminded daily of this system’s limited capacity and variable latency.
Every Election Day, I’m reminded of the massive parallel computation we run to sum up a few very important numbers. We see concurrency and contention throughout this process, and repeated aggregation and communication as we work towards the final totals.
But most of all, I see concurrency in services: in systems that serve requests and need to deliver results quickly and reliably, handle the incoming load gracefully, and scale up to meet the city's ever-growing demands. Today I'm going to present a simulation written in Go of one real-world system, a coffee shop. I created this simulation to explore a few properties of services: latency, throughput, contention, and utilization. I evaluated several implementations of the coffee shop in Go, each one mapping to a slightly different real-world scenario. I hope to show you the power in Go’s model and what we can learn about system dynamics with this simulation.
The simulator is a kind of benchmark harness that exercises these various implementations and measures their performance. This diagram shows the goroutines and channels that form the harness. The first stage generates load, in our case, customers for the coffee shop. The second stage executes the function being measured, in our case, an “order coffee & wait” function. This stage measures how long the function takes to execute and reports that duration to the final goroutine, which aggregates the results. We vary the number of goroutines in the second stage, which you can think of as the number of people requesting a coffee simultaneously. We'll vary the function being executed by the workers to explore the design space and find the best-performing coffee shop.
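For readers following along in text form, here is a minimal sketch of that kind of harness. The names (harness, brew, load, results) and the use of time.Sleep as a stand-in for the function under test are my own; the actual simulator's code will differ.

```go
package main

import (
	"fmt"
	"time"
)

// harness runs `orders` requests through `workers` concurrent goroutines,
// timing each call to brew and returning the measured latencies.
func harness(workers, orders int, brew func()) []time.Duration {
	load := make(chan struct{})         // load generator -> workers
	results := make(chan time.Duration) // workers -> aggregator
	done := make(chan struct{})

	// Stage 1: load generator, one value per customer.
	go func() {
		for i := 0; i < orders; i++ {
			load <- struct{}{}
		}
		close(load)
	}()

	// Stage 2: workers execute and time the function under test.
	for i := 0; i < workers; i++ {
		go func() {
			for range load {
				start := time.Now()
				brew()
				results <- time.Since(start)
			}
			done <- struct{}{}
		}()
	}

	// Close results once every worker has finished.
	go func() {
		for i := 0; i < workers; i++ {
			<-done
		}
		close(results)
	}()

	// Stage 3: aggregator collects all latencies.
	var latencies []time.Duration
	for d := range results {
		latencies = append(latencies, d)
	}
	return latencies
}

func main() {
	// time.Sleep stands in for the "order coffee & wait" function under test.
	lat := harness(4, 100, func() { time.Sleep(4 * time.Millisecond) })
	fmt.Println("lattes prepared:", len(lat))
}
```

The shape matches the diagram: one goroutine generating load, a configurable number of workers timing the function under test, and one goroutine aggregating the durations.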
Our “order coffee & wait” function makes a latte. In our simulation, this happens in four steps. Step 1: grind the coffee beans. Step 2: make espresso using the grounds. Step 3: steam the milk. Step 4: combine the espresso and milk to make the latte. By default, each of these steps uses the CPU for 1 millisecond, so if we do the four steps in sequence, the total time to prepare the latte is 4 milliseconds.
This code shows the idealBrew function, which runs the four steps and returns the latte. The first step grinds the coffee using the grinder and returns the grounds. The second step prepares the coffee using the espresso machine and the grounds. The third step steams the milk using the steamer. And the final step makes the latte using the coffee and milk.
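Here is a rough sketch of what that function might look like, with each step modeled as a millisecond of busy-work. The type and helper names (grounds, coffee, work, and so on) are placeholders of mine rather than the simulator's actual API.

```go
package coffeeshop

import "time"

// Hypothetical product types; the simulator's real types may differ.
type (
	grounds struct{}
	coffee  struct{}
	milk    struct{}
	latte   struct{}
)

// work simulates d worth of CPU time by spinning.
func work(d time.Duration) {
	for start := time.Now(); time.Since(start) < d; {
	}
}

func grind() grounds                 { work(time.Millisecond); return grounds{} } // step 1
func press(g grounds) coffee         { work(time.Millisecond); return coffee{} }  // step 2
func steam() milk                    { work(time.Millisecond); return milk{} }    // step 3
func combine(c coffee, m milk) latte { work(time.Millisecond); return latte{} }   // step 4

// idealBrew runs the four steps in sequence: about 4ms of CPU per latte.
func idealBrew() latte {
	g := grind()
	c := press(g)
	m := steam()
	return combine(c, m)
}
```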
These two charts show the performance of our coffee shop as we vary the number of workers from 1 to 6. In the ideal implementation, we can prepare as many lattes in parallel as our CPUs can handle. So with one CPU, we can prepare 1 latte in 4 milliseconds. Our throughput—or rather “brewput”—is 250 lattes per second. This is the starting point on the left-hand chart. Throughput scales linearly with more CPUs. So with 6 CPUs, we get 1500 lattes per second. On the throughput chart, higher values are better. The right-hand chart shows the median time required to make a latte—the “cafe au lait-ency”—which stays flat at 4 milliseconds. On this chart, lower values are better. We’ll return to charts like these throughout the talk to compare our implementations.
Unfortunately, our ideal coffee shop isn't too realistic. In a real coffee shop, each of the first three steps requires a specific machine: a grinder for grinding coffee beans, an espresso machine for making espresso, and a steamer for steaming milk. In the simulation code, these machines are real data structures that track the latency of each stage. Just as multiple people can't use the same machine at once, it is a data race for multiple goroutines to call “add” on the same machine data structure simultaneously.
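As a hedged guess at what such a machine data structure might look like (reusing the work helper from the sketch above; the field names and the add signature are my inventions, not the simulator's):

```go
// machine tracks the latency of each use. The times slice is shared,
// unsynchronized state: two goroutines calling add on the same machine
// at the same time is a data race.
type machine struct {
	name  string
	times []time.Duration // latency of each use; not safe for concurrent writes
}

func (m *machine) add(d time.Duration) {
	start := time.Now()
	work(d) // occupy the machine for d of CPU time
	m.times = append(m.times, time.Since(start))
}
```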
If we run the ideal implementation with multiple CPUs and enable the race detector, we get a runtime error indicating the data race. So our first lesson is: remember to test with the race detector! But you all knew that already. There’s another useful tip here: if you create your own simulation in Go, you can use the race detector to check that you’re synchronizing access to shared resources properly.
So how can we prevent multiple goroutines from accessing the same machine simultaneously? One way would be to put a lock on the entire set of machines. This would be like putting all the coffee machines in a small kitchen and only allowing one person in at a time. The code on the right implements this whole-kitchen-locking scenario.
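Here is a sketch of that whole-kitchen-locking idea, building on the machine type above and adding only the sync package; the kitchen struct and lockedBrew name are assumptions of mine, not the simulator's code.

```go
// kitchen holds all three machines behind a single mutex, so only one
// latte can be in progress at a time.
type kitchen struct {
	mu       sync.Mutex // guards every machine in the kitchen
	grinder  *machine
	espresso *machine
	steamer  *machine
}

func (k *kitchen) lockedBrew() latte {
	k.mu.Lock()
	defer k.mu.Unlock()
	k.grinder.add(time.Millisecond)  // grind the beans
	k.espresso.add(time.Millisecond) // make the espresso
	k.steamer.add(time.Millisecond)  // steam the milk
	work(time.Millisecond)           // combine espresso and milk
	return latte{}
}
```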
When we measure this scenario, we find the exact opposite of the ideal scenario. Throughput stays flat at 250 lattes per second. And latency grows linearly with CPUs. Why? Because there is no way for more than one person to make a coffee at a time. Each additional CPU is another person waiting in line to make their coffee. So when the Nth person joins the line, they must wait 4 milliseconds for each of the preceding N-1 people. Of course, we know that we can do better than this, because we see better in the real world.
In the real world, different people can use the different machines simultaneously, as long as the kitchen is big enough. One person can use the grinder while another makes espresso, and a third can steam their milk. Now, this isn't a traditional coffee shop, with baristas behind the counter. This is more like a self-serve coffee kitchen, where anyone can use any machine. We'll come back to the barista model later.
In Go, we can implement this scenario using a mutex for each machine. A goroutine locks the grinder mutex, uses the grinder, then unlocks it; then it locks the espresso machine mutex, and so on. So now, instead of locking the whole kitchen for 4 milliseconds, we're just locking each of the three machines for 1 millisecond each. The fourth phase, making the latte with coffee and milk, doesn’t need any locks at all.
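In sketch form, continuing the same types, fine-grain locking might look like this; again, the names are mine.

```go
// guarded pairs a machine with its own mutex.
type guarded struct {
	mu sync.Mutex
	m  *machine
}

func (g *guarded) use(d time.Duration) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.m.add(d)
}

func fineGrainBrew(grinder, espresso, steamer *guarded) latte {
	grinder.use(time.Millisecond)  // lock held only ~1ms
	espresso.use(time.Millisecond) // lock held only ~1ms
	steamer.use(time.Millisecond)  // lock held only ~1ms
	work(time.Millisecond)         // no lock needed for the final step
	return latte{}
}
```

Each Lock/Unlock pair now brackets only a single machine's millisecond of work, which is exactly what lets several people overlap in the kitchen.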
Now, our throughput and latency curves look more interesting as we add CPUs. Up to four CPUs, throughput grows linearly and latency stays flat, just like in our ideal implementation. But with the fifth and sixth CPUs, throughput stays flat and latency starts increasing. What's happening here?
When there are just three people in the kitchen, each takes their turn at a machine and moves on. Beyond 3, each additional person needs to wait their turn, because all the machines are in use. Notice, though, that latency is increasing much more slowly than in the kitchen-locking scenario. This is because the pipeline of people making coffee advances each millisecond.
We enabled greater parallelism and increased throughput by minimizing the time we held the lock on any one resource. We also avoided holding any locks during the fourth phase, as there’s no contention on any shared resource in that step.
But why does throughput flatten after 4 CPUs?
To understand what’s happening, let’s model how this coffee-making work gets scheduled onto CPUs.
The first coffee runs on CPU1, taking 1 millisecond per stage. The second coffee waits 1 millisecond to use the grinder, then can proceed in parallel on CPU2. The third coffee waits again for the grinder, then proceeds on CPU3. And the same for the fourth coffee on CPU4.
But what about the fifth? By the time the grinder is free, CPU1 is free again, so it can run there. It can’t possibly start sooner because the grinder is fully utilized. This pattern continues indefinitely. There is no way for this system to use more than 4 CPUs in parallel because of contention on the grinder. That’s why the throughput of this system flattens at 1000 lattes per second: 4 CPUs running at 250 lattes per second.
So given this structural limit, how can we increase the performance of our system? Let's think about the real world again. If you wanted to make a coffee and saw people lining up to use the machines, what would you do? Well, it depends on how long the line is. You might just give up, but let’s assume you really need that coffee. You’d probably find another place to get it. The problem is that our critical resources, the three machines, are running at capacity. This suggests a remedy: if there were a second set of machines, or a second coffee shop, more people could make coffee simultaneously.
What happens if we double our machines, so we have two grinders, two espresso machines, and two steamers? Let’s simulate this and find out. One way we can implement this scenario in Go is by creating a buffered channel of size 2 for each machine type and putting 2 machines in each channel. Now, instead of locking a mutex, a goroutine receives from the channel to get access to 1 of the 2 machines. When it's done with the machine, the goroutine sends the machine back on the channel.
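Here is a sketch of that channel-as-pool technique, continuing the earlier types; newPool and the brew function names are mine.

```go
// newPool returns a buffered channel preloaded with n machines.
// Receiving acquires a free machine; sending returns it to the pool.
func newPool(n int) chan *machine {
	pool := make(chan *machine, n)
	for i := 0; i < n; i++ {
		pool <- &machine{}
	}
	return pool
}

func useFromPool(pool chan *machine, d time.Duration) {
	m := <-pool // block until one of the n machines is free
	m.add(d)
	pool <- m // put it back for the next goroutine
}

func multiBrew(grinders, espressos, steamers chan *machine) latte {
	useFromPool(grinders, time.Millisecond)
	useFromPool(espressos, time.Millisecond)
	useFromPool(steamers, time.Millisecond)
	work(time.Millisecond) // combine espresso and milk
	return latte{}
}
```

Used this way, a buffered channel acts as a counting semaphore: its capacity is the number of machines that can be in use at once.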
With two sets of machines, we see ideal performance up to 6 CPUs. Throughput increases linearly, and latency stays flat.
Here’s another way to compare the performance of the implementations we’ve seen so far. The chart on the left shows the throughput of each implementation when running with 6 CPUs. The chart on the right shows their latency distributions: the rectangles indicate the middle 50% of latencies, while the vertical lines indicate the middle 90%. For example, the locking scenario has a throughput of 250 lattes per second and latencies between 22 and 25 milliseconds, with most between 23 and 24.
You can see that whole-kitchen locking has much lower throughput and much higher latency than our ideal implementation. The fine-grain locking implementation does better, peaking at around 1000 lattes per second due to the structural limits we saw earlier. We overcome this structural limit by adding more capacity, that is, more coffee machines. With two of each machine, we achieve ideal performance. Doubling again to four of each machine gains us nothing, since we’ve already maxed out our 6 CPUs. Now our CPUs are the limiting factor, so to increase performance further, we’ll need more CPUs.
So far, we’ve simulated a shared coffee kitchen, in which each person making coffee takes turns using a shared set of machines. But that’s not what we see in the real world. In the real world, we usually see a small number of baristas operating the machines. This is more efficient because there are fewer people moving around the kitchen, and the baristas operate the machines more quickly than the average customer.
The next simulation I’ll present is a coffee assembly line, in which one person operates each machine. The first person operates the grinder and passes the grounds to the second person, who makes the espresso, and so on. Let’s see how this is implemented in Go.
Finally, we have some goroutines and channels! In this pipeline, we have three stages: a grinder, a presser, and a steamer. The grinder receives new orders on the orders channel, grinds the beans, adds the grounds to the order, and passes it along to the next stage. The presser makes the espresso using the grounds and passes the coffee along to the steamer. The steamer steams the milk and passes it back to the goroutine waiting on the order. That goroutine combines the coffee and milk to make the latte.
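Here is a sketch of such a pipeline, continuing the earlier types. The order fields, startPipeline, and bufSize are my names; with bufSize set to 0 this is the unbuffered pipeline, and nonzero sizes give the buffered variants discussed next.

```go
// order carries its intermediate products through the pipeline, plus a
// channel on which the finished order is sent back to the customer.
type order struct {
	grounds grounds
	coffee  coffee
	milk    milk
	done    chan order // the steamer stage sends the order back here
}

func startPipeline(bufSize int) chan<- order {
	orders := make(chan order, bufSize)  // customers -> grinder
	ground := make(chan order, bufSize)  // grinder -> presser
	pressed := make(chan order, bufSize) // presser -> steamer

	go func() { // grinder stage
		for o := range orders {
			o.grounds = grind()
			ground <- o
		}
		close(ground)
	}()
	go func() { // presser stage
		for o := range ground {
			o.coffee = press(o.grounds)
			pressed <- o
		}
		close(pressed)
	}()
	go func() { // steamer stage
		for o := range pressed {
			o.milk = steam()
			o.done <- o
		}
	}()
	return orders
}

// pipelineBrew is what each customer goroutine runs: submit an order,
// wait for it to come back, then combine the espresso and milk itself.
func pipelineBrew(orders chan<- order) latte {
	o := order{done: make(chan order)}
	orders <- o
	got := <-o.done
	return combine(got.coffee, got.milk)
}
```

Note that a send on an unbuffered channel blocks until the next stage receives; that hand-off behavior is exactly what comes up next.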
When we look at the performance of this pipeline (the green line), it looks a lot like the fine-grain locking implementation. The latency is identical, but the throughput is slightly less until we reach 6 CPUs. Why? The latency is the time it takes to run each of the four stages, so it makes sense that it would remain the same in the locking and pipeline implementations. But the pipeline’s throughput is less because it is not utilizing the three contended machines as efficiently.

Here’s the real-world analogy: consider what happens when the person using the grinder finishes grinding the beans. In the locking implementation, that person steps away from the grinder and starts waiting for the espresso machine; someone else can start using the grinder immediately. But in the pipeline implementation, the person grinding beans has to wait to hand off the grounds to the person making espresso before they can start the next grind. If the second person is busy making espresso, the first person just stands there, holding out the grounds. This is a blocked channel send in our implementation, and it leaves the grinder idle, underutilizing our resources.

Of course, in the real world, the person would just set the grounds down on the counter and start grinding more beans for the next coffee. That counter space allows the different stages of the pipeline to stay busy even if they’re a little out of sync. And in the real world, where grinding beans or making espresso might take more or less time due to natural variance, we need that flexibility. So how do we model that counter space in Go? By adding buffers to the channels that connect our pipeline stages. These buffers absorb the variance between stages and so increase throughput and utilization.
I tested pipelines with buffer sizes 1 and 10. Their lines overlap on the charts. Both of these buffered pipelines outperform the unbuffered pipeline, achieving the optimal throughput of 1000 lattes per second. This is because they eliminate the requirement for stages to proceed in sync, and so allow more work to proceed in parallel.
While we see a benefit with buffering, it’s important to keep these gains in perspective. These charts compare all the implementations we’ve seen, plus two “multipipe” scenarios. The “multipipe” scenarios run multiple copies of our buffered coffee pipelines, much as the “multi” scenarios used multiple copies of the various machines. You can think of “multipipe-2” as a world with 2 coffee shops next door to each other.
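As a hedged guess at how a “multipipe” scenario could be built on the pipeline sketch above (the function names are mine, not the simulator's):

```go
// startMultipipe runs n independent buffered pipelines, like n coffee
// shops side by side.
func startMultipipe(n, bufSize int) []chan<- order {
	pipes := make([]chan<- order, n)
	for i := range pipes {
		pipes[i] = startPipeline(bufSize)
	}
	return pipes
}

// The i-th customer simply uses pipeline i modulo n.
func multipipeBrew(pipes []chan<- order, i int) latte {
	return pipelineBrew(pipes[i%len(pipes)])
}
```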
What we see in these charts is that the differences between fine-grain locking and the various pipelines are all pretty minor. The two big gains came from structural changes. The first was moving from the whole-kitchen lock to fine-grain, per-machine locks. This reduced the time spent in any one critical section, so that more work could run in parallel. The second was recognizing when our existing resources were fully utilized and adding more capacity. This allowed us to max out our 6 CPUs. The CPUs are now our limiting resource, so if we want to get more throughput, we’ll need to add more CPUs to run our simulation.
In preparing for this talk, I tried many more scenarios than the ones I’ve shown you so far. I tried changing the number of stages, changing their duration, and adding random noise. I tried making the steam-milk stage run in parallel with the other stages. What’s remarkable is how little any of that mattered. Most changes had only small effects on performance, but the structural changes provided major gains.
The lesson is to identify and remove the structural barriers to parallelism in your system. Removing these barriers will help your system scale. We did this today by reducing the time spent in critical sections. We did it again by adding more replicas of contended resources. And we did it with buffering, which allowed upstream pipeline stages to proceed without blocking on downstream stages. While we’ve focused on the benefits of these changes, remember that they also have costs, such as increased resource use.
I also encourage you to take inspiration from real-world systems. Try to understand why they are the way they are. These insights will help you understand structural performance issues and help you discover new designs.
One last thing— I encourage you all to download the simulator and play with it! The code is straightforward, less than a thousand lines of Go for everything. Try some more scenarios. Dig into the results using Go’s profiling tools. Or perhaps try modeling a new real-world system in Go. You’ll learn more about how that system works, and you’ll learn a lot about Go itself. Thank you.