Petar Maymounkov

The clash between concurrency and parallelism in practice

Thu, Feb 7, 2013

Go's co-creator Rob Pike is, by now, known for emphasizing the distinction between “concurrency” and “parallelism” when it comes to reasoning about program design. Concurrency is a property of the algorithm that you are designing. It determines which parts of your data-processing logic are intrinsically independent (under all inputs and circumstances). Parallelism is a property of the realization of your algorithm. This is not your source code, but the final executable or — even more abstractly — the behavior of your program when executed.

In this context, the holy grail (of software building) is to turn every bit of concurrency into actual parallel computation at execution time. Typically, this task is achieved by a combination of (a) hand-coded explicit implementation of parallel behavior where necessary, as well as (b) smart optimizations by the compiler, linker and/or runtime system.

A key objective in modern language design (at least in a substantial part of application domains) is to completely do away with the need for manual specification of the parallel behavior in the algorithm realization. Instead, it is desired that a programming language provides means for expressing the concurrency in one's algorithms in a way that (a) is natural to reason about as a human, and (b) is independent of the target “physical” device that will realize the computation. This is the “loosely speaking” version of the problem.

The more precise version asks for two things: First, a language that allows for natural implementation of algorithmic ideas and encoding of their concurrencies; and, second, a compiler that can convert a maximum amount of these concurrencies into parallelism on a whole range of different physical devices. In other words, it is desired that the programmer be entirely shielded from the complexities of parallelism, in that the compiler takes upon itself the burden of converting linguistically-described concurrency into concrete physical parallelism. This is a hefty goal, almost definitely not achievable in “full generality”.

This goal certainly comes across as one of the ambitions behind the design of the Go language.

I'd like to make an important remark here: Some amount of concurrency is always possible, but it is not a given; it has to be discovered by the programmer and communicated to the compiler in a way that is workable for both parties. In practice, more often than not, the optimal concurrency is never fully identified, but a sufficiently good chunk of it is. The situation is very similar to the one in Combinatorial Optimization: every instance of an NP-hard problem has an optimal solution, yet one has no realistic hope of finding it. Instead, one works hard and settles for one's best attempt. The point is that discovering (non-trivial) concurrency is an art, and it is very much the job of the programmer. The job of the language (and the compiler) is to make it easy for the programmer to communicate their ideas about concurrency, while still being able to compile these ideas efficiently into parallelism.

Go has done a pretty great job at providing a clean metaphor for describing concurrency and separating it as best as possible from parallelism.

Of course, even in Go the separation between parallelism and concurrency is not 100%. I am not aware of a general-purpose and/or low-level language that achieves absolute separation. (This is easier to do in domain-specific languages, if the domain is narrow enough.)

What I'd like to do here is show a “corner case” that, in my opinion, illustrates the point of contention between concurrency and parallelism in Go through a simple example. Consider the following elementary function that modifies each value in an in-memory slice.

func job(x []int) {
	for i := 0; i < len(x); i++ {
		x[i] = i
	}
}

This example is less contrived than it looks. Simple in-memory sweeps of giant slices (think 100GB) are quite common in big-data analysis and machine learning (think MapReduce, for example). The trouble is that sweeping a 100GB slice sequentially, even in memory, takes more time than you are willing to give. The only sensible thing to do here is to use all available hardware parallelism (e.g. however many cores you happen to have) to perform the loop in parallel, noticing that all iterations are independent of one another. How would you implement that?

The crux of the problem is that the amount of available physical parallelism will vary depending on (a) which machine you run your code on and (b) what other things are running on that machine. Forget about the latter altogether. Even with (a) alone we can make our point. Our task can be summarized as follows:

Implement the desired computation so that, for input x of length $n$, when executed on a device with $p$ threads of parallelism (e.g. a CPU with $p$ cores), the computation runs in time $T(n,p)$ where:
  • $T(n,1)$ equals the time to execute the simple for-loop above,
  • $T(n,p)=T(n,1)/p$, and
  • $p$ is not known ahead of time (i.e. $p$ is not known at implementation time).

The only solution to this problem is, in fact, the (essentially unique) implementation that communicates to the compiler all the concurrency inherent in the algorithm (i.e. the fact that each iteration is independent). And here is one clean way to do it in Go (please educate me if you know better):

func job(x []int) {
	var wg sync.WaitGroup
	wg.Add(len(x))
	for i := 0; i < len(x); i++ {
		// One goroutine per element: the full concurrency of the
		// algorithm is communicated to the runtime.
		go func(i int) {
			defer wg.Done()
			x[i] = i
		}(i)
	}
	wg.Wait()
}

No one would actually write this code. There might be many different slick ways to write it, with fewer words or using channels with showmanship (one such variant is sketched below), but the main point is that, as far as I can tell, each one of them is bound to generate, up front, as many goroutines as there are elements in the slice. This has the practical consequence of requiring an order of magnitude more space for the goroutines' data structures than for the input itself, no matter how skinny a goroutine is in memory.
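
For illustration, here is a sketch of such a channel-based variant (my code, not from the original post); it dispenses with sync.WaitGroup, yet it still launches one goroutine per element:

func job(x []int) {
	done := make(chan struct{})
	for i := 0; i < len(x); i++ {
		go func(i int) {
			x[i] = i
			done <- struct{}{} // signal that this element is written
		}(i)
	}
	// Collect exactly len(x) completion signals.
	for i := 0; i < len(x); i++ {
		<-done
	}
}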

So, the main compromise solution is to do something like this: Make a pool of sufficiently many goroutines, so that their count exceeds any reasonable number of cores you might have at your disposal; to be generous, multiply that number by 100 on top. The point is that we will use a constant number of goroutines. An example implementation is illustrated below. You should be able to convince yourself that this solves the space problem, and that on all reasonable hardware it still teases out the maximum parallelism.

const MaxNumberOfCores = …

func job(x []int) {
	// The buffered channel acts as a counting semaphore: at most
	// MaxNumberOfCores iterations are admitted at any one time.
	ch := make(chan int, MaxNumberOfCores)
	var wg sync.WaitGroup
	wg.Add(len(x))
	for i := 0; i < len(x); i++ {
		ch <- i // blocks while the semaphore is full
		go func(i int) {
			defer wg.Done()
			x[i] = i
			<-ch // release the slot
		}(i)
	}
	wg.Wait()
}
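
For comparison, here is a sketch of the other common shape of the same compromise (again my code, not from the original post; it assumes the runtime and sync packages, and in the Go of 2013 it would also require raising runtime.GOMAXPROCS for the goroutines to run in parallel at all): chunk the slice into roughly runtime.NumCPU() contiguous pieces, one goroutine per piece.

func job(x []int) {
	p := runtime.NumCPU()         // the hardware assumption enters the code here
	chunk := (len(x) + p - 1) / p // ceiling division: elements per worker
	var wg sync.WaitGroup
	for lo := 0; lo < len(x); lo += chunk {
		hi := lo + chunk
		if hi > len(x) {
			hi = len(x)
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				x[i] = i
			}
		}(lo, hi)
	}
	wg.Wait()
}

Either way, the constant bounding the number of goroutines reflects the hardware, not the algorithm.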

There is one twist, though: The goroutine pool we created is not concurrency code. It has nothing to do with our main algorithm. Instead, it is a hack, based on hardware assumptions, that we introduced to solve the clean solution's space issue. To me, this is not a criticism of Go, because a criticism would have to be relative to some other comparable language that doesn't have this kink. Nor is it a criticism of Go because the “space issue” discussed here is an artifact of the compiler, not of the language. (It is conceivable that, in the future, a static optimizer could make the first version above work well.) To me, it is a way of saying:

Hey, keeping this simple example in mind is a great aid to staying mindful of when, and how much, parallelism (i.e. non-concurrency) code you introduce into your source as you “go.” (No pun intended :)
