Why do programs have to be manually optimized for multiple CPU cores? Why is single-core performance such a bottleneck?


For a long time, single-core performance has been the most important feature for gaming. Though we are getting better multi-threaded games, we are still pushing for maximum single-core performance instead of more cores. Why can’t 16 × 2 GHz cores do roughly as good a job as 8 × 4 GHz cores (of the same CPU architecture), especially in gaming?

They say that software programmers have to manually split the jobs for multiple cores. Why? Why does the OS even need to operate multiple cores in the first place? To me this sounds like bad workplace management, where the results depend on pushing the limits of the same few people (cores) instead of splitting the work. I feel like making just a big bunch of cheap cores would give better performance for the money than doing tons of research for the best possible elite cores. This works for encoding jobs but not for snappy game performance.

Now, one limitation that comes to mind is sequential jobs. Things where the steps need to be done in a certain order, depending on the results of the previous step. In this case, higher clock speed has an advantage and you wouldn’t even be able to utilize multiple cores. However, I still feel like the clock speeds like 4 000 000 000 cycles per second can’t be the limiting factor for running a game over 150 frames per second. Or is it? Are the CPU jobs in game programming just so sequential? Is there any way to increase the speed of simple sequential jobs with the help of more cores?

Bonus question: How do multiple instructions per cycle work if a job is sequential?

Bonus question 2: GPUs have tons of low power cores. Why is this okay? Is it just because the job of rendering is not sequential at all?


It’s hard to convey to someone just how difficult multi-core programming is if they don’t have a strong programming background.

> instead of splitting the work

And therein lies the rub: splitting work across cores is extremely difficult. Without programmer assistance, the CPU cannot understand the program’s structure well enough to find significant amounts of work it can safely run in parallel.

> Are the CPU jobs in game programming just so sequential?

To put it simply, yes. Working across multiple cores is immensely difficult for the bulk of the work a game does.

> Is there any way to increase the speed of simple sequential jobs with the help of more cores?

This depends entirely on the specific details of the jobs.
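One rough way to reason about it is Amdahl’s law, the usual back-of-the-envelope model for this question. The sketch below assumes an idealized job where a fixed fraction must run on a single core and the rest splits perfectly with no overhead. The function name `run_time` and the 50% sequential figure are illustrative assumptions, not measurements from any real game.

```python
def run_time(work, serial_fraction, cores, ghz):
    """Idealized time to finish `work` units (Amdahl-style model).

    Assumes the serial part runs on one core and the rest splits
    perfectly across all cores, with no communication overhead.
    """
    serial = work * serial_fraction / ghz
    parallel = work * (1 - serial_fraction) / (ghz * cores)
    return serial + parallel


# Hypothetical workload: half of each frame's work is strictly sequential.
few_fast = run_time(work=1.0, serial_fraction=0.5, cores=8, ghz=4.0)
many_slow = run_time(work=1.0, serial_fraction=0.5, cores=16, ghz=2.0)

print(few_fast, many_slow)
```

Under these assumptions the two chips tie only when nothing is sequential (`serial_fraction=0`). With half the work sequential, the 16 slower cores take nearly twice as long as the 8 faster ones.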

> How do multiple instructions per cycle work if a job is sequential?

Each individual instruction engages different parts of the CPU at different times. So you can have two instructions “executing” simultaneously so long as they are using different parts of the CPU and neither needs the other’s result.

> GPUs have tons of low power cores. Why is this okay? Is it just because the job of rendering is not sequential at all?

Correct. Graphics rendering is a whole lot of “do this operation on every pixel” and each operation is independent of all others. You can have a thousand running simultaneously without much work by the programmer.
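As a small illustration of “do this operation on every pixel”, here is a sketch in Python. The image, the `brighten` operation, and the worker count are all made up for the example. Note that threads in CPython share one interpreter lock, so this shows the structure of the work rather than a real speedup; a GPU does the equivalent in hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel):
    """Per-pixel operation: depends only on its own input."""
    return min(pixel + 40, 255)


def brighten_row(row):
    return [brighten(p) for p in row]


# A tiny hypothetical 3x4 grayscale image.
image = [
    [0, 100, 200, 250],
    [10, 20, 30, 40],
    [255, 128, 64, 32],
]

# Rows can be handed to workers in any order because no row reads another.
with ThreadPoolExecutor(max_workers=3) as pool:
    result = list(pool.map(brighten_row, image))

print(result)
```

Nothing here needs a lock or any coordination between workers, which is why this kind of job scales to thousands of simple cores.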

Imagine you’re changing all the tires on a car.

With just one person, it takes forever. With two people, you can work on two tires at once. But why stop there? If we throw 16 or 64 people at it we can change it faster than an Indy 500 team, right? Well not really. You’re still limited by your equipment, just like single-core performance. It doesn’t matter how many people you have or how fast the other steps are if your jack is a slow scissor jack. And you need a human to decide which tasks get grouped in parallel so you don’t have one guy trying to screw lug nuts on a new tire before another guy gets the old tire off.

Having help on each tire can still speed things up, though. That isn’t because you’re creating more teams (cores) to handle more groups of tasks, but because you’ve got a pipeline: one person can be readying the next task while another is finishing the previous one.

(There are actually some techniques that let the CPU, rather than the programmer, decide what can run in parallel inside a core, such as out-of-order and superscalar execution. Hyperthreading is related but different: it lets one core interleave instructions from two separate threads. Either way, this happens per instruction, within a small window of nearby instructions, and depends on the data used by the instructions right before or after, which makes it a poor basis for parallelization across cores.)

It’s also worth pointing out here that not only is upgrading equipment (like a hydraulic jack) faster, it’s also more efficient. Even if you could change four tires with 100 low-paid guys using cheap equipment, imagine the amount of body heat that mosh pit crew would create.

But what if we had to do a really repetitive job, like buffing and polishing the car? If we got 50 people to polish at once, we’d get done way faster than a skilled team of three people. This is the idea behind GPUs: certain tasks, like graphics, consist of the same calculation repeated many times, so hundreds or thousands of them can be done in parallel.

Short answer, because multithreaded programming is *extremely* hard and often complicates things.

You can have threads that overlap on the same job.

You can have threads that each wait for the other to finish a task before continuing, so neither ever proceeds (known as a deadlock).

You can have threads that have access to the same data and both try to modify it at the same time, leading to a “race condition”.

You can have starved threads, which never get to run because too many threads are competing for too few resources.

Predictability is a problem with multithreaded programs. You cannot know in advance the order in which the threads will run, so you cannot know exactly how they will interact with shared data or if and when they will cause the issues above.

All of these things require different algorithms and features to resolve. Yes, overall it is faster, but only if done correctly, so it’s more desirable to have faster cores and limit the need for multithreading.
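To make the race condition concrete, here is a minimal Python sketch of the usual fix, a lock around the shared data. The counter, the thread count, and the `add_many` helper are invented for the example.

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        # The read-modify-write must not interleave with another thread's,
        # otherwise an update can be lost (the race condition).
        with lock:
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

The lock makes the result correct, but it also forces the four threads to take turns on the counter. That is the trade-off in miniature: the more data threads share, the more time they spend waiting for each other.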


You are building a house. You have 2 builders, who are experienced and work well together. The house is taking too long, so why not hire 4 more to speed it up?

These builders don’t know who is working on what. They start both trying to take the same brick at the same time, so now the building is halted until one of them concedes the brick.

The builders start to wait on one person to finish the frame, but the person building the frame is waiting for the other builders to finish their part.

And now, with more builders, you have 2 builders sitting around doing nothing because there’s not enough bricks anymore.

To solve it, you hire a manager to manage the site. He keeps tabs on the builders and decides who does what and adds systems to prevent conflict.

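The builders are threads. The bricks are shared resources, such as data in memory. The manager is the scheduling and synchronization system (locks, queues, and the like) that decides who gets what and when.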

Really simply put – if you have a set of instructions that can be followed by one person, you can’t always split those instructions so that two people work on different bits of it at the same time, you have to write the instructions specifically so they can be shared out. Then you have to worry about whether or not certain things can be done before certain other things have finished.
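A small Python sketch of the difference, using two made-up jobs: adding up a list, which splits cleanly, and a toy physics update, which does not. The function names and the physics formula are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked_sum(values, workers=4):
    """Splittable: each chunk is summed without looking at the others."""
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))


def simulate(position, velocity, steps):
    """Not splittable as written: step n needs the result of step n-1."""
    for _ in range(steps):
        velocity = velocity * 0.5 + 1.0   # depends on the previous velocity
        position = position + velocity    # depends on the updated velocity
    return position


print(chunked_sum(list(range(1, 101))))  # 5050
print(simulate(0.0, 0.0, 3))             # 4.25
```

Someone had to write `chunked_sum` so that the work is cut into independent pieces and the partial results are combined at the end. No amount of extra workers helps `simulate`, because each step has to wait for the one before it.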

Bonus Question 2: GPUs have the job of processing an array of data to figure out what the screen needs to show. The math is relatively straightforward, and it can be done in parallel because each pixel calculation is independent and the combination of them builds the image, so it’s best to split that pixel processing over as many cores as possible.

Cinebench runs are a good way of seeing this in practice, even though it’s CPU benchmarking software. The benchmark splits the image rendering task over as many threads as are available. So with more cores, you can render more parts of the image simultaneously, and because the parts are independent of each other, you aren’t waiting for adjacent parts to finish before starting the next one. Consistently, higher core count CPUs have the best runs despite lower clock speeds, because they can process more at once.

GPUs work similarly. The CPU has told the GPU what to do based on all its calculations, so the GPU’s job is to process the image. More cores make this job faster.

Before anyone jumps in: yes, the GPU is more complex. I simplified it significantly.

> Bonus question 2: GPUs have tons of low power cores. Why is this okay? Is it just because the job of rendering is not sequential at all?

This shouldn’t be surprising. A lot of the work that could be easily parallelized is already heavily parallelized in the GPU, and as GPUs become more general purpose they will absorb more of the parallelizable work besides rendering. In the end, what’s left for the CPU is the part that’s not so easy, or not possible, to parallelize. This already answers part of the question.

> Bonus question: How do multiple instructions per cycle work if a job is sequential?

It can happen in a mostly sequential program that there are many small operations where the order doesn’t matter, so there’s potential for a degree of micro-parallelism. For example, if you had something like x = 2+2, y = 4+5, z = x+y, then you first need x and y to calculate z, but x and y could be computed in parallel within a single core.
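The x, y, z example can be sketched as a tiny dependency calculation in Python. This is only a model: it assumes unlimited execution units and one cycle per instruction, which real CPUs do not have, and `earliest_cycles` is a made-up helper name.

```python
def earliest_cycles(deps):
    """Map each instruction to the earliest cycle it could issue,
    assuming unlimited execution units and one cycle per instruction."""
    cycle = {}

    def earliest(name):
        if name not in cycle:
            # An instruction can start one cycle after its slowest input.
            cycle[name] = 1 + max((earliest(d) for d in deps[name]), default=0)
        return cycle[name]

    for name in deps:
        earliest(name)
    return cycle


# x = 2+2, y = 4+5, z = x+y
deps = {"x": [], "y": [], "z": ["x", "y"]}
print(earliest_cycles(deps))  # {'x': 1, 'y': 1, 'z': 2}
```

Three instructions finish in two cycles instead of three because x and y share a cycle. In a chain where every instruction depends on the previous one, the same calculation gives no overlap at all.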

But ironically, when you want to share variables between threads you may lose some of these optimizations in order to ensure data consistency between them. Sharing data makes multithreading harder but can’t always be avoided. When there’s a single thread, the number of possibilities is limited, and under the single-thread assumption modern compilers can rearrange and optimize the code to make it perform better. When there are multiple threads, the developers must be very specific about how the threads share data and interact with each other.