SIMD, to my understanding, is supposed to accelerate 3D processing and other multimedia processing. GPUs do the exact same thing, except they're dedicated to that role. I've heard SIMD was an early attempt to make 3D processing possible on CPUs.
“A GPU is a compute device which implements SIMD (Single Instruction, Multiple Data) in a much more multi-threaded fashion; in fact, SIMD is coined as SIMT (Single Instruction, Multiple Thread) in a GPU. So basically a GPU is an extension of the SIMD paradigm with large-scale multi-threading, streaming memory and dynamic scheduling”
-Gary Cole, Quora
Unless I'm wrong, there are many other things SIMD can be useful for beyond multimedia, and it can help with traditional CPU tasks like integer operations, but I haven't found any info about SIMD being used outside of multimedia. It could do operations related to audio, which is very handy, but couldn't those instead be handled by a digital signal processor (DSP)?
My understanding of computer science is limited, so I hope to learn a lot from this post.
SIMD is a great, if only quasi-parallel, computing upgrade for CPUs. It's "quasi" because the basic idea of parallel computing assumes a theoretically infinite number of parallel processors, while in practice you're dealing with hundreds or thousands at most.
This is one way to make programs faster.
So to answer your question: when you want to speed up a computer, it's better to have multiple options and technologies than just two (CPU and GPU).
Moreover, if you have two GPUs, that's better than one, right? Even though they do the same thing.
There is quite a bit of overhead in transferring work/data between the GPU and CPU. There are many cases where the overhead of setting up the GPU to do the work will outweigh the savings, and this is even more likely if you need to get the results back to the CPU afterwards. Doing the work directly on the CPU bypasses this overhead; there may be some other minor overhead due to alignment or register requirements, but that overhead is extremely limited. A lot of the GPU overhead comes from having to transfer the data across the bus to the GPU, which is generally *much* slower than CPU memory access. This means that you ideally want the bulk of the data to be transferred once and reused – think data like textures or meshes, which are loaded once and reused across many hundreds of frames.
Additionally, the GPU is optimized to do extremely large chunks of identical data processing. Most GPUs run big blocks of *identical* operations at once, typically 32* and sometimes more, where only the input values can change. That is, if you tell the GPU to run a multiply, it's going to run 32 multiplies with 32 different inputs, even if you only have a single input. If you want to run 33 operations, it will end up running 64. When the number of operations doesn't line up with the block size, this wasted work can be a net negative. For comparison, a typical SIMD instruction on the CPU works on merely 4 inputs, though there is some variance.
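To make that "4 inputs" case concrete, here is a minimal sketch (my own illustration, not taken from any particular codebase) using x86 SSE intrinsics, where a single `_mm_mul_ps` instruction performs four float multiplies at once:

```cpp
#include <immintrin.h>  // x86 SSE intrinsics
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) float out[4];

    __m128 va = _mm_load_ps(a);      // load 4 floats into one 128-bit register
    __m128 vb = _mm_load_ps(b);
    __m128 vp = _mm_mul_ps(va, vb);  // one instruction does 4 multiplies
    _mm_store_ps(out, vp);

    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
}
```

Wider instruction sets (AVX, AVX-512) raise that to 8 or 16 floats per instruction, which is where the "some variance" comes in.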
The combination of these factors means there is still a lot of benefit to running smaller operations directly on the CPU. This even applies in cases where you need to do some moderate-sized bulk operations but would have to completely rearrange the data, or where the data is complex. In most cases, the GPU overhead is only worthwhile for things such as physics processing, graphics processing, and *some* cases of encryption or compression. Most cases of encryption and compression are better done on the CPU, due to how they use memory/data, but they still benefit from SIMD calculations.
There has also been some recent movement towards supporting hardware-accelerated loading, which is mostly useful for game loading, but can see benefit in other applications. The idea here is to provide methods by which compressed data can be read off disk into GPU memory and decompressed there, with the decompressed data written back into GPU memory, completely bypassing the bulk of the CPU overhead in the process.
TLDR: Good usage of the GPU requires more engineering work and applying specific design constraints. These are not always worth the costs involved, even if the calculation could benefit.
* The exact value depends on the exact GPU and configuration. A minimum block size of 32 is pretty common, while 1024 is often the highest supported size.
GPUs are insane at crunching humongous data sets.
But doing work on a GPU requires copying data into (then out of) its memory. This takes time, there’s a lot of overhead.
It’s often much faster to do smaller datasets on the CPU. Small enough that the time to do the work on the CPU might be less than the time it takes to even write/read data to the GPU. (Because there is overhead to even talk to the GPU)
Accelerating those tasks with SIMD makes them faster still.
The less time you spend waiting on memory transfers and the GPU, the better.
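To put very rough numbers on that, here's a back-of-envelope sketch; the bandwidth and latency figures below are assumptions picked purely for illustration, not measurements of any particular hardware:

```cpp
#include <cstdio>

int main() {
    // All numbers are assumed ballpark figures for illustration only.
    const double bytes        = 1e6;   // 1 MB of input data
    const double pcie_bps     = 16e9;  // assumed ~16 GB/s link to the GPU
    const double cpu_simd_bps = 30e9;  // assumed CPU SIMD processing rate for a simple operation
    const double launch_us    = 10.0;  // assumed GPU dispatch latency

    const double transfer_us = bytes / pcie_bps * 1e6;
    const double cpu_us      = bytes / cpu_simd_bps * 1e6;

    printf("copy 1 MB to GPU : ~%.0f us, plus ~%.0f us dispatch, plus copying results back\n",
           transfer_us, launch_us);
    printf("process 1 MB on CPU with SIMD: ~%.0f us, no copies at all\n", cpu_us);
}
```

With numbers anywhere in that ballpark, the CPU is finished before the data would even have arrived at the GPU.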
Communicating with a GPU and running programs on it entails a lot of latency. Just running a GPU program that does nothing takes around 10 microseconds, which is a decent length of time when you consider that we're talking about devices that run through a few billion cycles per second. That's more than 30,000 CPU cycles just for a GPU program that does nothing! And that's not even counting the actual communication with the CPU to dispatch the work and then read the results back.
SIMD instructions have very little overhead by comparison, since executing them is not fundamentally different from executing other CPU instructions. Yes, these instructions are often longer and heavier, but when the throughput gain can be 2, 4, 8, 16, 32, or even 64 times what you get without SIMD, this increased overhead easily gets drowned out. This makes SIMD very useful for solving problems that have parallelism at a very small scale, where GPU programming or even multithreading might not lead to a performance increase.
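As a sketch of what "parallelism at a very small scale" looks like, here's a tiny dot product written with x86 AVX intrinsics (an illustrative example of mine, not from any particular library). Eight multiplies happen in a single instruction, and the whole job is far too small to ever be worth a GPU round trip:

```cpp
#include <immintrin.h>  // x86 AVX/SSE3 intrinsics

// Dot product of two 8-float vectors: one AVX register per operand.
float dot8(const float* a, const float* b) {
    __m256 prod = _mm256_mul_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)); // 8 multiplies at once

    // Horizontal sum of the 8 products.
    __m128 lo   = _mm256_castps256_ps128(prod);    // lower 4 lanes
    __m128 hi   = _mm256_extractf128_ps(prod, 1);  // upper 4 lanes
    __m128 sum4 = _mm_add_ps(lo, hi);              // 4 partial sums
    sum4 = _mm_hadd_ps(sum4, sum4);                // 2 partial sums
    sum4 = _mm_hadd_ps(sum4, sum4);                // final sum in lane 0
    return _mm_cvtss_f32(sum4);
}
```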
As for this bit:
> Unless I'm wrong, there are many other things SIMD can be useful for beyond multimedia, and it can help with traditional CPU tasks like integer operations, but I haven't found any info about SIMD being used outside of multimedia.
SIMD can totally be used for purposes beyond multimedia applications, and more modern SIMD instruction sets are increasingly being designed for general-purpose computing and for specific applications like machine learning and cryptography. In fact, I have a hobby project that is all about facilitating the use of SIMD instructions for general-purpose tasks. I'm very often able to get substantial improvements in performance by using them for tasks that they normally aren't used for. For example, I recently made some functions that can compute the remainders of division for floating-point numbers at a throughput that is ~200 times greater than non-SIMD code.
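The remainder functions mentioned above are specific to that hobby project, but as a separate, generic illustration of SIMD on a non-multimedia, plain-integer task, here's a sketch that counts how many bytes in a buffer match a value, 16 bytes per step, using SSE2 intrinsics (`__builtin_popcount` assumes GCC or Clang):

```cpp
#include <immintrin.h>  // x86 SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Counts how many bytes in `data` are equal to `value`, 16 bytes at a time.
std::size_t count_byte(const std::uint8_t* data, std::size_t n, std::uint8_t value) {
    std::size_t count = 0;
    const __m128i needle = _mm_set1_epi8(static_cast<char>(value));

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        __m128i eq    = _mm_cmpeq_epi8(chunk, needle); // 0xFF where bytes match, 0x00 otherwise
        int mask      = _mm_movemask_epi8(eq);         // one bit per byte lane
        count += __builtin_popcount(mask);             // count the set bits (GCC/Clang builtin)
    }
    for (; i < n; ++i)                                 // scalar tail for the last few bytes
        count += (data[i] == value);
    return count;
}
```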
Other answers about the cost of copying data are on point, but you should also understand that CPUs and GPUs are designed for different kinds of programs.
CPUs are optimized for 'sequential' programs, where there is very little work available to do in parallel. GPUs are the opposite: they are built for programs with TONS of available parallelism.
What that means is a GPU is really really really bad at running sequential programs. You want those on the CPU. But sometimes, within the sequential program, there are short bursts of parallelism. It would be nice to be able to run those as efficiently as possible. As others have said, copying the data to a GPU is too expensive. So SIMD instructions on CPUs help you get some performance at low cost.
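You often get this benefit without writing intrinsics at all. As a minimal illustration (mine, with no special assumptions beyond an optimizing compiler), a plain loop like the one below will typically be auto-vectorized into SIMD instructions by GCC or Clang at -O3, though whether that actually happens depends on the compiler and flags:

```cpp
#include <cstddef>

// A short "burst of parallelism" inside otherwise sequential code: scaling a
// buffer in place. Optimizing compilers will usually turn this loop into SIMD
// instructions on their own -- no GPU, no threads, no explicit intrinsics.
void scale(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;
}
```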
Note that, historically, many SIMD machines were big vector processors that are much closer to modern GPUs in their design and target applications than current CPUs. GPUs are technically ‘SIMT’ processors, but the distinction between SIMD and SIMT is really inconsequential for your question, and you can accurately think of a GPU as a SIMD processor.