One optimization that can make a difference is making sure that you lay out and process your data in an order that is cache-friendly. You would want to try to always access data that is close together in memory, so that you’re using cached data instead of loading from RAM each time. Even the order of the for loops can affect this
The naive method (2 loops) do more addition and multiplications than actually necessary, and this is a mathematical fact, so regardless of hardware or languages, it’s always slow. More clever method rearrange it so that you mix up some additions operations before multiplications to reduce calculations.
As an analogy, imagine you want to calculate ac+ad+bc+bd. The naive direct method requires 4 multiplications and 3 additions, but by knowing that this is the same as (a+b)(c+d), you can reduce to 2 additions and 1 multiplications.
But in addition to purely mathematical improvement, you can also optimize performance based on hardware. Specifically, parallel computing. Matrix multiplications involves a lot of calculations that are independent: computations that do not depend on the result of each other, and hence can be done at the same time in parallel, rather than waiting for one to finish before the next. However, a computer might not be able to figure out that your computations can be done in parallel (especially if you use an imperative language instead of a functional language); you need to tell it as such, and then it knows to make use of the hardware for parallel computing. Even better, since matrix multiplication is such a common operation, many computer have hardware specifically designed for it, so that it can do a large amount of parallel computing when it comes to this specific task.
Finally, there is also the issue of caching. For a lot of modern computer, the slowness of computations come from the time it takes to retrieve information from the standard memory, RAM. To improve performance on that, the computer made use of cache, a form a memory that it can access much quicker. However, space of cache is limited, so if it runs out of cache and it needs to store more data, it will have to store some data on cache back to RAM and acquire new data from RAM into cache. Because of this, it’s inefficient to force the computer to keep changing the number it’s computing on. Instead, it’s a lot more efficient for the computer to use the same numbers repeatedly in a row before finally being forced to change. An optimized algorithm would make sure to rearrange the order the computations are being done so that the computer use the same number in a large continuous block of time.
Latest Answers