Nvidia’s RT and Tensor Cores.


Hello, I’m a person who likes hardware quite a lot and always tries to stay informed about it, and over these past couple of years, with Nvidia’s new technologies, I’ve kind of been struggling to fully understand how these new types of cores work.

I understand the basic concept: RT cores work for Ray Tracing, and Tensor cores for Deep Learning Super Sampling, but I want to understand what makes these cores better at their jobs compared to normal CUDA cores or AMD’s Stream Processors (I know those two are quite different, but I understand they act similarly).

I’ve tried to read up, but I keep running into things like:

* “4 x 4 matrices”
* “4 x 4 FP 16/FP 32 matrices”

And I have no idea what that means. I think it’s a way of doing calculations and math, but I’m not sure. That’s specific to Tensor cores though, not RT cores, but to be honest I’m a lot more interested in Tensor cores because I’ve been seeing how DLSS 2.0 has evolved, and it has come a HUGE way since DLSS 1.0, probably outperforming most types of AA available right now. (Although I know it’s an upscaling tool rather than AA, or I think that’s what it was.)

So basically, could someone explain, in a simpler way for someone who doesn’t understand much “computer math”, **WHY** these cores are best at what they specifically do and **HOW** they do it?

Thanks a ton! Hope this explains well what I wanted to know ^^.

In: Technology

3 Answers

Anonymous 0 Comments

First off, FP16, FP32, etc. all refer to floating point numbers. Floating point is one way to store decimals/fractions on computers. Think of it like scientific notation, but slightly different so it works in binary. The number after FP refers to the size of the number, so FP32 is a 32-bit floating point number. The more bits you have, the more precise the number.
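As a rough illustration (a minimal sketch using NumPy’s float16/float32 types, which follow the same idea as the GPU formats), you can see how fewer bits means less precision:

```python
import numpy as np

x = 3.14159265358979  # a value with more digits than either format can hold

fp16 = np.float16(x)  # 16-bit float: roughly 3 decimal digits of precision
fp32 = np.float32(x)  # 32-bit float: roughly 7 decimal digits of precision

print(fp16)  # prints something like 3.14 -- the rest of the digits are lost
print(fp32)  # prints something like 3.1415927
```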

I am not super familiar with Nvidia’s architectures, but I did glance through the architecture page about Ampere. So most of what I’ll say is generic, but should relate back to what you’ve read about Ampere.

Computers are pretty limited in how they do math. Every operation is limited to a certain number of digits. Think of it like doing math, but each number is limited to 3 digits. When asked to add 1100 + 900, you have to break it into 2 parts. First we do 100 + 900. We get 000 and carry the 1. Then we do 001 + the carried 1. Concatenate the answers from each part and we get 2000. Because we were limited to 3 digit numbers, the operation took twice as long. If we had 6 digits to work with, we could do the operation in one step.
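Here’s that same idea in code (a toy sketch, not how real hardware does it): adding two numbers when each step is only allowed to handle 3 digits.

```python
# Toy example: add 1100 + 900 when each step can only handle 3 digits.
LIMIT = 1000  # 3 decimal digits per "chunk"

def add_3_digits_at_a_time(a, b):
    # Split each number into a low 3-digit chunk and a high chunk.
    a_hi, a_lo = divmod(a, LIMIT)
    b_hi, b_lo = divmod(b, LIMIT)

    # Step 1: add the low chunks; anything past 3 digits becomes a carry.
    carry, low = divmod(a_lo + b_lo, LIMIT)   # 100 + 900 -> low 000, carry 1

    # Step 2: add the high chunks plus the carry.
    high = a_hi + b_hi + carry                # 1 + 0 + 1 -> 2

    # "Concatenate" the two parts back together.
    return high * LIMIT + low                 # 2000

print(add_3_digits_at_a_time(1100, 900))  # 2000
```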

That’s what Nvidia has done. They added a lot more FP64 units, so you can do 64-bit floating point math directly instead of breaking it across 32-bit hardware, and there are a lot more of them, so you can do more operations simultaneously. They also added more 16- and 32-bit arithmetic units, again so more operations can be performed simultaneously.

In addition, they’ve added hardware support for matrix math. Matrices are basically tables of numbers. They’re often used to solve systems of equations (see linear algebra), which are very common in AI. The concept is similar to what I mentioned before: what used to take multiple operations to perform can now be done in one.
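For example (a small sketch with NumPy, just to show what “solving a system with a matrix” means in practice), the two equations 2x + y = 5 and x + 3y = 10 can be written as one matrix and solved in a single call:

```python
import numpy as np

# The system:
#   2x + 1y = 5
#   1x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])      # the coefficients, as a matrix
b = np.array([5.0, 10.0])

solution = np.linalg.solve(A, b)  # one call instead of doing the algebra by hand
print(solution)  # [1. 3.]  ->  x = 1, y = 3
```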

Anonymous 0 Comments

[Matrices](https://en.wikipedia.org/wiki/Matrix_(mathematics)) are a mathematical tool that can be used for a bunch of stuff.

They are everywhere in machine learning/AI, and quite common in 3D rendering (most transforms in 3D space can be represented using a 4×4 matrix).

FP16 and FP32 are types of numbers. FP means [floating point](https://en.wikipedia.org/wiki/IEEE_754), which is how decimal numbers are represented (most of the time) by computers. 16 or 32 is the size of these numbers: FP16 is a 16-bit floating point number, and FP32 is a 32-bit floating point number.

So a 4×4 FP32 matrix is a matrix made of 16 (4×4) 32-bit floating point numbers.
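To make that concrete, here’s a rough NumPy sketch of roughly the kind of operation Nvidia describes Tensor cores performing: multiply two small FP16 matrices and accumulate the result into FP32 (the exact hardware details are simplified away here).

```python
import numpy as np

# Two 4x4 matrices of 16-bit floats (16 numbers each)...
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)

# ...and a 4x4 matrix of 32-bit floats to accumulate into.
C = np.zeros((4, 4), dtype=np.float32)

# D = A x B + C: multiply the small half-precision numbers,
# add everything up in full precision. A Tensor core does this
# whole thing as one hardware operation; NumPy does it in a few calls.
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.shape, D.dtype)  # (4, 4) float32
```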

> understand WHY these cores are best at what they specifically do and HOW they do it?

Doing ML requires doing a **shitton** of matrix math. Tensor cores are basically specialized circuits that do matrix operations. Using dedicated circuits is faster than using general purpose processors^[1], which is why Tensor cores make DLSS and other ML-based techniques much faster.

**************

[1] There are several reasons why this is the case. One is that dedicated circuits can do complex operations “in one go”: they don’t have to read, decode, execute and store intermediate results and instructions. Another is that it frees the general purpose circuits to do other stuff.
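To see why ML is “a shitton of matrix math”, here’s a minimal sketch (made-up sizes, plain NumPy) of a single neural-network layer. It’s essentially just a matrix multiply plus a little extra, and a real network stacks many of these:

```python
import numpy as np

# A single, tiny neural-network layer: 8 inputs -> 4 outputs.
# (Sizes are made up; real DLSS-style networks are vastly bigger.)
weights = np.random.rand(4, 8).astype(np.float16)
bias = np.random.rand(4).astype(np.float16)

def layer(x):
    # The heavy lifting is one matrix-vector multiply --
    # exactly the kind of work Tensor cores are built for.
    return np.maximum(weights @ x + bias, 0)  # ReLU activation

x = np.random.rand(8).astype(np.float16)
print(layer(x))  # 4 output values
```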

Anonymous 0 Comments

Sometimes you just need one number to say what you mean, e.g. you have 3 apples. There’s only one dimension of numerical information there.

In 3D computer graphics you need 3 dimensions of numerical information. You need numbers for the X, Y, and Z axes. So any point in the 3D space is represented by a set of 3 numbers, like e.g. (1,2,3).

But whole numbers (integers) like those aren’t precise enough, so you need to use floating point numbers (numbers with decimal places). And those are much slower to work with than integers.

It starts getting complicated when you need to work with groups of points in 3D space. For example, a triangle needs 3 groups of 3 numbers: one group of 3 for each corner, and 3 corners to describe the triangle. So you end up needing a data structure that represents that. That’s where matrices come in.

The simplest data structure is a list (called an array). It might look like (1,2,3). A matrix is a more complex type of data structure. It’s basically a table, or a list of lists. You could use it to represent a triangle in 3D space. A 3×3 matrix (3 corners, each with 3 coordinates) might look like:

1,1,1
5,1,4
1,5,4

Imagine those numbers are actually floating points and you need to multiply that matrix by another matrix. It’s going to take a lot of work. That’s where tensor cores come in. They’re designed to multiply matrices made of floating point numbers, and they do it many times faster than the ‘old-fashioned’ CUDA cores. That’s all they do. They have a single purpose, and so they can be made to operate very quickly.
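As a loose analogy (a rough sketch in Python; on a GPU the speed gap comes from dedicated silicon, not from a library call, but the principle of “specialized beats general-purpose” is similar), compare a general-purpose loop doing a matrix multiply against a routine built for exactly that job:

```python
import time
import numpy as np

n = 200
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

# The "do everything one small step at a time" way: triple nested loop.
def slow_matmul(A, B):
    C = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

t0 = time.perf_counter(); slow_matmul(A, B); t1 = time.perf_counter()
t2 = time.perf_counter(); _ = A @ B;         t3 = time.perf_counter()

print(f"general-purpose loop: {t1 - t0:.3f} s")
print(f"specialized routine:  {t3 - t2:.5f} s")  # typically orders of magnitude faster
```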

Thinking of triangles in a 3×3 matrix is a simplification. Without getting too complicated, you actually need 4×4 matrices. But a matrix is a multi-dimensional data structure, and a tensor core multiplies them together very quickly.
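For the curious, here’s why 4×4 keeps coming up in 3D graphics (a small sketch; the extra row and column let a single matrix encode movement as well as rotation and scaling): a point (x, y, z) gets a fourth coordinate of 1, and multiplying by a 4×4 matrix moves it.

```python
import numpy as np

# A 4x4 matrix that moves ("translates") a point by (+5, +2, -1).
transform = np.array([
    [1, 0, 0,  5],
    [0, 1, 0,  2],
    [0, 0, 1, -1],
    [0, 0, 0,  1],
], dtype=np.float32)

point = np.array([1, 2, 3, 1], dtype=np.float32)  # (x, y, z) plus a trailing 1

print(transform @ point)  # [6. 4. 2. 1.]  ->  the point has moved
```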

Thinking of tensor cores multiplying matrices that represent triangles is a simplification too, because they don’t actually have access to all of the information required for that. That’s why they’re used to speed up the DLSS process instead, and that process involves multiplying matrices as well (which represent pixels and motion).

So, tensor cores are better at multiplying matrices because that’s all they were designed to do (that’s a bit reductive, but it’s essentially the reason). And how they multiply matrices *differently and better* than previous methods comes down to their design not needing as many separate operations (like reading from and writing to memory, registers, and caches).

Each operation takes time and energy, and if some specialised piece of silicon is designed to only work with data given to it in a certain way (in a matrix), and is only expected to do one thing (multiply those matrices), it can be highly optimised for that task. As well as that, the tensor cores are not just multiplying one matrix against another: you can feed in up to 256 groups of three 4×4 matrices in each clock cycle. Each tensor core is chomping on matrices.
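As a final sketch (the per-clock figures above are the answerer’s numbers; this only illustrates the idea of batching many small matrix jobs into one call), modern libraries let you hand over a whole stack of small matrices at once rather than one at a time:

```python
import numpy as np

# A stack of 256 independent 4x4 matrix-multiply jobs, done in one call.
A = np.random.rand(256, 4, 4).astype(np.float16)
B = np.random.rand(256, 4, 4).astype(np.float16)

D = np.matmul(A, B)   # multiplies each A[i] by its matching B[i]
print(D.shape)        # (256, 4, 4)
```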