Nvidia’s RT and Tensor Cores.

Hello, I’m someone who likes hardware quite a lot and always tries to stay informed about it, but over the past couple of years I’ve been struggling to fully understand how Nvidia’s new types of cores work.

I understand the basic concept: RT cores handle ray tracing, and Tensor cores handle Deep Learning Super Sampling (DLSS). But I want to understand what makes these cores better at their jobs than normal CUDA cores or AMD’s Stream Processors (I know those two are quite different, but I understand that they act similarly).

I’ve tried to read up on it but have run into things like:

* “4 x 4 matrices”
* “4 x 4 FP 16/FP 32 matrices”

And I have no idea what that means; I think it’s a way of doing calculations and math, but I’m not sure. That’s specific to Tensor cores, not RT cores, but to be honest I’m a lot more interested in Tensor cores, because I’ve been watching how DLSS has evolved: DLSS 2.0 has come a HUGE way from DLSS 1.0, probably outperforming most types of AA available right now. (Although I know it’s an upscaling tool rather than AA, or I think that’s what it is.)

So basically, could someone explain, in a way someone who doesn’t understand much “computer math” can follow, **WHY** these cores are so good at what they specifically do and **HOW** they do it?

Thanks a ton! Hope this explains well what I wanted to know ^^.

In: Technology

3 Answers

Anonymous 0 Comments

Sometimes you just need one number to say what you mean, e.g. you have 3 apples. There’s only one dimension of numerical information there.

In 3D computer graphics you need 3 dimensions of numerical information: numbers for the X, Y, and Z axes. So any point in the 3D space is represented by a set of 3 numbers, like (1, 2, 3).

But whole numbers (integers) like those aren’t precise enough, so you need to use floating point numbers (numbers with decimal places). And those are slower to work with than integers.
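(As an aside, the “FP16” and “FP32” from the question are just floating point numbers stored in 16 or 32 bits; fewer bits means less precision but faster, cheaper math. Here’s a minimal CUDA sketch, with illustrative values, showing how much precision each size keeps:)

```
#include <cuda_fp16.h>
#include <cstdio>

// FP32 keeps ~7 significant digits; FP16 keeps only ~3.
// Tensor cores exploit this: smaller numbers mean cheaper multiplies.
__global__ void precision_demo() {
    float f32 = 3.14159265f;                    // FP32: 32 bits per number
    half  f16 = __float2half(f32);              // FP16: 16 bits per number
    printf("FP32: %.8f\n", f32);                // prints 3.14159274 (rounded)
    printf("FP16: %.8f\n", __half2float(f16));  // prints 3.14062500
}

int main() {
    precision_demo<<<1, 1>>>();  // a single thread is enough for a demo
    cudaDeviceSynchronize();
    return 0;
}
```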

It starts getting complicated when you need to work with groups of points in 3D space. A triangle, for example, needs 3 groups of 3 numbers: one group for each corner, and 3 corners to describe the triangle. So you end up needing a data structure that represents that. That’s where matrices come in.

The simplest data structure is a list (called an array). It might look like (1, 2, 3). A matrix is a more complex data structure: it’s basically a table, or a list of lists. You could use it to represent a triangle in 3D space. A 3×3 matrix (3 rows of 3 numbers) might look like:

1,1,1
5,1,4
1,5,4
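
In code that might look like the sketch below (plain C-style arrays, written as CUDA here just to stay in one language; the numbers match the example above):

```
// A triangle in 3D space as a 3x3 matrix of floats:
// one row per corner, one column per axis (X, Y, Z).
float triangle[3][3] = {
    {1.0f, 1.0f, 1.0f},  // corner 1
    {5.0f, 1.0f, 4.0f},  // corner 2
    {1.0f, 5.0f, 4.0f},  // corner 3
};
```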

Imagine those numbers are actually floating point numbers and you need to multiply that matrix by another matrix. It’s going to take a lot of work. That’s where tensor cores come in. They’re designed to multiply matrices made of floating point numbers, and they do it multiple times faster than the ‘old-fashioned’ CUDA cores. That’s all they do. They have a single purpose, and so they can be made to operate very quickly.

Thinking of triangles as 3×3 matrices is a simplification. Without getting too complicated, 3D graphics actually uses 4×4 matrices (that’s the “4 x 4 matrices” from the question). Either way, a matrix is a multi-dimensional data structure, and a tensor core multiplies them together very quickly.
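To get a feel for how much work even one small multiply is, here’s a sketch of a 4×4 matrix multiply done the plain, one-step-at-a-time way (roughly the kind of work general-purpose cores do; a tensor core instead performs the entire fused multiply-accumulate as a single hardware operation):

```
// Naive 4x4 matrix multiply: 64 multiplies and 48 additions,
// performed one at a time, with reads and writes along the way.
// A tensor core does a whole 4x4 multiply-accumulate in one go.
void matmul4x4(const float A[4][4], const float B[4][4], float C[4][4]) {
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k) {
                sum += A[row][k] * B[k][col];  // one multiply + one add
            }
            C[row][col] = sum;
        }
    }
}
```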

Thinking of tensor cores as multiplying matrices that represent triangles is a simplification too, because they don’t actually have access to all of the information required for that. That’s why they’re used to speed up the DLSS process instead; that process involves multiplying matrices as well (ones which represent pixels and motion).

So, tensor cores are better at multiplying matrices because that’s all they were designed to do (that’s a bit reductive, but it’s essentially the reason). And they multiply matrices *differently and better* than previous methods because their design doesn’t need as many separate operations (like reading from and writing to memory, registers, and caches).

Each operation takes time and energy, and if a specialised piece of silicon is designed to only accept data in a certain shape (a matrix), and is only expected to do one thing (multiply those matrices), it can be highly optimised for that task. On top of that, the tensor cores aren’t just multiplying one matrix against another: you can feed in up to 256 groups of three 4×4 matrices each clock cycle. Every tensor core is constantly chomping on matrices.
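
If you’re curious what that looks like from the programmer’s side, below is a minimal sketch using CUDA’s WMMA (warp matrix multiply-accumulate) API, which is how code hands matrices to the tensor cores. The 16×16 tile is one of the supported sizes (the hardware internally splits it into the small per-core operations described above), and the kernel must be launched with a full warp of 32 threads that cooperate on the tile:

```
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile A by a 16x16 FP16 tile B and
// accumulates into FP32 -- the "FP16 in, FP32 out" tensor core operation.
// Requires a GPU with tensor cores (Volta or newer).
__global__ void tensor_core_matmul(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start C at zero
    wmma::load_matrix_sync(a_frag, a, 16);           // load tile A
    wmma::load_matrix_sync(b_frag, b, 16);           // load tile B
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C = A*B + C on tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
// Launch as: tensor_core_matmul<<<1, 32>>>(a, b, c);
```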
