# How does “monocular depth estimation” work in Tesla FSD beta?

On its [AI page](https://www.tesla.com/AI), Tesla states:

“Our per-camera networks analyze raw images to perform semantic segmentation, object detection and monocular depth estimation.”

How does monocular depth estimation with cameras work? Humans use two eyes to estimate depth (binocular depth), so how is Tesla FSD able to estimate distance with just one camera?

Humans rely on binocular vision for depth estimation only out to a few meters, basically the range at which you work with things with your hands. Beyond that, the brain uses other context clues, such as the expected size of an object, to estimate depth.

For example, if you see a stapler and it looks to be the size of your pinky held at arm’s length, your brain will interpret it as being 5 meters or so away.

But if the stapler were giant and 100 meters away, you would incorrectly judge the distance (in the absence of other context clues) as 5 meters, even though it is much farther away.
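To make the stapler example concrete, here is a minimal sketch of the "expected size" cue. It is only an illustration of the geometry, not Tesla's method; the function names and the 0.15 m stapler length are assumptions made up for this example.

```python
import math

def angular_size(true_size_m, distance_m):
    """Angle (in radians) that an object of a given size subtends at the eye."""
    return 2.0 * math.atan(true_size_m / (2.0 * distance_m))

def estimate_distance(angular_size_rad, assumed_size_m):
    """Distance implied by an observed angle, given the size we EXPECT the object to be."""
    return assumed_size_m / (2.0 * math.tan(angular_size_rad / 2.0))

# Assumption for illustration: an ordinary stapler is about 0.15 m long.
ASSUMED_STAPLER_M = 0.15

# An ordinary stapler 5 m away, and a giant 3 m stapler 100 m away,
# subtend the same angle, so the size cue gives the same answer for both.
normal = estimate_distance(angular_size(0.15, 5.0), ASSUMED_STAPLER_M)
giant = estimate_distance(angular_size(3.0, 100.0), ASSUMED_STAPLER_M)

print(f"ordinary stapler: estimated {normal:.1f} m (true distance 5 m)")
print(f"giant stapler:    estimated {giant:.1f} m (true distance 100 m)")
```

Both estimates come out at about 5 m, which is exactly the mistake described above: with only the size cue, a wrong assumption about the object's real size gives a wrong distance.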

Tesla probably uses a similar approach.

Googling “monocular depth estimation” produces tons of papers on the topic.

Humans also use monocular depth estimation, because our binocular vision is only effective out to about 16 feet (roughly 5 meters). Beyond that, the distance between our eyes is insignificant compared to the distance to the object, and our brains can no longer detect any real difference between the two images.
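A rough way to see why the binocular cue fades is to compute the angle between the two eyes' lines of sight to an object at various distances. This is just a geometric sketch; the 0.065 m eye separation is an assumed typical value and the function name is made up for the example.

```python
import math

def vergence_angle_deg(distance_m, eye_separation_m=0.065):
    """Angle between the two eyes' lines of sight to a point straight ahead.

    eye_separation_m is an assumed typical value, not a measured one.
    """
    return math.degrees(2.0 * math.atan(eye_separation_m / (2.0 * distance_m)))

for d in (0.5, 2.0, 5.0, 20.0, 100.0):
    print(f"{d:6.1f} m -> {vergence_angle_deg(d):.3f} degrees")
```

The angle shrinks roughly in proportion to 1/distance, so the difference between the two images becomes very small well before an object is far away in driving terms.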

Now, monocular depth estimation uses various clues to determine relative distance. One clue is obscuration (occlusion): objects that obscure other objects are closer. Another clue is relative size: of two similar things, the one that looks bigger is presumed to be closer. Related to that, we know how big certain objects tend to be and can estimate distance from how big they appear compared with how big we expect them to be. A third clue is detail: the more detail we perceive, the closer something is. A fourth is relative motion: as the viewer moves, closer objects show a greater change in position and seem to move faster than farther ones. These are just some of the ones I remember, but they cover most of the bases.
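The relative motion clue can be sketched with a simple pinhole camera model. This assumes a camera that slides sideways past stationary objects, which is a simplification and not a description of what Tesla's networks actually do; the focal length, the camera shift, and the function names are all hypothetical.

```python
def parallax_shift_px(depth_m, camera_shift_m, focal_px):
    """Image shift (in pixels) of a stationary point when the camera moves sideways."""
    return focal_px * camera_shift_m / depth_m

def depth_from_parallax(shift_px, camera_shift_m, focal_px):
    """Invert the relation above: a larger shift means a closer point."""
    return focal_px * camera_shift_m / shift_px

# Hypothetical values chosen only for illustration.
FOCAL_PX = 1000.0       # focal length expressed in pixels
CAMERA_SHIFT_M = 0.5    # sideways camera movement between two frames

near = parallax_shift_px(5.0, CAMERA_SHIFT_M, FOCAL_PX)
far = parallax_shift_px(50.0, CAMERA_SHIFT_M, FOCAL_PX)

print(f"point at 5 m shifts {near:.0f} px, point at 50 m shifts {far:.0f} px")
print(f"depth recovered from the near shift: {depth_from_parallax(near, CAMERA_SHIFT_M, FOCAL_PX):.1f} m")
```

The nearer point moves ten times as far across the image as the one ten times farther away, which is why a single moving camera can still recover depth from motion.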
