On its website [here](https://www.tesla.com/AI), Tesla states:
“Our per-camera networks analyze raw images to perform semantic segmentation, object detection and monocular depth estimation.”
How does monocular depth estimation with cameras work? Humans use two eyes to estimate depth (binocular depth), so how is Tesla FSD able to estimate distance with just one camera?
Humans only use binocular vision for depth estimation out to a few meters, basically for working with things with your hands. Beyond that, the brain uses other context clues, such as the expected size of an object, to estimate depth.
For example, if you see a stapler and it looks to be the size of your pinky held at arm’s length, then your brain will interpret it to be 5 meters or so away.
But if the stapler were giant and 100 meters away, you would incorrectly judge the distance (in the absence of other context clues) as 5 meters, even though it is much farther away.
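To put rough numbers on the stapler example, here is a minimal sketch of the size cue using the simple pinhole-camera relation (distance = focal length × real size / size in the image). The focal length, the 15 cm stapler length, and the 3 m giant stapler are made-up values for illustration only, and this is not a claim about how Tesla's networks actually work.

```python
def estimate_distance_m(assumed_size_m, apparent_size_px, focal_length_px):
    """Pinhole-camera estimate: distance = focal length * real size / image size."""
    return focal_length_px * assumed_size_m / apparent_size_px


FOCAL_LENGTH_PX = 1000.0  # hypothetical camera focal length, in pixels
NORMAL_STAPLER_M = 0.15   # assumed length of an ordinary stapler
GIANT_STAPLER_M = 3.0     # hypothetical giant stapler, 20x larger

# How many pixels each stapler spans in the image.
normal_px = FOCAL_LENGTH_PX * NORMAL_STAPLER_M / 5.0    # ordinary stapler, 5 m away
giant_px = FOCAL_LENGTH_PX * GIANT_STAPLER_M / 100.0    # giant stapler, 100 m away

# Both span the same number of pixels, so an observer who assumes
# "staplers are about 15 cm long" gets the same distance for both.
guess_normal = estimate_distance_m(NORMAL_STAPLER_M, normal_px, FOCAL_LENGTH_PX)
guess_giant = estimate_distance_m(NORMAL_STAPLER_M, giant_px, FOCAL_LENGTH_PX)

print(f"ordinary stapler: looks {guess_normal:.1f} m away (truly 5 m)")
print(f"giant stapler:    looks {guess_giant:.1f} m away (truly 100 m)")
```

The two staplers produce the same image, so the size cue alone cannot tell them apart. That is why other context (the road surface, nearby objects, motion) matters.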
Tesla probably uses a similar approach, with the network learning these kinds of cues from training data.
Googling “monocular depth estimation” produces tons of papers on the topic.