On this website [here](https://www.tesla.com/AI), Tesla states:
“Our per-camera networks analyze raw images to perform semantic segmentation, object detection and monocular depth estimation.”
How does monocular depth estimation with cameras work? Humans use two eyes to estimate depth (binocular depth), so how is Tesla FSD able to estimate distance with just one camera?
Humans also use monocular depth estimation, because our binocular vision is only effective out to about 16 feet. Beyond that, the distance between our eyes is insignificant compared to the distance of the object, and our brains can no longer detect any real difference between the two images.
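A rough back-of-the-envelope calculation makes this concrete. Assuming an average interpupillary distance of about 6.5 cm (a figure not given in the answer, just a common estimate), the angle subtended at an object by the two eyes shrinks roughly in inverse proportion to distance, so the geometric difference between the two views quickly becomes tiny:

```python
import math

# Illustrative sketch only: how the vergence angle (the angular difference
# between the two eyes' lines of sight) falls off with distance.
EYE_SEPARATION_M = 0.065  # assumed average interpupillary distance

for distance_m in [0.5, 1.0, 5.0, 16 * 0.3048, 50.0]:
    # Angle subtended at the object by the two eyes (isoceles triangle geometry).
    vergence_rad = 2 * math.atan((EYE_SEPARATION_M / 2) / distance_m)
    print(f"{distance_m:6.2f} m -> {math.degrees(vergence_rad):6.3f} deg")
```

By around 16 feet (~4.9 m) the angle is already well under a degree, which is why stereo cues stop contributing much beyond that range.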
Now, monocular depth estimation uses various cues to determine relative distance:

- Obscuration: objects that obscure other objects are closer.
- Relative size: things that appear bigger are presumed to be closer. Related to that, we know how big certain objects tend to be, so we can estimate distance from how large they appear relative to that expected size (see the sketch below).
- Detail: the more detail we perceive, the closer something is.
- Relative motion: closer objects show a greater change in position and seem to move faster than farther ones.

These are just some of the ones I remember, but they pretty much cover all the bases.
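As a concrete illustration of the known-size cue, here is a minimal sketch using the pinhole camera relation distance ≈ focal length × real size / apparent size. The focal length and car width below are assumptions for illustration only; Tesla's network learns cues like this from data rather than applying the formula explicitly:

```python
# Hedged sketch of the "known size" depth cue with a pinhole camera model.
# All numbers are illustrative assumptions, not Tesla's actual camera
# parameters or method.
FOCAL_LENGTH_PX = 1000.0   # assumed focal length, in pixels
CAR_REAR_WIDTH_M = 1.8     # assumed typical width of a car

def distance_from_apparent_size(width_in_pixels: float) -> float:
    """Estimate distance to an object of known size from how large it appears."""
    return FOCAL_LENGTH_PX * CAR_REAR_WIDTH_M / width_in_pixels

print(distance_from_apparent_size(180))  # car spans 180 px -> ~10 m away
print(distance_from_apparent_size(45))   # car spans 45 px  -> ~40 m away
```

The same geometry explains why a car that appears half as wide in the image is roughly twice as far away.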