How can the camera on my phone identify when it’s seeing text or a face or something else?


I mean, isn’t everything the camera sees just ones and zeros? How do they program it to understand what’s what?


4 Answers

Anonymous 0 Comments

Most modern image recognition is done using machine learning.

The problem with “directly” programming a computer to recognize things like faces is that many of the real-world concepts we take for granted are actually really, really complex. Like, imagine trying to describe what a face looks like to an alien who has never been to Earth or seen a human before. It might go something like this:

**You**: A face has two eyes, two ears, a mouth, and a nose.

**Alien**: I see. What are these “eyes” of which you speak, and how may I come to recognize them?

**You**: Well, an eye has a circular part called an “iris”, which is usually blue, green, or brown in color, surrounded by white stuff.

**Alien**: Fascinating. A human confection called “blueberry parfait” has recently become quite popular on my planet. I had no idea it was made with eyes.

**You**: Uh, no. Although some eyes are blue, they’re never *that* shade of blue. They’re also glossier than blueberries and have wavy patterns running through them.

And so on. With a computer, you would also need to explain what “wavy” and “glossy” mean, exactly what shades of blue are acceptable, etc. And what happens if a face does *not* have two eyes visible, so our first statement doesn’t apply? What if lighting conditions cause eye whites to not actually appear white? Basically, trying to articulate exactly what makes something a face will inevitably lead to a never-ending series of explanations, clarifications, and exceptions, and there’s no way you’ll ever be able to think of everything.

Instead of designing a program specifically to recognize faces, a much better approach is to design a program that can learn how to recognize any kind of object, which you can then *teach* to recognize faces. The way you teach such a program to recognize a specific object is similar to how you would teach a human: by showing it things that *are* examples of that type of object, and things that are *not* examples of that object. In this case, pictures of faces and pictures of not-faces.
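To make that idea concrete, here is a toy sketch of a generic “learn from examples” program. The data is invented: instead of real images, each “picture” is boiled down to two made-up numbers, and the learner just remembers the average of each class. Real systems are vastly more sophisticated, but the teaching process is the same: you hand it labeled examples, not rules.

```python
# Toy "learn from examples" classifier. The same code can be taught any
# concept simply by changing the labeled examples it is trained on.

def train(examples):
    """examples: list of (features, is_positive) pairs.
    Returns the average ("centroid") of each class."""
    pos = [f for f, label in examples if label]
    neg = [f for f, label in examples if not label]
    avg = lambda group: [sum(vals) / len(group) for vals in zip(*group)]
    return avg(pos), avg(neg)

def predict(model, features):
    """Guess positive if the features sit closer to the positive centroid."""
    pos_center, neg_center = model
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return dist(features, pos_center) < dist(features, neg_center)

# Invented training data: each "image" reduced to two fictional features.
faces     = [([0.9, 0.8], True),  ([0.8, 0.9], True)]
not_faces = [([0.1, 0.2], False), ([0.2, 0.1], False)]
model = train(faces + not_faces)

print(predict(model, [0.85, 0.85]))  # → True  (face-like features)
print(predict(model, [0.15, 0.15]))  # → False (not-face-like)
```

Swap in labeled examples of dogs and not-dogs, and the very same program learns “dog” instead; nothing about faces is hard-coded.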

As for *how* these programs learn, state-of-the-art systems use models called “deep neural networks”. You can think of a neural network as an incredibly complex mathematical formula with millions of unknown values (called “parameters”) to be solved for. At first, all of these parameters are set to random numbers.
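Here is what “a formula with parameters” means, shrunk to toy scale. A real network has millions of parameters and takes millions of pixels as input; this made-up one has nine parameters and takes a two-number “image”:

```python
import math
import random

random.seed(0)
params = [random.uniform(-1, 1) for _ in range(9)]  # start with random numbers

def sigmoid(x):
    # Squashes any number into the range (0, 1), so it can act
    # like a confidence score.
    return 1 / (1 + math.exp(-x))

def network(pixels, p):
    """A tiny neural network: 2 inputs -> 2 hidden units -> 1 output.
    It is nothing but nested arithmetic steered by the parameters p."""
    h1 = sigmoid(p[0] * pixels[0] + p[1] * pixels[1] + p[2])
    h2 = sigmoid(p[3] * pixels[0] + p[4] * pixels[1] + p[5])
    return sigmoid(p[6] * h1 + p[7] * h2 + p[8])

# With random parameters, the output is meaningless: some number
# between 0 and 1 that has nothing to do with faces yet.
print(network([0.9, 0.8], params))
```

Training is then just the search for parameter values that make this formula output the right numbers.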

The model is trained by feeding it images in batches. For each image, the formula outputs a number representing how likely the model thinks it is that the image is a face. Initially, these outputs will be nonsense because we used random numbers for the parameters. However, using math (specifically calculus), we can work out how to tweak the parameters so that the model’s answers come closer to the correct answers. After being fed enough batches of images, the neural network will eventually learn what makes a face a face.
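The tweaking step above can be sketched in a few lines. This toy uses a single-layer formula and invented two-number “images”, and it estimates the calculus part numerically (real systems compute it analytically via backpropagation, which is far faster) — but the loop is the same: measure how wrong the answers are, then nudge every parameter in the direction that makes them less wrong.

```python
import math
import random

random.seed(1)

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def model(x, p):
    # A tiny formula whose output in (0, 1) means
    # "how confident the model is that the input is a face".
    return sigmoid(p[0] * x[0] + p[1] * x[1] + p[2])

def loss(p, batch):
    # How wrong the model is: squared gap between its answers
    # and the true labels, averaged over the batch.
    return sum((model(x, p) - y) ** 2 for x, y in batch) / len(batch)

# Invented training batch: two-number "images", label 1 = face.
batch = [([0.9, 0.8], 1), ([0.8, 0.9], 1),
         ([0.1, 0.2], 0), ([0.2, 0.1], 0)]

p = [random.uniform(-1, 1) for _ in range(3)]  # random starting parameters

eps = 1e-4
for step in range(2000):
    for i in range(len(p)):
        # Numerically estimate how the loss changes if parameter i
        # grows slightly, then move it the opposite way.
        p_up = p[:]
        p_up[i] += eps
        grad = (loss(p_up, batch) - loss(p, batch)) / eps
        p[i] -= 0.5 * grad  # 0.5 is the "learning rate"

print(model([0.85, 0.85], p))  # face-like input: should be near 1
print(model([0.15, 0.15], p))  # not-face input: should be near 0
```

After the loop, the same formula that started out spitting nonsense gives confident, correct answers on data like its training examples.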
