ELI5: How does Google ‘listen’ to you? Is there actually a program running in the background on your phone that records your voice and sends it to Google?

Anonymous

Speech-to-text is computationally intensive. Listening all the time for any and all possible words on a phone would drain the battery too quickly, as your phone would essentially be “always” thinking hard.

However, it’s a lot less work to just try to scan for the *one* short sound you’re explicitly looking for, and ignore everything that’s not a close match. “Did he say a word? Well it wasn’t the one I’m looking for so I don’t care. I won’t spend any time thinking about it.”

And it takes even less work to just recognize “there was a new sound of some sort just now.” That just means looking for spikes in volume.

There’s a principle in computing called short-circuit evaluation: you do the very quick check first, and you skip the more expensive check whenever the quick one has already proved the expensive one can’t be true.

So in this case that means you do things in this order:

Step 1 – Is there even a speech-like rise in volume at all?
(If “no”, then abort.)

Step 2 – Did it match the short wakeup sound I’m looking for?
(If “no”, then abort.)

Step 3 – Okay, now start engaging the computationally expensive speech-to-text algorithm to look at the rest of the sound.
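The three steps above can be sketched in a few lines of Python. This is only an illustration of the cheapest-check-first idea, not how any real assistant is implemented: `handle_audio` and the three `fake_...` stand-in detectors are hypothetical names invented for this example, and the "audio" is just a short list of made-up numbers.

```python
from typing import Callable, List, Optional

Chunk = List[float]


def handle_audio(
    chunk: Chunk,
    has_speech_like_volume: Callable[[Chunk], bool],
    matches_wake_sound: Callable[[Chunk], bool],
    speech_to_text: Callable[[Chunk], str],
) -> Optional[str]:
    """Run the checks cheapest-first and bail out as early as possible."""
    # Step 1: the cheapest check. Most of the day ends right here.
    if not has_speech_like_volume(chunk):
        return None
    # Step 2: a bit more work, but still only looking for one short sound.
    if not matches_wake_sound(chunk):
        return None
    # Step 3: only now pay for the expensive speech-to-text.
    return speech_to_text(chunk)


# Hypothetical stand-ins that count how often each stage actually runs.
calls = {"volume": 0, "wake": 0, "stt": 0}


def fake_volume_check(chunk: Chunk) -> bool:
    calls["volume"] += 1
    return max((abs(s) for s in chunk), default=0.0) > 0.1


def fake_wake_check(chunk: Chunk) -> bool:
    calls["wake"] += 1
    return chunk[:2] == [0.5, 0.9]  # pretend this pattern is the wake sound


def fake_speech_to_text(chunk: Chunk) -> str:
    calls["stt"] += 1
    return "pizza delivery near me"


quiet = [0.0, 0.01, 0.0]    # silent room
chatter = [0.3, 0.4, 0.2]   # people talking, no wake sound
command = [0.5, 0.9, 0.3]   # starts with the pretend wake sound

for chunk in (quiet, chatter, command):
    handle_audio(chunk, fake_volume_check, fake_wake_check, fake_speech_to_text)

print(calls)  # {'volume': 3, 'wake': 2, 'stt': 1}
```

The counts are the whole point: the cheap check ran three times, the wake-sound check twice, and the expensive step only once.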

So during the hours your phone sits in your pocket and the average volume of the room doesn’t really rise at all, your phone just sits on Step 1 over and over. “No rise in volume? Okay. How about now? No? Okay, how about now?” This doesn’t take much thinking at all. There’s a little bit of “smoothing” logic in there so sharp sounds like knocking on a table don’t wake it up. The volume rise has to last for a bit before it counts as being “maybe speech”.
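Here is a rough sketch of what that Step 1 “smoothing” could look like: measure the loudness of each short frame of audio, and only say “maybe speech” when enough frames in a row are loud. The function names, the `threshold` of 0.1 and the `min_loud_frames` of 5 are all assumptions made up for this example, and real detectors are likely more sophisticated than a plain loudness count.

```python
import math
from typing import Iterable, List


def rms(frame: List[float]) -> float:
    """Root-mean-square loudness of one short frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def sounds_like_speech(
    frames: Iterable[List[float]],
    threshold: float = 0.1,      # assumed loudness cutoff
    min_loud_frames: int = 5,    # assumed "has a bit of a duration" length
) -> bool:
    """True if at least min_loud_frames consecutive frames exceed threshold."""
    run = 0
    for frame in frames:
        if rms(frame) > threshold:
            run += 1
            if run >= min_loud_frames:
                return True
        else:
            run = 0  # a quiet frame resets the streak
    return False


silence = [0.0] * 4
knock = [silence] * 3 + [[0.9] * 4] + [silence] * 6   # one sharp spike
talking = [silence] * 2 + [[0.3] * 4] * 6             # sustained loudness

print(sounds_like_speech(knock))    # False
print(sounds_like_speech(talking))  # True
```

The knock is much louder than the talking, but it only lasts one frame, so it never builds up the streak needed to move on to Step 2.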

Then there’s chatter where people in the room say “Did you see that ludicrous display last night?” “What was Wenger thinking sending Walcott on that early?” This fits the volume pattern of speech, so this time your phone gets past Step 1 and on to Step 2: was there anything in there that sounded like the wakeup sound “Ok, Google” (or “Alexa,” or whatever your service uses as its wakeup sound)? When the answer is no, it still doesn’t bother engaging Step 3. It knows that “Did you see that ludicrous display last night?” doesn’t contain “Ok, Google”, but it doesn’t know what it *does* contain. It doesn’t know that it contained the word “ludicrous”, for example. It just knows it didn’t contain “Ok, Google”, so it ignored it.

Then someone says, “Ok, Google, pizza delivery near me”, and THIS time, it gets all the way to Step 3. It had the sustained volume pattern that speech has. It had the magic sound “Ok, Google”, so THEN it started running the speech-to-text algorithm, which is expensive and power-hungry, to work out the “pizza delivery near me” part of it. It’s power-hungry, but it only needs to do it for a few seconds and then it’s done, rather than leaving it on at all times.

To make all this possible, it also has a small rolling audio buffer that keeps the last several seconds of audio. It needs that because by the time it decides your sounds are worthy of speech-to-text, you’ve already said them a second ago. They’re in the past.
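A rolling buffer like that is easy to picture in code. Below is a minimal sketch using Python’s `collections.deque` with a `maxlen`, which throws away the oldest item automatically whenever a new one is added to a full buffer. The class name `RollingAudioBuffer` and the tiny sizes are made up for the example.

```python
from collections import deque
from typing import Deque, List


class RollingAudioBuffer:
    """Keeps only the most recent few seconds of audio frames."""

    def __init__(self, seconds: float, frames_per_second: int) -> None:
        capacity = int(seconds * frames_per_second)
        # With maxlen set, appending to a full deque drops the oldest frame.
        self._frames: Deque[List[float]] = deque(maxlen=capacity)

    def push(self, frame: List[float]) -> None:
        self._frames.append(frame)

    def snapshot(self) -> List[List[float]]:
        """Copy of what is currently held, oldest frame first."""
        return list(self._frames)


# Tiny example: room for 2 seconds at 2 frames per second = 4 frames.
buf = RollingAudioBuffer(seconds=2, frames_per_second=2)
for i in range(6):
    buf.push([float(i)])

print(buf.snapshot())  # [[2.0], [3.0], [4.0], [5.0]]
```

When the wake sound is finally recognized, the snapshot is what gets handed to speech-to-text, so the words you said a moment ago are still there to be processed.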

As to why it seems like it’s listening to you all the time in a creepy way, that’s because it’s *really good* at guessing from other context clues (in a way that really is creepy). Let me give an example.

A group of strangers and I were sitting at a table in a gaming store. We didn’t know each other and were there waiting for an event to start. We had no prior social contact, no Facebook links or anything like that. We were talking about movies. The subject of The Addams Family came up. We talked about how good the child actor who played Wednesday was at nailing the role. Then we moved on to other stuff. At no point did I google anything about it. At no point did I take my phone out of my pocket. And “Wednesday Addams” wasn’t a thing I had mentioned or looked up for years. But later that evening, as soon as I typed a “w” into Google’s search bar, the first autocomplete suggestion was “Wednesday Addams”, and I was like, “WHAAA?” It sure sounds like it’s listening in all the time, otherwise how would it know the subject was ever even mentioned?

Well, the answer was almost even worse than that. No matter how much you try to turn off “location tracking”, various services keep insisting on having it on to work at all. I hadn’t typed anything about “Wednesday Addams” at all, but one of the other strangers at the table had searched for “who played Wednesday Addams” on *their* phone, and then someone else at the table looked up the IMDB page for her. What Google had done is use the location tracking to conclude: “Two people who weren’t you performed Google searches on the same subject while they were in very close proximity to you. You remained in proximity to them for about half an hour. So you were probably having a conversation with them about it.”

For another example of a creepy but useful thing the location tracking does – it’s the reason Google Maps instantly knows where the traffic jams are, and reports the real current travel times along the roads, not the hypothetical ‘speed limit’ travel times. It’s because lots of the people in those cars happen to have Android phones, and their location tracking is on. Google deduces the traffic speed by watching those phones move.
