ELIF: How exactly does a computer transcribe interviews – what enables the audio to text conversion?

Title is pretty self-explanatory.

4 Answers

Anonymous 0 Comments

Often someone listens to it and types it out. It's not often done by computer. There are tons of online jobs where you do exactly this.

Source: I watch my roommate do this for a living.

Anonymous 0 Comments

The computer has a massive library of spoken words and text so it has a general idea of what a particular word is supposed to sound like. It will take a recorded bit of audio and look at how closely it matches other audio samples in its library. Based on that it will make a best guess at what the word is, but it’s not always accurate.

Just to be clear, the computer isn’t actually matching your speech against a giant library of sounds. It effectively takes a bunch of recordings of people saying the same word and builds a math formula from them. That formula takes your speech and gives a percentage chance that it is a match. The computer then uses those percentages to make a best guess at what you’re saying.
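To make the "math formula that gives a % chance" idea concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the two numbers per recording stand in for the many acoustic features a real system would use, and averaging the examples is only the simplest possible "formula", not how any particular product does it.

```python
import math

# Hypothetical training data: a few "recordings" per word, each reduced to
# two made-up numbers (real systems use many acoustic features per slice).
EXAMPLES = {
    "yes": [(1.0, 5.0), (1.2, 4.8), (0.9, 5.1)],
    "no": [(4.0, 1.0), (4.2, 1.1), (3.9, 0.8)],
}

def train(examples):
    """Boil each word's examples down to one average feature vector."""
    model = {}
    for word, samples in examples.items():
        count = len(samples)
        width = len(samples[0])
        model[word] = tuple(sum(s[i] for s in samples) / count for i in range(width))
    return model

def match_chances(model, features):
    """Return a chance between 0 and 1 per word; closer to the average scores higher."""
    scores = {
        word: -sum((f - m) ** 2 for f, m in zip(features, mean))
        for word, mean in model.items()
    }
    top = max(scores.values())
    # Turn the scores into positive weights, then scale them to sum to 1.
    weights = {word: math.exp(score - top) for word, score in scores.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

model = train(EXAMPLES)
chances = match_chances(model, (1.1, 4.9))   # a new, unseen "recording"
best_guess = max(chances, key=chances.get)   # the word with the highest chance
```

Note that the original recordings are not kept around for matching: once `train` has run, only the averages are used, which is the point the answer above is making.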

I’m glossing over a lot of details here, but this is the essence of machine learning and why companies like Amazon and Google are so interested in recording your conversations and gathering data on you. They use it to train computers to do things like this.

Anonymous 0 Comments

We’ve gotten very good (through AI learning, and having humans manually transcribe recordings) at teaching computers which sounds correspond to which words. It’s gotten good enough that computers can now transcribe on the fly. Try enabling closed captioning in your next Google Hangout to see it in action. The computer knows which sounds mean which words, and then passes its guess through a couple of other filters that use context to make it even more accurate.
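One way a context "filter" like that could work, as a rough sketch: when two words sound identical, look at the word that came before and prefer the pair that shows up more often in ordinary text. The counts and the function name below are invented for the example; real systems use far larger language models.

```python
# Hypothetical counts of how often one word follows another in some training text.
PAIR_COUNTS = {
    ("over", "there"): 50,
    ("over", "their"): 1,
    ("lost", "their"): 40,
    ("lost", "there"): 2,
}

def pick_with_context(previous_word, candidates):
    """candidates maps each sound-alike word to its chance (0 to 1) from the audio alone."""
    def score(word):
        # Add 1 so a pair never seen in training is unlikely rather than impossible.
        return candidates[word] * (PAIR_COUNTS.get((previous_word, word), 0) + 1)
    return max(candidates, key=score)

# The audio alone cannot tell these two apart.
sound_alike = {"there": 0.5, "their": 0.5}

after_over = pick_with_context("over", sound_alike)   # "over there"
after_lost = pick_with_context("lost", sound_alike)   # "lost their"
```

The audio score and the context score are simply multiplied here, which is an assumption made to keep the sketch short.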

Anonymous 0 Comments

The computer will take an audio file, split the waveform into small pieces, and try to figure out which phoneme is being said in each piece (a phoneme is one of the basic sound units that words are built from). Then it looks those phonemes up in a giant pronunciation database to guess the word, like “T AH M EY T OW” (tomato).
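Those two steps can be sketched in Python like this. It is a simplified illustration, not a real recognizer: the step that turns each slice of audio into a phoneme is skipped entirely, the two-word dictionary is a stand-in for the "giant database", and the phoneme spellings follow the style of the CMU Pronouncing Dictionary (an assumption on my part about which notation is meant).

```python
def split_into_frames(samples, frame_size):
    """Chop a list of audio samples into consecutive small pieces."""
    return [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]

# Tiny stand-in for the pronunciation database (phoneme sequence -> word).
PRONUNCIATIONS = {
    ("T", "AH", "M", "EY", "T", "OW"): "tomato",
    ("P", "AH", "T", "EY", "T", "OW"): "potato",
}

def guess_word(phonemes):
    """Pick the word whose pronunciation agrees with the heard phonemes in the most positions."""
    def overlap(pronunciation):
        return sum(a == b for a, b in zip(pronunciation, phonemes))
    best = max(PRONUNCIATIONS, key=overlap)
    return PRONUNCIATIONS[best]

frames = split_into_frames(list(range(10)), 4)            # 10 fake samples, 4 per frame
clean = guess_word(["T", "AH", "M", "EY", "T", "OW"])      # every phoneme heard correctly
noisy = guess_word(["T", "AH", "N", "EY", "T", "OW"])      # one phoneme misheard
```

Even with one phoneme misheard, the closest entry in the dictionary is still "tomato", which is why the guess can survive small mistakes in the earlier step.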