Nowadays it mostly works the way it does in our brains: the stream of sound is split into a set of frequencies, which is then fed to a neural network that has learned what each word sounds like (or, more precisely, what it looks like as a sequence of frequency sets). The neural network can compress, scale, or transform sounds, so it can recognize words even if the recording is noisy, or if they are spoken at a pitch characteristic of a particular person’s voice, or at a different tempo.
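To make the first step concrete, here is a rough sketch of how a sound can be turned into a "sequence of sets of frequencies" before any neural network sees it. It is only an illustration, not how any particular recognizer does it: the names `frame_spectrum` and `spectrogram`, the 1000 Hz test tone, and the frame sizes are all made up for the example, and real systems use a fast FFT plus extra processing instead of this slow, naive transform.

```python
import cmath
import math

def frame_spectrum(frame):
    """Magnitudes of a naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)  # keep only the non-negative frequencies
    ]

def spectrogram(samples, frame_size, hop):
    """Cut the signal into overlapping frames and return one spectrum per frame."""
    return [
        frame_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]

# Assumed test input: a pure 1000 Hz tone sampled at 8000 Hz (not real speech).
SAMPLE_RATE = 8000
FRAME_SIZE = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

spec = spectrogram(tone, frame_size=FRAME_SIZE, hop=32)

# The strongest frequency bin in the first frame, converted back to hertz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames, strongest frequency in first frame:", peak_hz, "Hz")
```

Each row of `spec` is one short slice of time and each column is one frequency band, so a spoken word becomes a small picture-like grid. That grid, rather than the raw waveform, is the kind of input the network would learn to recognize.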