Voice Thoughts from our COO, Dr. Martyn Farrows

The recent explosion in voice technology has been accompanied by the claim that devices and tools such as smart assistants (e.g. Alexa, Siri, Google Home) are now ‘listening’ to what we are saying. But is the ‘listening’ metaphor accurate or appropriate? Computers don’t have ‘ears’, so their ability to ‘listen’ is somewhat limited. In fact, the way that computers ‘process’ audio is very different to how we process sound as humans. Consequently, we need to consider voice technology through a different frame of reference.

The way humans listen is one of our many evolutionary wonders. The outer ears – the flaps of cartilage and skin stuck on the sides of our heads – collect sound waves and pass them to the eardrum, creating vibrations that are converted into electrical signals. These signals are picked up by the auditory nerve and sent to the brain, which interprets them as sound. That this process takes milliseconds and supports the continuous, parallel processing of multiple sounds (even when we are asleep) is a miracle of biological and neurological engineering.

This process also highlights a key quality of how we experience sound: it is ephemeral. Unless we use some sort of external recording device, there is no permanent record of it (other than our own ability to recall it from memory). This is why, when it comes to speech, we have all sorts of fun phrases that underline its fleeting nature – ‘between you and me’, ‘off the record’, ‘one person’s word against another’s’, and so on.

Then there’s the way we use technology to interpret speech, which is very different to a human ‘listening’. There are microphones to capture the sound waves (the ‘outer ear’), but these sound waves then need to be translated into digital data that a computer (the ‘brain’) can interpret. This happens in a number of steps: the sound waves are first digitized, then normalized to account for background noise, volume and rate of speech, before the signal is finally broken up into tiny segments that can be compared against the phonemes we use in language to make up words.
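To make those steps a little more concrete, here is a minimal sketch in Python of that front-end processing. The filename, frame length and hop size are illustrative assumptions, not a description of how any particular assistant works.

```python
# A minimal sketch of the steps described above, assuming a mono 16-bit WAV
# file called "utterance.wav" (the filename and frame sizes are illustrative).
import wave
import numpy as np

with wave.open("utterance.wav", "rb") as wav:
    sample_rate = wav.getframerate()
    raw = wav.readframes(wav.getnframes())

# 1. Digitize: the microphone's analogue signal is stored as 16-bit integer
#    samples; convert them to floating point for processing.
signal = np.frombuffer(raw, dtype=np.int16).astype(np.float32)

# 2. Normalize: scale the signal so that differences in volume between
#    recordings matter less to the downstream models.
peak = np.max(np.abs(signal))
if peak > 0:
    signal = signal / peak

# 3. Segment: break the signal into short overlapping frames (~25 ms),
#    the tiny slices that are compared against phoneme models.
frame_len = int(0.025 * sample_rate)
hop_len = int(0.010 * sample_rate)
frames = [
    signal[start:start + frame_len]
    for start in range(0, len(signal) - frame_len, hop_len)
]
print(f"{len(frames)} frames of {frame_len} samples each")
```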

To determine what was said, the computer needs to run powerful and complex statistical models and mathematical functions in near ‘real time’. The models themselves need to be ‘trained’ to improve speech recognition accuracy – often by humans using their ears to listen to and annotate samples of the recordings (hence the recent controversies about infringements of data privacy).

And this is where it gets interesting – and where speech technology starts to diverge from how we process speech as humans. Whilst some human ‘listening’ is involved to improve the accuracy of speech technology, the real output of the technology process is a transcript – a permanent record of what was (most likely) said. So, rather than being ephemeral, the spoken word is translated into a permanent digital record as a transcription output. In addition to the transcript, other permanent ‘secondary’ outputs are also possible based on the primary voice recording – such as speaker identification (our voice prints are as unique as our fingerprints), accent, emotions, etc.
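As a small illustration of that divergence – and not a description of how any specific assistant works – here is how a pretrained model such as the open-source Whisper library turns an ephemeral recording into a permanent text record. The filenames carry over from the sketch above and are purely illustrative.

```python
# Requires the openai-whisper package (and ffmpeg) to be installed.
import whisper

model = whisper.load_model("base")          # a pretrained speech recognition model
result = model.transcribe("utterance.wav")  # the same illustrative file as above

# The transcript is now ordinary data: it can be stored, indexed and shared
# long after the original sound waves have faded.
with open("utterance_transcript.txt", "w") as f:
    f.write(result["text"])
```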

Why is all of this important? Because voice technology enables us to generate permanent voice data from the original recording, which, when further analyzed (using techniques such as natural language understanding), can be used to infer meaning and intent and to gather detailed personal information about us. For example, every time we use a smart assistant we are using our voice to transmit information about ourselves. Our preferences, likes, behaviors, consumption patterns, voice prints … all incredibly valuable data and a rich source of personal information when aggregated together and bundled with other information such as location, time of day and other users’ voice data.
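The sketch below is a deliberately crude, rule-based illustration of the kind of inference that can be drawn from a stored transcript. Real natural language understanding systems use trained models; the categories and keywords here are invented purely for illustration.

```python
# A deliberately simple, rule-based sketch of how a transcript can be mined
# for personal signals. The transcript, categories and keywords are invented.
transcript = "play my workout playlist and order more coffee pods"

signals = {
    "music_preference": ["playlist", "song", "album"],
    "shopping_habit":   ["order", "buy", "reorder"],
    "routine":          ["workout", "commute", "bedtime"],
}

# Keep only the categories for which the transcript contains a matching keyword.
inferred = {
    label: [word for word in keywords if word in transcript]
    for label, keywords in signals.items()
}
inferred = {label: hits for label, hits in inferred.items() if hits}

# Even this crude matching turns a single utterance into profiling data:
# {'music_preference': ['playlist'], 'shopping_habit': ['order'], 'routine': ['workout']}
print(inferred)
```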

The proliferation of smart devices that collect voice recordings, combined with massive cloud-based processing power and ever-improving models, means that computers are capable of collecting, processing and sharing voice data on an unprecedented scale. Computers may not have ears and they may not be ‘listening’ – but they do have the potential to be incredibly powerful personal data accumulation tools in a world where the voice interface is becoming the norm.

We therefore need a different metaphor to understand how voice technology works. Whilst the utility of voice as an interface is well established, the technology itself is widely misconstrued and misunderstood. In subsequent posts, we will explore the implications of the explosion in voice interfaces – for our daily lives, our digital society and our data privacy – and what this means for how we design and communicate about voice experiences in the future.
