Welcome to Lesson 3 in our “Lessons from Our Voice Engine” series, featuring high-level insights from our Engineering and Speech Tech teams on how our voice engine works. This lesson is from Siva Reddy Gangireddy, a Senior Speech Recognition Scientist on our Speech Tech team.
To understand deep learning, we need a basic understanding of machine learning.
Machine learning is a family of algorithms that learn from data to make predictions and decisions without being explicitly programmed for each task. It usually involves training a model on large amounts of data so that it learns patterns, which it then uses to make predictions and decisions on new data. For example, the smart speakers we use in daily life rely on machine learning algorithms.
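To make the "train on data, then predict on new data" idea concrete, here is a minimal sketch in Python. It is a toy illustration only, not how our voice engine is built: the data points, the `train` and `predict` helpers, and the learning-rate and epoch settings are all assumptions chosen for the example. The model is a straight line fitted by gradient descent.

```python
def train(xs, ys, lr=0.01, epochs=5000):
    """Fit y ≈ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(model, x):
    """Apply the learned parameters to an input the model has not seen."""
    w, b = model
    return w * x + b


# Toy training data (assumed for illustration); it follows y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

model = train(xs, ys)
prediction = predict(model, 10.0)  # x = 10 never appeared in training
print(prediction)
```

Nobody wrote the rule "multiply by 2 and add 1" into the program. The model recovered it from the examples, which is what lets it give a sensible answer for an input it never saw.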
Deep learning is a form of machine learning that’s based on neural networks, a set of algorithms loosely inspired by how the human brain works. A network with more than three layers is generally considered a deep neural network, and the input is processed through those layers in turn to predict the desired output. Deep neural networks require large amounts of data and are used extensively in speech recognition and image recognition. At SoapBox Labs, our models are trained on thousands of hours of audio data and regularly evaluated on in-house datasets.
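As a rough sketch of what "processed through several layers" means, the Python below passes one input through a small stack of layers. Everything here is an assumption made for illustration: the layer sizes, the random (untrained) weights, the ReLU and softmax functions, and the four-number input standing in for audio features. It does not describe the architecture of our models.

```python
import math
import random


def dense(inputs, weights, biases):
    """One layer: a weighted sum of the inputs plus a bias, per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linearity applied between layers."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn the final layer's scores into probabilities that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights; a real network would learn these from data."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layers, features):
    """Feed the input through every layer in order."""
    activations = features
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activations, weights, biases))


rng = random.Random(0)
sizes = [4, 8, 8, 8, 3]  # input, three hidden layers, output (assumed sizes)
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

probs = forward(layers, [0.2, -0.1, 0.5, 0.3])
print(probs)
```

Because the weights are random, the output probabilities are meaningless here. Training is the step that adjusts those weights, using large amounts of data, until the output matches the desired prediction.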
The goal of speech recognition is to convert users’ speech to text. Because audio data varies widely (in pronunciation, accent and background noise, for example), machine learning algorithms are used to keep recognition accurate. Because of its superior performance, especially for understanding kids’ variable speech, deep learning is at the core of SoapBox Labs’ voice engine and solutions like fluency assessments. We also use deep learning to deliver wake word detection, voice activity detection (VAD), and end-to-end speech recognition for on-device use.
Continue to Lesson #4 on debiasing, or catch up on our previous “Lessons from Our Voice Engine”:
© SoapBoxLabs. 2021 – All Rights Reserved