This glossary is written by Dr. Amelia Kelly, VP of Speech Technology. 

Voice Technology for Education:
A Glossary of Key Terms

Voice technology is growing in power and popularity. Major innovations in voice technology and hardware have led to the widespread adoption of voice or smart assistants in recent years. Every day, millions of people use voice technology to request their favorite song, check the weather, or call a friend. Indeed, there are over 4 billion digital voice assistants in use in devices today, and, by some estimates, that number will grow to exceed the number of humans on the planet by the year 2024 . . .

It’s clear that voice technology is on the rise, but not just in our homes, cars and workplaces – it’s also becoming more prevalent in the classroom. Voice technology holds the potential to fundamentally transform how kids interact with their learning, and power completely new types of learning experiences. By removing common interface barriers such as typing or touchscreens, voice technology can enable a more authentic relationship between students and content – including pre-literate students or those with disabilities.


The first wave of next-generation, AI-driven, voice-enabled education technology is already landing in classrooms. Not surprisingly, the solutions recently introduced by Amplify and others have focused on early literacy and fluency: areas that benefit from voice technology’s ability to listen to and understand language, unlocking far more natural and nuanced ways of gauging comprehension than traditional quizzes and tests. Voice technology is also increasingly being used to support the process of “observational assessment,” and to help diagnose reading challenges at an earlier stage and while students are learning at home.


As voice technology begins to emerge as an indispensable tool in the teaching toolkit, it is vital that school and district administrators become more comfortable with the technology and, in particular, its use with children in learning environments. The voice technology that most of us are familiar with – i.e., the technology powering voice assistants like Alexa and Siri – was never designed with children in learning environments in mind, and even less so, children from underrepresented backgrounds.

This glossary of key terms is designed to help educators become more informed consumers of this exciting new technology and understand its potential to support all learners, both remotely and in the classroom.


Artificial Intelligence

A term that describes systems designed to carry out tasks autonomously, rather than being specifically programmed by humans. AI systems are generally underpinned by machine learning algorithms that “learn” the quickest way to achieve an optimum result.


Machine Learning

A subset of AI that trains computers on large amounts of data so they can carry out simple to complex tasks automatically and at scale. Machine learning algorithms are characterized by “learning” or “improving” with experience to achieve an optimum result.


Deep Learning

Deep learning is a branch of machine learning based on deep neural networks. Neural networks are used extensively for speech recognition, image recognition, and other pattern recognition problems. They scale well with current computing infrastructure and are often more accurate on these tasks than conventional machine learning algorithms. Deep neural networks are a specific type of neural network: they require enormous amounts of training data and have a multi-layered architecture that allows them to model complex behaviours, like human speech and language usage.


Voice Technology

An umbrella term that encompasses all of the technologies that allow users to interact with products, services and platforms using their voices. The underlying technologies that enable this are speech recognition (understanding human speech), speech synthesis (computers speaking aloud), natural language processing (reading and understanding human language) and machine translation (converting human speech from one language to another).

In the K-12 edtech context, voice technology is used to enable independent reading practice, language learning, dyslexia screening, learning feedback, and summative and formative assessment.


Automatic Speech Recognition/Speech Recognition/Speech-to-Text

Speech recognition (also called speech-to-text) allows digital devices to convert speech into text format. Once speech is transcribed, it is much easier for a device to understand the intent of the speaker. Words or concepts in the text can be used to trigger actions (e.g., turn off the lights, text my sister), or within the educational context, can be compared against a rubric to determine reading fluency or comprehension, for example. Other data measurements can also be returned to the user by speech recognition systems. They are capable of detecting when a user has started talking, for example. In an educational context, STT technology can provide time stamps for individual words, making it easy for a teacher to listen back to a particular word or phrase a student uttered as part of a reading assessment. These systems can also return confidence scores (pronunciation scores) at the utterance, word, and even the phoneme level. 
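
As a rough illustration of the word-level output described above, the sketch below models a transcript as a list of words with time stamps and confidence scores, then finds the audio span for a phrase so that a teacher could listen back to it. The `Word` type, its field names, the 0-to-1 confidence scale and the sample values are all assumptions made for illustration; they are not the output format of any particular speech recognition system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    start: float       # seconds from the start of the recording
    end: float         # seconds from the start of the recording
    confidence: float  # hypothetical 0.0 to 1.0 scale


def phrase_span(words: List[Word], phrase: str) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds of the first occurrence of phrase, or None."""
    target = phrase.lower().split()
    if not target:
        return None
    texts = [w.text.lower() for w in words]
    for i in range(len(texts) - len(target) + 1):
        if texts[i:i + len(target)] == target:
            return words[i].start, words[i + len(target) - 1].end
    return None


# Made-up sample result for a child reading "the cat sat down".
result = [
    Word("the", 0.00, 0.21, 0.98),
    Word("cat", 0.25, 0.60, 0.95),
    Word("sat", 0.72, 1.05, 0.61),
    Word("down", 1.10, 1.48, 0.90),
]

span = phrase_span(result, "sat down")
low_confidence = [w.text for w in result if w.confidence < 0.7]
```

Here `span` gives the segment of audio to replay, and `low_confidence` lists the words a teacher might want to review first. The 0.7 cut-off is arbitrary.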


Natural Language Processing

A subfield of AI that focuses on the interaction between computers and humans using human language; specifically, the ability of machines to derive meaning from language.



De-biasing

Intentional processes used to reduce or remove unintended bias in voice technologies, such as training machine learning algorithms on a varied, diverse and proportionately representative dataset of voices.

Artificial intelligence systems will reflect the conscious and unconscious biases of their creators and create poor – and often prejudicial – user experiences for underrepresented users. Machine learning algorithms are unique in that they carry out decisions based on what they’ve seen within a supplied dataset, rather than being explicitly programmed using a predetermined set of rules. Building a system based largely on data from one demographic will result in accurate performance of the speech recognition system for that sub-group, but inaccurate performance for all others. 

A biased system can amplify and propagate deep-seated prejudices held by the designers of that system, be they explicit or unintended. The effects of such biases in the context of educational technology, assessment platforms and learning tools for kids can be disastrous. For example, if a biased system fails to understand a child’s accent or dialect, it can consistently tell that child they are a poor reader when in fact they are reading correctly. An unbiased system, on the other hand, can offer fair and uncompromised information to facilitate edtech platforms and services. AI companies need to make a concerted effort to de-bias their technology. To mitigate bias and show transparency, companies could, for example, self-report accuracy against publicly available diverse data sets representing as many demographics as possible.
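
One way to picture this kind of self-reporting is to measure accuracy separately for each demographic group rather than as a single overall figure. The sketch below does this with word error rate (WER), a common speech recognition metric that the glossary itself does not name; the group labels, sample sentences and function names are hypothetical, and a real evaluation would use a large, representative test set.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Minimum number of word substitutions, insertions and deletions (edit distance)."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # reference word missing from output
                            curr[j - 1] + 1,       # extra word in output
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def wer_by_group(samples: List[Tuple[str, str, str]]) -> Dict[str, float]:
    """samples: (group, reference text, recognised text). Returns WER per group."""
    errors: Dict[str, int] = defaultdict(int)
    ref_words: Dict[str, int] = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[group] += word_errors(ref, hyp)
        ref_words[group] += len(ref)
    return {g: errors[g] / ref_words[g] for g in errors if ref_words[g] > 0}


# Made-up data: what each child read, and what the system transcribed.
samples = [
    ("group_a", "the cat sat on the mat", "the cat sat on the mat"),
    ("group_b", "the cat sat on the mat", "the cat sat on a mat"),
    ("group_b", "she reads every day", "she read every day"),
]
report = wer_by_group(samples)
```

A large gap between groups in a report like this would suggest that the system serves some children less accurately than others.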


Voice-enabled Assessment

An alternative to manual assessment that uses speech recognition technology to listen, identify and assess learning invisibly while the child is reading out loud. Voice technology is a powerful assessment tool for use in the classroom and remotely and can provide information that aids with pronunciation assessment, oral reading fluency assessment and dyslexia screening. It offers scalability and objectivity as well as the potential to enhance multiple forms of assessment from language learning, oral fluency and pronunciation to multiple choice or reading comprehension. 

In the context of assessments, voice technology can return a range of key data points that can be used to support and improve educational outcomes for children, and to help determine the type and level of support provided by teachers. Aggregate data of this nature also assists edtech companies in developing more effective learning tools for children.


Data points that could help facilitate voice-enabled assessments include:

  • Transcription – text of what was spoken
  • Utterance/Word/Phoneme confidence – confidence score with which a keyword or phrase was pronounced
  • Start and end time – which can facilitate listening back to audio segments 
  • Reading time duration – required for calculating fluency metrics such as words correct per minute (WCPM)
  • Information about insertions, deletions, substitutions, repetitions, hesitations, false starts and interjections 
  • Information about stress, pitch, intonation and prosody


Keyword Detection

A voice technology service designed specifically for identifying keywords and phrases in speech. Given an audio file containing children’s speech and one or more target keywords or phrases, the system will return a score of how likely it is that the target word or phrase occurred within the audio file. This is particularly useful when analysing child speech, where search terms in an audio file can be identified in isolation, within a sentence, or where there may be background noise. For example, a child could be prompted to name their favourite animal and given a list of animals to choose from. The keyword detection feature returns a score for each of the possible responses. This data is then used to trigger a response or follow-up action within the game or lesson.
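
The favourite-animal example might look something like the sketch below, which assumes the keyword detection service has already returned one score per candidate word. The score values, the 0.6 threshold and the function names are hypothetical; they only illustrate how scores could drive a follow-up action.

```python
from typing import Dict, Optional


def pick_keyword(scores: Dict[str, float], threshold: float = 0.6) -> Optional[str]:
    """Return the highest-scoring keyword, or None if no score reaches the threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


def respond(scores: Dict[str, float]) -> str:
    """Choose the lesson's next prompt from the keyword scores."""
    animal = pick_keyword(scores)
    if animal is None:
        # No keyword was detected confidently enough, so ask again.
        return "I didn't catch that. Which animal is your favourite?"
    return f"Great choice! Let's learn about the {animal}."


# Made-up scores for the prompt "What is your favourite animal?"
scores = {"lion": 0.12, "elephant": 0.91, "giraffe": 0.07}
reply = respond(scores)
```

Using a threshold means the lesson can ask the child to try again instead of acting on a low-scoring guess.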


Pronunciation Assessment

Assesses the quality of the pronunciation of an utterance, word, phrase or phoneme. The output is a confidence or pronunciation score, which compares what the child said to the target or given word, and scores it accordingly down to the phoneme level. 


Fluency Assessment

Fluency assessments are designed to assess children’s oral reading fluency. A child is asked to read a passage, and the numbers of word substitutions, omissions, insertions and correct words are recorded and used to calculate a fluency measurement, such as words correct per minute (WCPM).
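
As a simple worked example of the WCPM calculation, the sketch below counts the words in a transcript that line up with the passage text and scales that count to one minute. It uses Python's standard `difflib.SequenceMatcher` for the alignment, which is a stand-in chosen for brevity rather than the method any particular assessment product uses; the passage, transcript and reading time are made up.

```python
from difflib import SequenceMatcher
from typing import List


def words_correct(passage: List[str], spoken: List[str]) -> int:
    """Count spoken words that align, in order, with words in the passage."""
    matcher = SequenceMatcher(None, passage, spoken, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def wcpm(passage_text: str, spoken_text: str, seconds: float) -> float:
    """Words correct per minute: correct words scaled to a 60-second rate."""
    if seconds <= 0:
        raise ValueError("reading time must be positive")
    correct = words_correct(passage_text.lower().split(), spoken_text.lower().split())
    return correct * 60 / seconds


passage = "the quick brown fox jumps over the lazy dog"
spoken = "the quick brown fox jumped over the dog"   # one substitution, one omission
score = wcpm(passage, spoken, seconds=30)
```

In this example, 7 of the 9 passage words are read correctly in 30 seconds, which works out to 14 words correct per minute.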


Speech Therapy Assessments

Voice technology can identify speech patterns that may point to speech development pathologies and common patterns that can enable appropriate intervention. Voice technology supports the practice of speech patterns and sentence structure. It also supports at-home practice between speech therapy sessions, while providing progress data to speech therapists and their patients.



Privacy-by-Design

Investing in technology, design and processes that ensure individual users’ data privacy rights are protected from the earliest stages in the development of a technology through to the end-user experience. When it comes to kids’ data rights, privacy cannot be an afterthought or designed in at a later stage. Privacy needs to be baked into every level of infrastructure, data and process and be part of the ethos and vision of the developer from the very beginning.

Privacy-by-design commits companies to transparency when it comes to handling data: for example, a commitment to use the data they collect only to improve the service, and not for any commercial purposes such as reselling, profiling or advertising.


Intent Recognition

A form of natural language processing, another subfield of Artificial Intelligence, focused on classifying a voice input based on the speaker’s purpose. Common uses include chatbots and customer support systems, but there are also opportunities to use intent recognition in educational technologies and other solutions intended for children.
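
To make the idea of classifying an input by the speaker's purpose concrete, here is a deliberately simple sketch that labels a transcript by counting overlapping keywords. Production systems would normally use a trained language model rather than hand-written word lists; the intent names and keywords below are invented for illustration.

```python
from typing import Dict, Set

# Hypothetical intents for a reading app, each with a few trigger words.
INTENT_KEYWORDS: Dict[str, Set[str]] = {
    "request_help": {"help", "stuck", "hint"},
    "repeat": {"again", "repeat"},
    "next_page": {"next", "turn", "page"},
}


def recognise_intent(transcript: str) -> str:
    """Return the intent whose keywords overlap most with the transcript."""
    words = {w.strip(".,!?") for w in transcript.lower().split()}
    scores = {intent: len(words & keywords)
              for intent, keywords in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # If nothing matched, report that the intent is unknown.
    return best if scores[best] > 0 else "unknown"


intent = recognise_intent("Can you say that again?")
```

The transcript would typically come from a speech recognition step, and the returned label would decide what the app does next.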

Dr. Amelia Kelly

VP of Speech Technology

Dr. Amelia Kelly is an artificial intelligence engineer and scientist specialising in automatic speech recognition of children’s voices. She is currently VP of Speech Technology at SoapBox Labs and holds a B.Sc. in Physics and Astronomy from NUI Galway, and an M.Phil. and Ph.D. in Linguistics and Speech Technology from Trinity College Dublin. Amelia has more than a decade of experience in speech signal processing, natural language processing, machine learning and artificial intelligence. In her career to date, Amelia has held various positions in industry and academia, including with IBM Watson in Silicon Valley and as a research fellow at Trinity College Dublin. As well as various academic publications, she holds a patent in the area of cognitive computing and is a regular speaker at international technical conferences and industry events. She is a Fulbright TechImpact Scholar for 2020/2021.

SoapBox Labs is committed to delivering world-class voice tech for kids that protects their privacy.

To learn more about our voice technology for kids, email us at