Giving words meaning: The role of markup in speech technology for kids
March 10, 2022
If you read “1881”, how would you pronounce it?
Perhaps “eighteen eighty one” if it’s a year, “one thousand eight hundred eighty one” if it’s an amount, or “one eight eight one” if it’s part of a phone number.
From this one example, it’s evident that words and sentences can be ambiguous when they’re in a text format, and even when spoken, too. But in a conversation, a person can clarify the meaning of an ambiguous word or phrase by simply asking.
In speech technology, where we’re dealing with audio recordings, it’s not possible to ask for clarification. Yet, we still need to know what the correct meaning is.
To avoid ambiguous phrases — and ensure our customers get the most accurate transcription of a child’s speech — the Speech Technology and Engineering teams at SoapBox Labs just completed a markup project that spanned several months.
This blog is dedicated to explaining what markup is and its role in delivering accuracy in our speech technology for kids.
Markup is part of a larger process called “text normalization.” Let us explain that first.
What is text normalization?
In speech technology, text normalization must be used on all input words, sentences, and paragraphs. Normalization takes a potentially messy piece of text and creates canonical forms of all the words and sentences. In our case, since we work with speech, this often means changing the written form into the spoken form. This can include removing punctuation and lowercasing all words, for example.
Here’s an example of a normalized sentence:
“I have 11 you can borrow,” she said. → i have eleven you can borrow she said
When we normalize digits, we transform the digits into words to get rid of the ambiguity. However, if all we see is the digits, we have no idea how it should be said.
This also holds true for more than just numbers. Should “ACA” be pronounced letter-by-letter like “YMCA” or as a single word like “NASA”? Should “$1.20” be pronounced “one dollar twenty” or “one dollar and twenty cents”?
Markup allows customers to clarify this ambiguity up front and ensure they get the correct pronunciation for “ACA” and “$1.20.”
The role of markup in speech technology
Markup is a method of annotating and giving context to a piece of text. This annotation specifies that this particular part of the text is different from the rest and should be handled in a particular way. Markup is also referred to as “tagging” or “annotation.”
Many markup languages exist — two of the most popular being HTML for creating web pages and interfaces, and XML for structuring data — but our markup is specifically for natural language processing (NLP) purposes and gives customers better control of how their reference text is normalized by our voice engine.
An example of markup in action
With markup, rather than having customers go through their reference text and change text like “1960” to its fully normalized form of “nineteen sixty”, they can wrap a <year> markup tag around it to specify how it should be normalized.
In a customer’s voice-enabled reading exercise, for example, where a child is asked to read “Julie was born in 1960,” the customer would send that reference text to our voice engine as “Julie was born in <year>1960</year> .” This informs our voice engine that the child should pronounce “nineteen sixty,” not “one thousand and sixty,” “one nine six zero,” etc.
The current use cases of SoapBox’s markup language
Years, decades, and numbers
At SoapBox, one of the interesting use cases we found for using markup was in giving customers the ability to not only disambiguate words for normalization but to distinguish or specify certain pronunciations of words.
Pronunciation specification can be important for a number of reasons, including phonics and speech therapy situations.
Another reason is that languages have different dialects, and it may be necessary for our customers to ensure that a child is using the correct pronunciation for a specific dialect.
For example, British English pronounces the letter name for “H” as “hey-ch,” whereas American English pronounces this as “ey-ch.” If a British student is using voice technology with a system that will only score highly on the American pronunciation, this is a problem for both the student and the teacher in that they are unfairly penalized for speaking their native dialect.
This same problem extends to other dialects, such as African American English (AAE) or Latino English (LE). Although our system is trained to handle all dialects of English, mark-up allows the teacher or user to specify which pronunciation they expect and want the child to use, whether that is British, American, AAE, LE, etc.
Our voice engine currently has markup tags for letters, custom words, and sounding out:
Letters are normalized using the <letter> tag. They have an optional pronunciation attribute, which allows customers to specify the desired phoneme breakdown.
The custom-word tag (used with the compulsory pronunciation tag) is used for custom or made-up words.
Sounding out a word can be achieved using the sound-out tag, and there’s an optional pronunciation attribute for specifying the phoneme breakdown.
To learn more about how the SoapBox voice engine handles the pronunciation of short sounds, check out this blog by our colleagues Mauro and Siva.
These are just the current use cases of SoapBox’s markup language. Until machine learning is able to infer the intention of reference text, we’ll continue to add more markup tags to allow for explicit configuration by our customers.
Want to learn more about our speech technology for kids?
Visit The SoapBox Tech Blog for the latest articles and stories from the Speech Tech, Engineering, and Product teams at SoapBox on how our voice engine works and tips and tricks for designing voice experiences for kids.
Also, head on over to our website for videos and additional resources.