The building blocks of literacy: Assessing short sounds with speech technology for kids

March 10, 2022


This blog post is by Mauro Nicolao, Head of Speech, and Siva Reddy Gangireddy, Senior Speech Engineer

Foundational literacy skills start with letter naming, phonics, and phonemic awareness. These are universally accepted as critical building blocks for future literacy and learning skills.

Children develop language skills from birth, learning sounds, structure, and the association of sounds with objects, but the formal step toward literacy begins when they start to establish the link between sounds and their written symbols.

A child’s level of phonological awareness at the end of their first year in school is a strong indicator of their future reading success. At this stage, there’s a strong focus on aural and oral skills in the classroom. Teachers encourage students to focus on and manipulate individual sounds (phonemes) in spoken words. They also initiate the link between short sounds and visual graphemes. 

Repetition, practice, and correction are essential to embedding these early skills, which are foundational to later skills such as spelling, word decoding, and recognition.

What are short sounds?

Short sounds are the speech sounds, shorter than whole words, that early readers are asked to say out loud.

Normally, short sounds are 

  • letter names such as a /ey/; b /b iy/; c /s iy/,
  • letter sounds (i.e., phonemes) such as a /ah/; b /b/; c /k/, or
  • letter groups or sequences to address specific sounds (i.e., phonics) such as x /k s/; ue /y uw/; ng /ng/.
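
The three categories above can be captured in a small lookup table. This is a minimal illustrative sketch: `SHORT_SOUNDS` and `pronunciation` are hypothetical names, and the entries are only the examples listed above, not a complete inventory.

```python
from typing import List

# Hypothetical lookup of short-sound targets, using only the examples above.
SHORT_SOUNDS = {
    "letter_name": {"a": "ey", "b": "b iy", "c": "s iy"},
    "letter_sound": {"a": "ah", "b": "b", "c": "k"},
    "phonics": {"x": "k s", "ue": "y uw", "ng": "ng"},
}


def pronunciation(kind: str, grapheme: str) -> List[str]:
    """Return the target phoneme sequence for a grapheme as a list of phonemes."""
    return SHORT_SOUNDS[kind][grapheme].split()


print(pronunciation("letter_name", "b"))  # ['b', 'iy']
print(pronunciation("phonics", "x"))      # ['k', 's']
```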

Short sounds can be produced as the result of reading a list of isolated letters or as part of sounding-out existing or nonsense words. 

Assessing short sounds with speech technology

SoapBox Labs’ education customers deliver learning and assessment content across the full literacy journey. From foundational skills to fluency, they use our kid-specific speech technology to enable independent practice and conduct formative and summative assessment of skills development and progress.

Since learning letter names and sounds is a crucial first step in learning to read, our customers wanted to use our speech technology to perform automatic assessments of short sounds by children ages 4 – 6.

SoapBox’s Short Sounds feature

SoapBox responded by setting out to develop our new Short Sounds feature.

We started by defining the following three use case groups and then conducted an in-depth evaluation of our system’s performance against examples from each group:

  1. Individual isolated sounds
  2. Sequential sounds (e.g., sounding out)
  3. Combined sounds (e.g., blending and decoding)

When validating our findings with customers, we also uncovered a requirement to enable customers to manage specific tasks such as “nonsense” words. 

Our Speech Technology and Engineering teams then designed and developed a set of dedicated methods to tackle accuracy in the short sounds domain, as well as tools to support customers in using the Short Sounds feature.

With our Phoneme Breakdown feature, our voice engine already returns confidence scores for the individual phonemes in the words a child utters. Building on this, our new Short Sounds feature allows our voice engine to identify and score individual short sound inputs with unprecedented accuracy.

Phonemic sound-out of words is important for our customers because it allows them to

  • create new tools for automatic reading assessment, 
  • be flexible for different accents and in the use of different phonemic systems, and
  • focus on phonemes and be assured that our voice engine is performant for any learning framework. 

The challenges

Short speech sounds typically last only about 100 – 300 milliseconds (ms). Verifying such brief events is notoriously difficult for speech technology systems, which normally struggle to detect events in that duration range.

Standard speech technology relies on linguistic and acoustic context to boost the detection quality of speech events in an audio file. When sounds are co-articulated into words, the surrounding linguistic context reduces the uncertainty of sound recognition, compensating for detection errors and noisy conditions. In the case of short isolated sounds, context is very limited: the sounds are produced in their stand-alone form, and only a small number of audio frames is available to evaluate.
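
To give a sense of scale, here is a back-of-the-envelope sketch of how few frames a short sound spans. It assumes a 10 ms frame shift, which is a common choice in speech processing but is an assumption for illustration rather than a stated property of the voice engine; `frame_count` is a hypothetical helper.

```python
def frame_count(duration_ms: int, frame_shift_ms: int = 10) -> int:
    """Number of whole frame shifts that fit in a sound of the given duration.

    The 10 ms default is an assumed, typical value, not a documented setting.
    """
    return duration_ms // frame_shift_ms


for duration_ms in (100, 300):
    print(duration_ms, "ms ->", frame_count(duration_ms), "frames")
# 100 ms -> 10 frames
# 300 ms -> 30 frames
```

Under that assumption, a short sound gives the system only a few dozen frames at most to work with.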

Compared to adults’ speech, children’s speech is a well-known challenge when it comes to recognizing structured speech, such as words and paragraphs. In the short-sound domain, the challenges are even greater, as speakers ages 4 – 6 often say more words and sounds than a given pronunciation exercise requires.

Markup solutions for short sounds

Given the challenges presented by short sounds, we needed to tailor our voice engine to achieve the same performance on short sounds as it does on longer speech.

Our main objective was to clearly differentiate the isolated phonemes from the other use case groups. 

In order to better address our education customers’ requirements, we created three markup solutions.

For a detailed discussion of markup and the role it plays in SoapBox’s speech technology, read this blog by our colleagues Lora Lynn and Niall.

In the following sections, we describe the three markup tags and include the results of some tests that we conducted for each use case. In the audio that we tested, a five-year-old child was asked to pronounce a letter name, sound out a word, and attempt to pronounce a nonsense word that they had never seen before.

1. Letter names

If the child is asked to say a letter name, the target needs to be wrapped in the <letter> markup. For example, to score the pronunciation of the letter name “g” (i.e., /jh iy/), use:

<letter>g</letter>

In this audio, the child articulated the name of the letter perfectly.

Results returned from our system:

  • Word: “g”, quality score: 95.0
  • Phone breakdown:
    • phone: /jh/, quality score: 92.0
    • phone: /iy/, quality score: 92.0
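
A target string like the one used in this test can be assembled programmatically. The helper below is a hypothetical sketch that only builds the markup text; `letter_target` is not part of any SoapBox API, and how the string is submitted to the voice engine is not shown.

```python
def letter_target(letter: str) -> str:
    """Wrap a single letter in the <letter> tag so its letter name is scored."""
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("expected a single alphabetic letter")
    return f"<letter>{letter}</letter>"


print(letter_target("g"))  # <letter>g</letter>
```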

2. Letter sounds

If the child is asked to say a letter sound, the target needs to be wrapped in the <sound-out> markup. For example, to score the pronunciation of the letter sounds in “south” (i.e., /s aw th/), use:

<sound-out>south</sound-out>

In our test, the child struggled with the last sound “/th/”, as clearly reflected in the system score. 

Results returned from our system:

  • Word: “south”, quality score: 74.0
  • Phone breakdown:
    • phone: /s/, quality score: 90.0
    • phone: /aw/, quality score: 96.0
    • phone: /th/, quality score: 36.0
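
A phone breakdown like this makes it easy to flag the sounds a child needs to practice. The sketch below is illustrative only: the dictionary layout is a hypothetical structure that mirrors the scores reported above (the real response format may differ), and the threshold of 50 is an arbitrary assumption.

```python
from typing import List

# Hypothetical result structure mirroring the scores reported above;
# the actual response format of the voice engine may differ.
result = {
    "word": "south",
    "quality_score": 74.0,
    "phones": [
        {"phone": "s", "quality_score": 90.0},
        {"phone": "aw", "quality_score": 96.0},
        {"phone": "th", "quality_score": 36.0},
    ],
}


def weak_phones(result: dict, threshold: float = 50.0) -> List[str]:
    """Return the phones whose quality score falls below the (assumed) threshold."""
    return [p["phone"] for p in result["phones"] if p["quality_score"] < threshold]


print(weak_phones(result))  # ['th']
```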

3. Nonsense words

Verifying out-of-vocabulary (nonsense) words was a big challenge, given that the pronunciations of those words may not be part of our pre-trained recognition models.

The <custom-word> markup tag is designed specifically for this use case. The pronunciation (i.e., the phonetic sequence) of the custom word must be provided in the markup itself. For example, to score the pronunciation of the nonsense word “voos” (i.e., /v uw s/), use:

<custom-word pronunciation = "v uw s">voos</custom-word>
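
For a list of nonsense words, the tag can be generated from the word and its phoneme sequence. This is a minimal sketch under stated assumptions: `custom_word_target` is a hypothetical helper, the attribute is written with straight double quotes, and the spacing around `=` follows the example above. Whether the voice engine accepts other spellings of the tag is not confirmed here.

```python
from typing import List
from xml.sax.saxutils import escape, quoteattr


def custom_word_target(word: str, phones: List[str]) -> str:
    """Build a <custom-word> target from a word and its phoneme sequence."""
    if not phones:
        raise ValueError("a custom word needs at least one phoneme")
    # Phonemes are space-separated inside the attribute, as in the example above.
    pronunciation = " ".join(phones)
    return (
        f"<custom-word pronunciation = {quoteattr(pronunciation)}>"
        f"{escape(word)}</custom-word>"
    )


print(custom_word_target("voos", ["v", "uw", "s"]))
# <custom-word pronunciation = "v uw s">voos</custom-word>
```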

Unlike the <sound-out> use case, the phonemes in the phone breakdown are scored as co-articulated sounds in a word, not as isolated sounds.

In this example, the child could not pronounce the nonsense word correctly. They couldn’t pronounce the initial /v/ properly, and they said /ao/ instead of the medial vowel /uw/.

Results returned from our system:

  • Word: “voos”, quality score: 57.0
  • Phone breakdown:
    • phone: /v/, quality score: 57.0
    • phone: /uw/, quality score: 22.0
    • phone: /s/, quality score: 86.0

What’s next?

Introducing a dedicated set of systems that specifically focus on short sounds gives learners a powerful tool to practice their literacy skills and gives teachers a powerful tool to assess them.

We have three lighthouse customers using a Beta version of Short Sounds, and it will be released to all customers in April 2022. 

Reach out to our speech tech experts to discuss your educational use case for Short Sounds.

Want to learn more about our speech technology for kids?

Visit The SoapBox Tech Blog for the latest articles and stories from the Speech Tech, Engineering, and Product teams at SoapBox on how our voice engine works and tips and tricks for designing voice experiences for kids. 

Also, head on over to the rest of our site for videos, guides and a whole host of additional resources centred around the development and promotion of child-centric speech-enabled learning innovation.
