How SoapBox maximizes availability across our web services
February 9, 2023
This blog post is by Robert O’Regan, VP of Engineering.
At SoapBox, we’re focused on delivering reliable, accurate, and robust child speech recognition services that can be accessed in a variety of ecosystems, including cloud and edge. Some of our most popular services are our cloud-based solutions, which are available via industry-standard RESTful web services. These web services are designed to operate at scale, are deployed in multiple data centers globally, and process millions of customer requests every month.
Each data center that hosts our services has the capacity to handle the large volumes of traffic we serve globally every day and is configured to scale automatically to meet demand. Our extensive monitoring lets us track regional health closely and intervene quickly if needed. We analyze capacity levels and regularly adjust our scheduled scaling and auto-scaling triggers to keep the platform reliable and responsive enough to meet the needs of our global customer base.
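To illustrate how a scheduled capacity floor and a load-based trigger can work together, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of SoapBox's configuration or of any Azure API: the function name, the CPU metric, the 60% target, and the instance limits are all hypothetical.

```python
import math


def desired_instances(current, cpu_percent, target_cpu=60.0,
                      scheduled_min=2, max_instances=20):
    """Return how many instances a region should run (hypothetical rule).

    current        -- instances running now
    cpu_percent    -- average CPU utilization across those instances
    target_cpu     -- utilization the auto-scaling trigger aims for (assumed)
    scheduled_min  -- floor set by scheduled scaling, e.g. raised before a known peak
    max_instances  -- hard ceiling on capacity (assumed)
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    # Scale proportionally so that average utilization moves toward the target.
    proportional = math.ceil(current * cpu_percent / target_cpu)
    # The scheduled floor and the ceiling always win over the load-based figure.
    return max(scheduled_min, min(max_instances, proportional))


# Load-based trigger: 4 instances at 90% CPU need more capacity.
scale_out = desired_instances(current=4, cpu_percent=90.0)
# Scheduled scaling: a raised floor holds capacity even when load is low.
pre_peak = desired_instances(current=4, cpu_percent=30.0, scheduled_min=8)
```

In this sketch the scheduled floor acts as a lower bound that the load-based calculation can never undercut, which is one simple way to combine the two mechanisms.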
Cloud infrastructure reliability and redundancy
The primary environment for building and delivering our speech services is the cloud, and we maintain close relationships with our primary cloud infrastructure vendor, Microsoft Azure. Jointly, we undertake regular assessments of the reliability and robustness of our infrastructure to understand where opportunities exist to further strengthen platform reliability and redundancy.
Mitigating outages and disruptions
Given the nature of cloud-based web services, and the underlying cloud service providers (CSPs), outages and disruptions can occur, and at SoapBox, we plan for those eventualities.
Our CSP is constantly intercepting and mitigating DDoS attacks against the platform. Recently, they handled one of the largest DDoS attacks ever. Inevitably, some attacks lead to interruptions on Azure services that SoapBox uses, such as Azure Front Door.
In the summer of 2022, some of the CSP’s infrastructure was unable to cope with sustained, abnormally high temperatures, which degraded service for some SoapBox customers. While some of these events are beyond our control, we constantly monitor and review how our platform is performing and look for areas where we can add redundancy and improve reliability.
When issues do occur, we need to react in a timely manner. Thanks to how our platform is designed, configured, built, and deployed, we can adjust in real time where customer traffic is routed, removing problem regions if needed. This ensures that any issues localized to specific regions are mitigated as quickly as possible.
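The idea of removing a problem region from rotation can be sketched in a few lines of Python. This is a simplified model under assumptions: the `Region` type, the region names, and the weights are hypothetical, and a real deployment would make this change through the traffic manager or load balancer in front of the regions rather than in application code.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool = True
    weight: int = 1  # relative share of traffic while the region is healthy


def routing_table(regions):
    """Return each healthy region's share of traffic as a fraction of 1.0."""
    active = [r for r in regions if r.healthy and r.weight > 0]
    if not active:
        raise RuntimeError("no healthy regions available to serve traffic")
    total = sum(r.weight for r in active)
    return {r.name: r.weight / total for r in active}


# Hypothetical regions; region-c has twice the capacity of the others.
regions = [Region("region-a"), Region("region-b"), Region("region-c", weight=2)]
before = routing_table(regions)

# Monitoring flags region-b, so it is taken out of rotation.
regions[1].healthy = False
after = routing_table(regions)
```

Once the region is marked unhealthy, its share is redistributed across the remaining regions in proportion to their weights, so traffic stops reaching the problem region without any change on the customer side.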
In the event that a disruption to service is unavoidable, we strive to post timely updates on our status page to keep customers informed.
Availability is key, production is everything
Every engineer at SoapBox shares responsibility for ensuring platform stability and availability. We invest in best-of-breed systems and tools and work hard to establish and maintain a positive team culture in which platform reliability and availability are as important as the high levels of accuracy our voice engine delivers to our customers.
Without rock-solid availability and reliability, we can’t deliver.
Production is the frontline: it’s what customers see and use on a daily basis. Ensuring quality in production is key to availability and reliability. Whether it’s product updates, bug fixes, model training, software builds, web service deployment, or infrastructure provisioning, we aim for the highest levels of quality at every step of the development process so that reliability, scalability, and accuracy are baked in for our customers.
Ensuring high operational standards
Despite best efforts and intentions, issues can and will occasionally occur that have the potential to impact service availability and reliability. We continue to invest heavily in our monitoring tooling and to evolve how we use it and how we build and deploy software.
Every engineer at SoapBox understands how our platform functions and how the code they ship behaves in a production environment.
At SoapBox, we hold ourselves to very high operational standards. We operate to budget, striving to minimize costs where possible, address issues promptly, mitigate security risks, and ensure platform accuracy, all the while providing the most reliable speech service possible.
We pride ourselves on the fact that we’ve never experienced a full outage where all service has been lost.
Strong technical foundations
We build our speech services upon a small number of core technologies that form our main tech stack. Our teams have decades of experience building with these technologies, and our architecture choices and implementation patterns are simple and proven; we follow industry best practice. The SoapBox engineering team knows what it takes to design and build for reliability, with solutions that are proven at scale.
We actively invest in building and maintaining systems, tooling, and approaches that support our ability to push changes quickly to production in a controlled and safe manner. When it comes to production, we have fine-grained control of the entire platform. We can deploy changes to a small percentage of customer traffic or to a small subset of customers or regions in order to better understand their impact. We can also revert and recover quickly by simply opting to “roll back” to a safe working version of the entire platform.
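A percentage-based rollout with a quick rollback can be sketched as follows. This is a minimal Python illustration under assumptions, not SoapBox's deployment tooling: the `Rollout` class, the version labels, and the hash-based bucketing scheme are hypothetical choices made for the example.

```python
import hashlib


def bucket(customer_id: str) -> int:
    """Map a customer ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


class Rollout:
    """Route a fixed percentage of customers to a candidate version."""

    def __init__(self, stable: str, candidate: str, percent: int = 0):
        self.stable = stable
        self.candidate = candidate
        self.percent = percent  # 0 = nobody on the candidate, 100 = everybody

    def version_for(self, customer_id: str) -> str:
        # The same customer always lands in the same bucket, so raising
        # `percent` only adds customers; it never reshuffles existing ones.
        if bucket(customer_id) < self.percent:
            return self.candidate
        return self.stable

    def roll_back(self) -> None:
        """Send all traffic back to the known-good version."""
        self.percent = 0


rollout = Rollout(stable="v1", candidate="v2", percent=5)
customers = [f"customer-{i}" for i in range(1000)]  # hypothetical IDs
early = {c for c in customers if rollout.version_for(c) == "v2"}

rollout.percent = 20  # widen the rollout once the change looks healthy
wider = {c for c in customers if rollout.version_for(c) == "v2"}

rollout.roll_back()   # revert: every customer is served by v1 again
```

Hashing the customer ID keeps each customer on a consistent version for the duration of the rollout, and rolling back is a single change to the percentage rather than a redeployment.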
Providing 24/7 support for global customers
Our customers are global, which means we provide engineering and incident management support 24/7/365. We work continuously to strengthen and protect our service reliability and availability. When we encounter an incident that potentially impacts our customers, our extensive monitoring and notification tooling picks it up, and our incident response process kicks into gear.
Our on-call team is online and responds within minutes of being notified, and their immediate focus is on minimizing customer impact. Typically, we resolve incidents within minutes, posting updates to our status page where applicable while we work to restore optimal service as quickly as possible.
A key part of our incident management process is the incident review, where we dig into the root cause, identify any contributing factors, and, where applicable, provide postmortems to customers.
Learn more about SoapBox
Maximizing availability and reliability across our web services is just one (very important!) part of how we deliver an accurate and robust speech recognition system to our education and entertainment customers.
To learn more about how we build our speech technology, visit our Medium page for more tech blogs from our Engineering, Speech Technology, and Product teams.