Google Voice User Interface Display at CES 2018
One of the main themes at CES 2018 is the maturation of technologies using voice as the primary user interface. Key players like Amazon and Google are leveraging their cloud infrastructure and AI software to position themselves as fundamental cogs in the upcoming voice-activated world.
In this blog post, Peter Cottreau, VP of Electronics at Design 1st, discusses the underlying architecture of Voice User Interface (VUI) and what it means for VUI prototyping and hardware developers.
What is Voice User Interface Architecture?
The VUI architecture that is emerging is a speaker/mic device that captures voice samples, which are sent over the home internet connection to voice interpretation software in the cloud. What comes back is an actionable digital command that is interpreted locally and effects the desired control over an IoT device in the home.
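As a rough illustration, the sketch below traces that round trip: capture an audio sample, ship it to a cloud interpretation service, and act locally on the command that comes back. The endpoint URL, payload format, and command fields are placeholders for the purpose of the sketch, not any particular vendor's API.

```python
# Minimal sketch of the cloud-based VUI flow described above.
# The endpoint, payload format, and device API are illustrative placeholders.
import json
import urllib.request

CLOUD_VUI_URL = "https://voice-service.example.com/interpret"  # hypothetical endpoint


def capture_voice_sample() -> bytes:
    """Return a short audio clip from the device microphone (stubbed here)."""
    return b"\x00" * 16000  # placeholder: one second of silence


def interpret_in_cloud(audio: bytes) -> dict:
    """Send the audio over the home internet link and get back a digital command."""
    request = urllib.request.Request(
        CLOUD_VUI_URL,
        data=audio,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def apply_command(command: dict) -> None:
    """Interpret the returned command locally and drive the IoT device."""
    if command.get("intent") == "set_light":
        print(f"Setting light '{command['device']}' to {command['level']}%")


if __name__ == "__main__":
    audio = capture_voice_sample()
    command = interpret_in_cloud(audio)  # round trip to the cloud service
    apply_command(command)               # local actuation in the home
```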
The 4 Challenges of Voice UI Architecture:
- Privacy and security concerns. Consumers are concerned about the "big brother" feel of this model, which is limiting the "always listening" behavior required for wholesale adoption and delivery of the full user experience. A number of hardware products have been developed to help mitigate these concerns, but they come at additional cost and add UI friction.
- Network availability and access cost. Not all countries and locations enjoy the network access and reliability of the big cities. During network outages, access to things may prove frustrating in a world where voice activation has come to be relied upon.
- Network latency. Slow or congested networks can result in command latencies that hamper or ruin the user experience.
- Language support. Current popular cloud voice services support a very limited number of languages. Furthermore, variations in accents pose serious problems and can result in very low voice-to-command success rates.
Voice Processing Architecture
Solution to VUI Architecture Challenges: Local Voice Interpretation
In the face of these challenges, there is a strong case for augmenting the current architecture with a solution in which simple IoT device command and control is interpreted locally, while more complex or open-ended queries are directed to the cloud services.
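A minimal sketch of that hybrid routing is shown below, assuming a hypothetical on-device phrase table and a stubbed cloud fallback; neither represents Fluent.ai's or any other vendor's actual interface.

```python
# Sketch of the augmented architecture: simple command-and-control phrases
# are resolved on the device, while anything the local model does not
# recognize is deferred to the cloud. The phrase table and cloud stub are
# illustrative only.
from typing import Optional

LOCAL_COMMANDS = {
    "lights on": {"intent": "set_light", "level": 100},
    "lights off": {"intent": "set_light", "level": 0},
    "lock the front door": {"intent": "lock", "device": "front_door"},
}


def interpret_locally(phrase: str) -> Optional[dict]:
    """Fast, on-device lookup for simple command phrases."""
    return LOCAL_COMMANDS.get(phrase.strip().lower())


def interpret_in_cloud(phrase: str) -> dict:
    """Placeholder for a round trip to a cloud voice/NLU service."""
    return {"intent": "unknown", "query": phrase}


def interpret(phrase: str) -> dict:
    """Route simple commands locally; send open-ended queries to the cloud."""
    command = interpret_locally(phrase)
    if command is not None:
        return command  # millisecond path, no network round trip required
    return interpret_in_cloud(phrase)


if __name__ == "__main__":
    print(interpret("Lights on"))                     # handled locally
    print(interpret("What's the weather tomorrow?"))  # deferred to the cloud
```

The design point is simply that the latency-sensitive, privacy-sensitive commands never leave the home, while the cloud is reserved for queries that genuinely need it.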
Example of Local Voice Interpretation Technology
Fluent.ai is a solution provider in this space, with voice interpretation technology capable of recognizing hundreds of command phrases at low latency using a small fraction of the compute and memory resources available on the average cell phone.
Decoding is performed on the local device, so privacy and security concerns are alleviated and public network availability is not an issue.
With decode latencies in the millisecond range, the technology is well suited to local command-and-control applications. The technology is language-independent and very tolerant of accent variation.
What do Google and Amazon think?
These benefits are not lost on the incumbent suppliers. Amazon, Google, and others will likely look to augment their solutions to provide similar benefits, but at least for the time being, players like Fluent.ai have a compelling offering, one we are exploring with several IoT voice prototype projects.
Amazon Echo Teardown in the Design 1st Lab
What does this mean for IoT hardware developers?
Current state-of-the-art low-cost connected controllers have adequate processing power to bring simple voice control directly to many IoT devices. These controllers often support the required DSP functionality and are generally fast enough for the job. One of the bigger challenges is memory footprint.
With memory requirements in the 4-6 MB range, tradeoffs will be required. Low-cost embedded processors will still require external memory, keeping the cost of voice-enabling a device above $5 for a while yet, so it is unlikely you will be talking directly to your lightbulbs anytime soon.
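As a back-of-the-envelope illustration of that tradeoff, the figures below are generic ballpark numbers for a low-cost connected controller, not the specifications of any particular part.

```python
# Rough memory-budget check for on-device voice interpretation.
# All figures are illustrative ballpark values, not datasheet numbers.

VOICE_MODEL_FOOTPRINT_MB = 5.0  # mid-point of the 4-6 MB range quoted above
ONCHIP_FLASH_MB = 2.0           # typical on-chip flash on a low-cost connected MCU
ONCHIP_RAM_MB = 0.5             # typical on-chip SRAM

flash_shortfall = max(0.0, VOICE_MODEL_FOOTPRINT_MB - ONCHIP_FLASH_MB)
print(f"Model footprint:     {VOICE_MODEL_FOOTPRINT_MB:.1f} MB")
print(f"On-chip flash:       {ONCHIP_FLASH_MB:.1f} MB")
print(f"External memory gap: {flash_shortfall:.1f} MB -> added BOM cost")
```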