The Co-Evolution of Speech Technology: A Road Map for Adaptation

March 19, 2019


Laura Kusumoto, Former Director of Technology Innovation Enablement, Disney Parks & Resorts

Not long ago, the sight of someone out in public speaking aloud – to no one in particular – aroused concerns that the person was delusional, talking to an imaginary friend or foe. Today it is common for phone users to converse with distant friends while strolling, blurting private details into the air as if no one around them could hear. There was a moment in time when our perception flipped, from not wanting to look foolish chatting with invisible friends to accepting this behavior as the norm.

With further advancements in conversational technologies and the data streams and knowledge bases they tap into, a few years from now, chances are that we will also be chatting in public with our digital assistants. These AI voice bots will not only take verbal commands to navigate, buy tickets, or play music, as Alexa, Google Assistant and Siri do today, but they may also be keeping a digital journal, providing reminders, diverting our attention with personalized advertising, and possibly, providing advice and support.

As these advancements are absorbed into consumer products and grab the headlines, we might also take note of their impact on our behavior – and ultimately, our welfare. Undeniably, a coevolution is taking place between AI voice-enabled systems and the behaviors of the people who use them.

As a biological phenomenon, coevolution was first identified by Charles Darwin in On the Origin of Species, when he observed that certain flowers and insects could not have evolved into existence without one another. Their relationship is symbiotic, in that each needs the other for ongoing survival.


While not biological, our relationship with conversational systems is becoming symbiotic. For our own survival, then, we might ask: where are the AI voices taking us? Who benefits from them, and how? What are we losing in the process? Who or what is driving progress, and is there a road map we can use to encourage beneficial results?

Fueled by competition to win the favor of consumers, conversational technologies are being designed to adapt to our preferences. The holy grail of the digital consumer product is to provide “frictionless” interactions with the most profitable actions in the system. These impose no impedance to the behaviors that the sponsor most wants to encourage, such as making a purchase, favoriting an item, lingering in a shopping area, or sharing a product with friends. Reducing friction also entails minimizing time-to-satisfaction; for example, the Alexa for Hospitality product at Marriott hotels enables guests to speak their wishes in-room and obtain amenities and information, such as room service and checkout times, without waiting for a human receptionist or concierge to acknowledge them. For consumers, instant gratification is one allure of the frictionless interface, and it is influencing our behavior.

A key attribute of this experience is that it doesn’t require us to learn how to operate it. Natural speech – learned by nearly all human beings in childhood – is the ultimate frictionless experience. It requires minimal effort to express our preferences and sentiments, as it has for millennia.

As with many technologies that mimic human capabilities, though, our expectations outpace the progress of conversational technologies. Anyone who converses with an AI voice-enabled call center or device quickly learns that these assistants still don’t handle natural speech all that well. Speech is a hallmark of human intelligence, and we expect anything that speaks to be intelligent – which can be funny, or highly frustrating, when it’s not.

Although they are improving, speech recognition technologies cannot yet handle speech that is fluid, run-on, raspy, off-topic, heavily accented, or spoken in a very noisy environment. They still require users to speak a limited, coded language – home devices require a “wake word”, and all of the assistants require highly structured speech – to enjoy the benefits of playing music, hearing the news, or turning off the lights. When processing a command they don’t understand, or when listening when they shouldn’t, digital assistants go off course. Only within their designated domains can they recognize and declare what they don’t understand.
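The “limited, coded language” described above can be illustrated with a minimal sketch. The wake word and command patterns here are hypothetical, not any vendor’s actual grammar; the point is only to show how narrow the recognized language is, and how an in-domain assistant can at least declare what it didn’t understand:

```python
import re

WAKE_WORD = "computer"  # hypothetical wake word for this sketch

# A small grammar of highly structured commands the assistant understands.
COMMANDS = [
    (re.compile(r"play (?P<song>.+)"), "play_music"),
    (re.compile(r"turn off the (?P<device>.+)"), "device_off"),
    (re.compile(r"what's the news"), "read_news"),
]

def interpret(utterance: str):
    """Return (intent, slots) for a recognized command, or None."""
    text = utterance.lower().strip()
    # Everything that doesn't begin with the wake word is simply ignored.
    if not text.startswith(WAKE_WORD):
        return None
    text = text[len(WAKE_WORD):].strip(" ,")
    for pattern, intent in COMMANDS:
        match = pattern.fullmatch(text)
        if match:
            return intent, match.groupdict()
    # In-domain but unrecognized: the assistant can declare it didn't understand.
    return "unknown", {"heard": text}
```

Speech that deviates even slightly from these patterns falls through to the “unknown” branch – a toy version of why fluid, run-on speech still defeats these systems.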

Significantly, AI voice assistants are largely transactional today. They store neither context nor personal information, in part due to privacy concerns about passing that data across corporate and cloud boundaries. Goal-directed AI assistants can be constructed to follow a dialog flowchart through a predetermined line of questioning – say, with the goal of completing a travel booking. However, the artificial intelligence does not yet exist to sustain an open-ended or in-depth conversation.
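The flowchart-driven, goal-directed dialog described here can be sketched as a simple fixed sequence of questions. The states and prompts below are illustrative, not drawn from any production system; the sketch shows why such a dialog feels rigid: the line of questioning is predetermined, and nothing outside it can be discussed:

```python
# The "flowchart" is a fixed ordering of slots to fill on the way
# to the booking goal, each paired with the prompt the system speaks.
FLOW = [
    ("destination", "Where would you like to fly?"),
    ("date", "What day are you traveling?"),
    ("passengers", "How many passengers?"),
]

def run_booking_dialog(answers):
    """Walk the predetermined line of questioning, consuming one user
    answer per question, and return the completed booking record."""
    booking = {}
    replies = iter(answers)
    for slot, prompt in FLOW:
        # In a real system: speak `prompt`, then listen for the reply.
        booking[slot] = next(replies)
    return booking
```

For example, `run_booking_dialog(["Tokyo", "May 3", "2"])` yields a filled booking; an answer that addresses anything other than the current question has nowhere to go, which is the gap between this kind of scripted dialog and an open-ended conversation.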

The benefits of speech technologies derive in part from their connection to back-end knowledge bases and data. Some of their current limitations derive from knowledge sources that are siloed for reliability and security, as well as for proprietary or commercial reasons. Hence at present, the Domino’s Pizza assistant can’t find a gas station or play a song, while Siri can’t access your bank account to pay your bills.

The design criteria for various assistants are differentiated as well. A digital assistant serving the public on behalf of a large enterprise must be held accountable for the information it provides, and it must uphold brand values in the words it chooses and the voice it uses. An assistant providing vehicle navigation instructions must be reliable and precise, whereas an assistant asked to play a song can be forgiven for picking the wrong tune.

The consumer digital assistant and call center uses of speech technologies are only the first of many future applications. There are multiple opportunities to deploy them to improve social welfare and increase economic engagement outside the current digital consumer demographics. They represent significant research and development opportunities. Examples of possible applications include:

Listening to the elderly. While conversational devices cannot replace the attention and physical care that elders need, they can help to fill gaps in time for those who are no longer mobile or are suffering from dementia – much the way a television does, but in a more personal and engaging way. The Gerijoy device and service provides an example of this concept, while also illustrating that speech technologies are not yet ready to serve this purpose unaided: Gerijoy employs human operators in the Philippines to speak with elderly clients at home alone or in assisted-living facilities. The system includes a tablet for the user to speak to, while operators appear as animated animals on the screen. This type of service is in its early stages of development, but there are many opportunities to provide targeted services for the elderly and disabled, provided that users’ needs are carefully addressed.

Accessing entitlements. Voice-enabled systems could be deployed to reduce the tedium and frustration of accessing information, services, and entitlements from insurance or government programs. In the US, the Medicare bureaucracy is tasked both with warding off fraud and with serving elderly and disabled insureds. For these populations, filling out complicated forms can be extremely difficult if not impossible, making them further dependent on others. A voice-enabled system could help deter fraud by using a voiceprint for biometric identification, while enabling a more conversational dialog that captures the data needed to fill out forms. The technology provider Interactions developed a system for Humana that enabled Medicare insurance applicants to fill out a detailed, lengthy form entirely with voice input over the phone. The Interactions system, however, used human listeners to fill in gaps that the automated system could not reliably hear or understand. Future innovations could make systems like these more conversational and more useful.

Aids for the hearing impaired. Ironically, speech technologies are well positioned to help those who cannot hear well, because the technology can hear, transcribe, and display words, effectively providing “closed captioning” for the real world. If they were ever-present and accessible, transcription apps could help a hearing-impaired person function almost normally in social situations. The World Health Organization estimates that 25% of people over the age of 65 are affected by disabling hearing loss, as are 5% of the overall population (466 million people). The potential market is large, but consumer products in this arena are still sparse.

No more typing. For first-world users, conversational technologies hold the promise of finally freeing us from the need to type. Typing is learned later in life than speaking and engages different parts of the brain. Having to type creates a communication barrier for those who don’t have their hands free, who never learned to type, whose hands are too large for the device, or who are arthritic. In the near future, third-world users may be able to access all of the computing applications available to the developed world without ever having to learn to type or thumb the keyboards on their mobiles. As with many technological innovations, it is now becoming possible for developing countries to leapfrog developmental stages and gain the benefits of the latest advances even before many people in developed economies.

Future systems must take into account the privacy needs and rights of their users, and the security of the information vocalized. As one example of a clear bug that was quickly resolved, a user of Amazon Echo devices in Oregon found that they were recording her conversations and sending them by email to one of her husband’s employees. More often, the issues that arise are due not to bugs or design flaws in the device but to unforeseen circumstances surrounding private speech. In 2016, Amazon was asked by investigators in Arkansas to share the Alexa recordings from a household where a man was found dead in the hot tub. Evidence like this is often sought from mobile phones and email, but the introduction of more speech-enabled devices in homes and hotel rooms invites the added capture of private and intimate speech that most users would never consider sharing via a phone.

An intrinsic design flaw of all voice applications is that they lack privacy when the user is within earshot of other people. The well-designed, AI voice-enabled call center may be a delight to use even in a room full of people – until it asks for your birth date or credit card number. This issue may extend the use of typing for the foreseeable future, at least until room features such as “phone booths” can be provided for private conversations in public places.


The coevolution of speech technologies and human beings is probably most important for the children who are adopting AI voice devices. In the near term, our concerns may be focused on their security, on shielding them from predators, and on protecting them from self-harm (or from harming parents’ bank accounts by ordering expensive products). Educators may muse about the educational impact of knowledge on-demand, provided merely for the asking, on little ones who don’t yet know how to spell and type the question. Little research has yet been done on children’s perceptions of digital assistants; even less is known about the impact on their cognitive development.

But children are the ones who will be most affected by these advances, and who will be most able to coevolve with them. One thing is certain: whatever emerges, we will probably be amazed by the impact on both sides of the coevolution.

Laura Kusumoto is an experienced pathfinder for technology innovation in the development and commercialization of new digital products and services. As an innovation leader with Kaiser Permanente and then Walt Disney Parks and Resorts, she promoted a portfolio of innovations that take advantage of emerging technologies such as internet of things (IoT), artificial intelligence (AI) and conversational platforms, gamification and avatars, by engaging business stakeholders and technology partners in understanding and adopting them. Earlier in her career, Laura served in executive management and software product development roles in multiple startup and established companies, including LEGO, Intuit, and Price Waterhouse (now PwC).