How Image Processing Enables Artificial Intelligence to Recognize Speech
Artificial intelligence has reached new heights in the last decade, with technology companies like Google, Amazon and Facebook all investing heavily in machine learning algorithms. One of the most familiar examples of this progress is Apple’s Siri, which lets anyone with an iPhone or iPad reach a voice-activated personal assistant from anywhere in their home simply by saying “Hey Siri.” How does this technology work? We’ll explain how image processing enables speech recognition in artificial intelligence through the following points.
Background
Researchers have developed an artificial neural network, or ANN, that can analyze video and audio files and decide with at least 90 percent accuracy whether a file contains someone speaking. This has raised new concerns about privacy, especially because many of these technologies are for sale to consumers who might use them for nefarious purposes.
For instance, say you’re worried your significant other is cheating on you; you could secretly record him or her and run the recording through an ANN (which costs around $1,000) to find out whether they were lying. That’s a bit extreme, but as researchers develop more sophisticated systems such as Microsoft’s Skype Translator, it’s something we should consider before we start talking in front of our computers all day long. How would you feel if your computer knew what you said?
How would you feel if everyone else’s did too? If you think about it from a different perspective, we already allow people access to our private conversations—our doctors, lawyers and therapists all listen in on our problems—so why should it be any different for computers? Perhaps because they won’t give us advice afterwards.
However, they will process what we tell them without bias and then make their own decisions based on that information, something human beings are notoriously bad at doing. It’s one thing to hear your doctor tell you you’re fat, but it’s another thing entirely if he starts calculating how much weight-loss surgery will cost and how much time you’ll need off work to recover.
Image Processing
Speech recognition involves computers recognizing human language and responding accordingly. Without it, most of today’s computing devices would be useless; imagine having to type out a message when you could simply speak and have it understood.
How can computers understand human language? Image processing is at the heart of it. Image processing describes how computers apply mathematical operations, such as pattern recognition and feature detection, to visual media like photos and videos. Speech is not visual on its own, but it can be converted into visual representations of sound, and those representations have a unique set of characteristics that present real challenges for programs attempting to discern meaning from sound waves.
To make sense of speech, computers use algorithms to interpret signals from audio files. These signals come in two forms: waveforms and spectrograms. A waveform plots the amplitude of the recorded signal over time; a spectrogram is a graphical representation of the same recording that shows which frequencies are present at each moment, with intensity displayed in varying shades of color.
It’s these graphical representations that let image processing algorithms pick out features like volume and pitch, essential elements in understanding what someone is saying. When combined with more advanced techniques such as machine learning, these algorithms enable voice-activated applications like Siri and Alexa to turn what we say into actionable commands.
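To make the idea concrete, here is a minimal sketch of turning a recorded waveform into the kind of two-dimensional spectrogram array that image processing algorithms operate on. It assumes NumPy and SciPy are available, and the file name and window size are illustrative, not taken from any particular system.

```python
# A minimal sketch: waveform in, spectrogram "image" out.
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Load a mono recording (hypothetical file name).
sample_rate, samples = wavfile.read("hello.wav")

# Compute the spectrogram: frequency content over time,
# exactly the kind of 2-D array image processing algorithms work on.
frequencies, times, intensity = spectrogram(samples, fs=sample_rate, nperseg=512)

# Convert power to decibels so quiet and loud features are both visible.
intensity_db = 10 * np.log10(intensity + 1e-10)

print(intensity_db.shape)  # (num_frequency_bins, num_time_frames): a 2-D "picture" of speech
```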
Audio Waveforms
When you talk, your voice generates sound waves that have a certain shape. In fact, if you had a really powerful microphone and a really fast computer, you could record those sound waves, save them as an audio file, and then play them back on your computer or smartphone. Humans can hear those audio files just fine.
But computers need something called an analog-to-digital converter before they can make sense of audio. That’s because a microphone captures sound as a continuous, analog signal, whereas digital devices work with discrete pieces of information, such as the individual pixels or numbers in an image file. So how do we get from recording human speech to understanding what someone is saying? It all starts with converting waveforms into numbers.
This process is known as digitization, and it involves sampling waveforms many times per second. The more samples you take, the more accurate your resulting digital model will be—but it will also take up more storage space on your hard drive or in memory.
To balance accuracy with storage space, engineers typically sample speech at around 8 kilohertz (8 kHz), the rate used for telephone audio. By the Nyquist theorem, sampling at 8 kHz can only capture frequencies up to 4 kHz. That is enough for intelligible speech, even though humans can hear sounds between roughly 20 Hz and 20 kHz, which is why high-fidelity music is sampled at 44.1 kHz or more.
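As a rough sketch of what digitization looks like in practice, the snippet below samples a simple tone at 8 kHz and quantizes it to 16-bit integers. The tone frequency, duration and bit depth are illustrative assumptions, not values from the article.

```python
# A small sketch of digitization: sampling a continuous tone into discrete numbers.
import numpy as np

sample_rate = 8_000   # 8 kHz, typical for telephone-quality speech
duration = 2.0        # seconds (illustrative)
tone_hz = 440         # an A4 note, well below the 4 kHz Nyquist limit at 8 kHz

# "Sample" the waveform 8,000 times per second.
t = np.arange(0, duration, 1 / sample_rate)
waveform = np.sin(2 * np.pi * tone_hz * t)

# Quantize to 16-bit integers, as an analog-to-digital converter would.
digital = np.int16(waveform * 32767)

# Storage cost grows linearly with the sample rate.
bytes_needed = digital.size * digital.itemsize
print(f"{digital.size} samples, {bytes_needed} bytes for {duration} s of audio")
```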
Artificial Intelligence
An artificial neural network (ANN) is an interconnected group of nodes, akin to a biological neural network, which processes data in a way loosely modeled on living organisms. ANNs have been studied since the perceptron experiments of the late 1950s and have long been used for image processing, but they were not widely applied to speech recognition until around the 1990s. Since then, progress has been rapid: recognition accuracy improved steadily through the 1990s, and deep learning drove dramatic further gains in the 2010s. In 1997 IBM’s Deep Blue supercomputer beat world chess champion Garry Kasparov in a six-game match, and in 2011 IBM’s Watson computer beat Jeopardy! champions Ken Jennings and Brad Rutter. Nowadays, almost all smartphones ship with some sort of voice recognition software.
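As a toy illustration of the “interconnected nodes” idea, the sketch below runs a single fully connected layer that maps one spectrogram frame to a set of class scores. All sizes and weights are arbitrary placeholders, not a trained speech model.

```python
# A toy forward pass through one layer of "nodes".
import numpy as np

rng = np.random.default_rng(0)

num_features = 257   # e.g., frequency bins from one spectrogram column (illustrative)
num_classes = 40     # e.g., phoneme categories (illustrative)

weights = rng.normal(size=(num_features, num_classes))
bias = np.zeros(num_classes)

def forward(frame):
    """One layer of nodes: weighted sum of inputs followed by a nonlinearity."""
    scores = frame @ weights + bias
    return np.tanh(scores)

frame = rng.normal(size=num_features)  # stand-in for one spectrogram column
print(forward(frame).shape)            # (40,) scores, one per class
```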
The basic principle behind voice recognition technology is simple: a device listens to sound waves through a microphone, converts them into digital signals, analyzes them with algorithms and compares the result against pre-recorded sounds. By matching the patterns it hears against words and phrases stored in its database, the system determines what was spoken. However, complex systems require many hours of recordings; Google’s database reportedly includes over 1 billion words, while Microsoft’s Bing Speech API reportedly contains around 100 million.
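A heavily simplified sketch of that compare-against-stored-recordings step is shown below. Real systems use statistical models or neural networks rather than this nearest-template matching, and the feature extractor, template names and parameters here are purely illustrative.

```python
# A minimal sketch of matching new audio against stored word templates.
import numpy as np
from scipy.signal import spectrogram

def features(samples, sample_rate=8_000):
    """Average the spectrogram over time to get one fixed-length feature vector."""
    _, _, intensity = spectrogram(samples, fs=sample_rate, nperseg=256)
    return intensity.mean(axis=1)

def recognize(samples, templates):
    """Return the stored word whose feature vector is most similar (cosine similarity)."""
    query = features(samples)
    best_word, best_score = None, -1.0
    for word, template in templates.items():
        score = np.dot(query, template) / (
            np.linalg.norm(query) * np.linalg.norm(template) + 1e-10
        )
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Usage sketch (templates would be built from labeled recordings):
# templates = {"yes": features(yes_samples), "no": features(no_samples)}
# word, score = recognize(new_samples, templates)
```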
Conclusion
Speech recognition, a useful tech tool in its own right, is just one of many applications that can benefit from improved image processing. By improving computational imaging’s ability to analyze and interpret images at fast speeds, researchers are helping AI become smarter and more sophisticated than ever.
From face recognition that could make your security system virtually impenetrable to future smart cars with 360-degree vision, there are plenty of benefits in store for consumers around the world once commercialized versions of these technologies start becoming available. While you might not think about it every day, AI has already affected your life. With better image processing, it’ll continue doing so—and much more besides—in ways you probably don’t expect.