Getting Computers to Understand Overlapping Speech

You have little trouble hearing what your companion is saying in a noisy cafe, but computers are confounded by this “cocktail party problem.”

New algorithms finally enable machines to tune in to the right speaker, sometimes even better than humans can.

The year is 1974, and Harry Caul is monitoring a couple walking through a crowded Union Square in San Francisco. He uses shotgun microphones to secretly record their conversation, but at a critical point, a nearby percussion band drowns out the conversation. Ultimately Harry has to use an improbable gadget to extract the nearly inaudible words, “He’d kill us if he got the chance,” from the recordings.

This piece of audio forensics was science fiction when it appeared in the movie The Conversation more than three decades ago. Is it possible today?

Sorting out the babble from multiple conversations is popularly known as the “cocktail party problem,” and researchers have made many inroads toward solving it in the past 10 years. When several people talk at once, human listeners can selectively tune out all but the speaker of interest. Machines, by contrast, have been notoriously unreliable at recognizing speech in the presence of noise, especially when the noise is background speech. Speech recognition technology is increasingly common and is now used for dictating text and commands to computers, phones and GPS devices. But good luck getting anything but gibberish if two people speak at once.

A flurry of recent research has focused on the cocktail party problem. In 2006, Martin Cooke of the University of Sheffield in England and Te-Won Lee of the University of California, San Diego, organized a speech separation “challenge,” a task designed to compare different approaches to separating and recognizing the mixed speech of two talkers. Since then, researchers around the world have built systems to compete against one another and against the ultimate benchmark: human listeners.
