You and a friend can talk to each other at a noisy cocktail party, but computers have been confounded for decades by such complicated noise. New algorithms now allow computers to distinguish individual voices, at times even better than humans can.
In the 1974 movie The Conversation, Harry Caul is monitoring a conversation between a couple as they cross Union Square in San Francisco. A science fiction gadget deciphers the conversation and comes up with a critical line, “He'd kill us if he got the chance.” Is such a gadget possible today?
Significant advances have been made on this “cocktail party problem” in the last ten years. Humans can pick out the voice they are interested in and lock onto it, but computers have extreme difficulty mimicking this skill, especially when two people are talking at once.
Martin Cooke of the University of Sheffield in England, along with Te-Won Lee of the University of California, San Diego, created a “challenge” in 2006: recordings of two speakers were mixed together, and entrants had to separate them and recognize the target speech. The authors of the Scientific American article, John R. Hershey, Peder A. Olsen, Steven J. Rennie, and Andy Aaron, along with Google's Trausti T. Kristjansson, entered the challenge and subsequently developed their own algorithm.
Humans have the most difficulty distinguishing “cocktail party” speech when the target speaker is talking at the same volume or slightly quieter than the masking voice.
At an actual cocktail party, humans have the advantage of having two ears and noticing that different voices come from slightly different directions. A computer instead works from a spectrogram of the speech, and that spectrogram becomes very hard to interpret when two people are speaking simultaneously because their signals overlap and combine.
Computers link together phonemes into strings of sounds that are analyzed and compared for probabilities. For example, “white” has three phonemes: W, AY, and T. There are three possible AY sounds, and the same speaker may use more than one of them.
A computerized speech recognition model also analyzes the probabilities of different sequences of words. The authors used speech from oral instructions for a fictitious board game, for example “Lay blue by B one now,” which follows a command-color-preposition-letter-number-adverb structure used by all speakers at the cocktail party.
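To make that grammar concrete, here is a small Python sketch of such a command template. The slot order matches the article's example sentence, but the word lists themselves are illustrative guesses rather than the actual task vocabulary.

```python
# A sketch of the fixed command grammar described above; the word lists are
# illustrative guesses, not the real vocabulary used in the challenge.
import itertools
import random

grammar = {
    "command":     ["lay", "place", "set"],
    "color":       ["blue", "red", "white", "green"],
    "preposition": ["by", "at", "in"],
    "letter":      ["A", "B", "C", "D"],
    "number":      ["one", "two", "three", "four"],
    "adverb":      ["now", "soon", "please", "again"],
}
slots = ["command", "color", "preposition", "letter", "number", "adverb"]

def random_sentence():
    """Draw one utterance in the fixed slot order every speaker follows."""
    return " ".join(random.choice(grammar[slot]) for slot in slots)

print(random_sentence())   # e.g. "lay blue by B one now"
# total number of sentences the grammar allows
print(sum(1 for _ in itertools.product(*(grammar[s] for s in slots))))
```

Because every speaker's utterance fits this rigid template, the recognizer only ever has to choose among a small, enumerable set of word sequences.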
The full structure of the likely phonemes and the grammatical word sequences is called a “hidden Markov model,” named for the 19th-century Russian mathematician Andrey Andreyevich Markov. This model is “hidden” from the computer because the computer receives only the spectrum, not the phoneme states directly. IBM began this line of research in the 1970s. The search by computer for the likely states involved is called belief propagation.
As an example of belief propagation, imagine a player rolling two dice, one fair and one “loaded.” After analyzing repeated rolls, it becomes possible to infer which face the loaded die favors. That inference is a belief that is refined as the rolls keep coming in. The same technique can be applied to the phonemes of human speech.
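The dice example can be written out directly as Bayesian belief updating. In the sketch below, the bias assigned to the loaded die and the observed sequence of rolls are invented purely for illustration.

```python
# A minimal sketch of belief updating for the loaded-die example.
fair = {face: 1 / 6 for face in range(1, 7)}
loaded = {1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05, 5: 0.05, 6: 0.75}  # favors 6

def update_belief(prior_loaded, roll):
    """Update P(die is loaded) after observing one roll, using Bayes' rule."""
    p_if_loaded = loaded[roll] * prior_loaded
    p_if_fair = fair[roll] * (1 - prior_loaded)
    return p_if_loaded / (p_if_loaded + p_if_fair)

belief = 0.5                       # start undecided
for roll in [6, 6, 2, 6, 6, 6]:    # an example sequence of observed rolls
    belief = update_belief(belief, roll)
    print(f"after rolling {roll}: P(loaded) = {belief:.3f}")
```

Each new roll nudges the belief up or down; with enough rolls the belief converges on whether, and how, the die is loaded.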
For speech recognition in crowded situations, an additional technique called the Viterbi algorithm is used: it keeps only the best sequence leading into each current phoneme state and trims the less probable alternatives, because they are unlikely to be part of the final answer.
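As a rough illustration, here is a minimal Viterbi pass over a toy phoneme model for the word “white.” The states follow the W, AY, T example above, but the probabilities and the “low/mid/high” spectral observations are made-up stand-ins, not values from the authors' system.

```python
# A toy Viterbi sketch over an invented phoneme HMM for "white".
states = ["W", "AY", "T"]
start_p = {"W": 0.8, "AY": 0.1, "T": 0.1}
trans_p = {
    "W":  {"W": 0.5, "AY": 0.5, "T": 0.0},
    "AY": {"W": 0.0, "AY": 0.6, "T": 0.4},
    "T":  {"W": 0.0, "AY": 0.0, "T": 1.0},
}
emit_p = {  # probability of each coarse spectral observation given the phoneme
    "W":  {"low": 0.7, "mid": 0.2, "high": 0.1},
    "AY": {"low": 0.1, "mid": 0.7, "high": 0.2},
    "T":  {"low": 0.1, "mid": 0.2, "high": 0.7},
}

def viterbi(observations):
    """Track only the single best-scoring path into each state at every frame."""
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # keep the best predecessor; every other path into s is pruned
            score, path = max(
                (best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1])
                for prev in states
            )
            new_best[s] = (score, path + [s])
        best = new_best
    return max(best.values())   # best final score and its phoneme sequence

score, path = viterbi(["low", "low", "mid", "mid", "high"])
print(path, round(score, 6))    # e.g. ['W', 'W', 'AY', 'AY', 'T']
```

The pruning is the whole point: at every frame only one surviving path per state is carried forward, so the search never explodes.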
A model for overlapping speech is also used. Suppose the words “three” and “six” are spoken simultaneously. Their phonemes, TH – R – EE and S – IH – K – S, can line up against each other in only a finite number of ways, for example:
three:  TH  TH  R   R   R   R   EE  EE  EE  EE  EE  EE  EE
six:    S   S   S   IH  IH  IH  IH  K   K   S   S   S   -
Thus a matrix of the possible combinations can be built, and a Viterbi algorithm used to determine the phonemes most likely spoken by the desired speaker.
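Under simplifying assumptions (each phoneme lasts at least one frame and trailing silence is ignored), the finite set of combinations can be enumerated directly. The sketch below is only meant to show how small the joint space is for two short, fixed words.

```python
# Enumerate the ways two phoneme strings can be stretched across the same
# number of spectrogram frames (simplified: no trailing silence modeled).
from itertools import combinations

def alignments(phonemes, n_frames):
    """All ways to stretch the phoneme list over n_frames, each phoneme
    lasting at least one frame and appearing in order."""
    results = []
    for cuts in combinations(range(1, n_frames), len(phonemes) - 1):
        bounds = (0,) + cuts + (n_frames,)
        results.append([phonemes[i]
                        for i in range(len(phonemes))
                        for _ in range(bounds[i + 1] - bounds[i])])
    return results

three = ["TH", "R", "EE"]
six = ["S", "IH", "K", "S"]
a = alignments(three, 13)
b = alignments(six, 13)
print(len(a), len(b), len(a) * len(b))   # 66, 220, and 14520 joint combinations
print(a[0])
print(b[0])
```

Even before any pruning, the joint space for two constrained words is only in the tens of thousands, which is why the two-speaker case is tractable.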
In real speech, though, the words used are not restricted to simple commands for moving board pieces, and there may be multiple masking speakers talking at once.
The more speakers there are, the longer it takes a computer to analyze the combinations and extract the targeted speech. With a single speaker, it takes only half a second of computer time to analyze one second of speech. With two speakers it takes 83 minutes to figure out that second of speech; with three speakers, 1.6 years; with four, 16,000 years; with five, 160 million years; and with six, 1.6 trillion years. So this brute-force approach is futile for even moderately complex cocktail party situations.
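Those times follow from a simple back-of-the-envelope calculation, assuming (as the article does later) that each speaker can be in roughly 10,000 sound states, so each additional speaker multiplies the joint search by about 10,000.

```python
# Back-of-the-envelope reproduction of the brute-force timings quoted above.
STATES_PER_SPEAKER = 10_000
BASE_SECONDS = 0.5   # one speaker: half a second of computing per second of speech

for speakers in range(1, 7):
    seconds = BASE_SECONDS * STATES_PER_SPEAKER ** (speakers - 1)
    years = seconds / (3600 * 24 * 365)
    print(f"{speakers} speaker(s): {seconds:.3g} s  (~{years:.3g} years)")
```

Running this reproduces the half second, 83 minutes, 1.6 years, 16,000 years, 160 million years, and 1.6 trillion years quoted in the article.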
An alternative approach is to analyze the speech from the “bottom up” by looking for regions of the spectrogram produced by the same speaker, a technique called computational auditory scene analysis (CASA). Two sounds that begin at the same time probably come from the same speaker, and harmonically related sounds probably come from the same speaker. It is also possible to emphasize regions of the spectrogram that the target speaker favors and the masking speakers do not use. Rather than searching all speech combinations, the target speaker is traced through the spectrogram. An advantage is that fewer assumptions are made about the masking voices. A disadvantage is that regions dominated by a single speaker are rare; the overlapping areas are larger, and it remains difficult to disentangle areas where many people are speaking, especially if they sound similar.
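Here is a toy sketch of the “common onset” cue: time-frequency bins whose energy jumps at the same frame are grouped as likely belonging to one source. The spectrogram below is synthetic, and a real CASA system would of course work from an FFT of recorded audio, but the grouping idea is the same.

```python
# Toy CASA-style grouping by common onset on a synthetic spectrogram.
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_frames = 8, 20
spec = rng.uniform(0.0, 0.1, (n_freq, n_frames))   # quiet background
spec[[1, 3, 5], 4:] += 1.0    # one "voice" turns on at frame 4
spec[[2, 6], 11:] += 1.0      # a second "voice" turns on at frame 11

onset = np.diff(spec, axis=1) > 0.5   # large frame-to-frame energy jumps
groups = {}
for f, t in zip(*np.nonzero(onset)):
    groups.setdefault(t, []).append(f)   # frequency bins sharing an onset

for t, bins in sorted(groups.items()):
    print(f"onset into frame {t + 1}: frequency bins {bins} grouped together")
```

The bins that switch on together are grouped together; the hard part, as noted above, is the large regions where both voices are active at once and no single cue dominates.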
The authors stayed with a top-down analytical method, but it helps to consider the “loopy belief propagation” method. In a Sudoku puzzle, the same rules apply to rows, columns, and sub-boxes, so each given hint rules out that digit elsewhere in its row, column, and sub-box, limiting the possibilities. The limited possibilities create further limits, and the process cascades toward a solution.
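A minimal sketch of that candidate-elimination loop for an ordinary 9x9 Sudoku, just to make the analogy concrete (this is not the authors' code):

```python
# Constraint propagation on a Sudoku grid: each hint removes possibilities
# from its row, column, and sub-box, and forced cells create further hints.
def candidates(grid, row, col):
    """Digits still possible for a cell in a 9x9 grid (0 means empty)."""
    if grid[row][col] != 0:
        return {grid[row][col]}
    used = set(grid[row])                                   # same row
    used |= {grid[r][col] for r in range(9)}                # same column
    br, bc = 3 * (row // 3), 3 * (col // 3)                 # same 3x3 sub-box
    used |= {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
    return set(range(1, 10)) - used

def propagate(grid):
    """Repeatedly fill every cell whose candidate set has shrunk to one digit."""
    changed = True
    while changed:
        changed = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    options = candidates(grid, r, c)
                    if len(options) == 1:
                        grid[r][c] = options.pop()
                        changed = True
    return grid

# tiny demo: the gap in row 0 is forced to 4 by its row alone
grid = [[0] * 9 for _ in range(9)]
grid[0] = [5, 3, 0, 6, 7, 8, 9, 1, 2]
print(candidates(grid, 0, 2))   # {4}
propagate(grid)
print(grid[0])                  # [5, 3, 4, 6, 7, 8, 9, 1, 2]
```

Easy puzzles collapse completely under this loop; harder ones leave ambiguous cells, which is the analogue of a probability landscape with more than one plausible valley.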
Loopy belief propagation can be used to reduce the number of calculations in computer speech analysis. We form an initial estimate of the likely speech of speakers A and B. Then we lock our estimate of speaker B and use a Viterbi algorithm to refine our estimate of speaker A; then we lock A and refine B. We continue this process until our estimates stop changing. With more speakers, we continue the process sequentially around the group. If there are six speakers, each with 10,000 possible sounds, then each pass considers about 60,000 combinations, far fewer than the trillion trillion combinations required by the full mathematical model.
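Schematically, the alternating refinement might look like the sketch below. The names `refine_speakers` and `decode_one` are hypothetical; `decode_one` stands in for a Viterbi pass that re-scores one speaker's hypotheses against the mixed spectrogram while the other speakers' current transcripts are held fixed.

```python
# A schematic sketch of alternating per-speaker refinement (assumed names).
def refine_speakers(initial_guesses, decode_one, max_rounds=20):
    """Re-decode one speaker at a time, holding the current estimates for
    every other speaker fixed, and repeat until no estimate changes."""
    guesses = list(initial_guesses)
    for _ in range(max_rounds):
        changed = False
        for i in range(len(guesses)):
            others = guesses[:i] + guesses[i + 1:]
            new_guess = decode_one(i, others)   # e.g. a Viterbi pass for speaker i
            if new_guess != guesses[i]:
                guesses[i] = new_guess
                changed = True
        if not changed:                         # converged: estimates stopped moving
            break
    return guesses
```

Each sweep over six speakers with 10,000 sounds apiece examines on the order of 6 x 10,000 possibilities, rather than 10,000 raised to the sixth power.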
We could call this an analysis of a “landscape” in which the solution means finding the lowest valley. A pitfall arises when there are two deep valleys separated by a huge hill: whether the analysis ends up in the right valley for a particular speaker depends on where it started. Loopy belief propagation lets the computer keep track of the probabilities of these different regions of the landscape, but for three or more conversants using a large vocabulary, the amount of computation time required is still excessive.
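A toy illustration of the two-valley pitfall, with an invented one-dimensional “landscape”: a purely greedy downhill search ends in whichever valley it starts nearest, not necessarily the deepest one.

```python
# Greedy descent on a made-up landscape with two valleys of different depths.
def landscape(x):
    return min((x - 2) ** 2, (x - 8) ** 2 + 1)   # deep valley at x=2, shallower at x=8

def greedy_descent(x, step=0.1):
    while landscape(x - step) < landscape(x) or landscape(x + step) < landscape(x):
        x = x - step if landscape(x - step) < landscape(x + step) else x + step
    return x

print(greedy_descent(0.0))   # ends near the deep valley at x = 2
print(greedy_descent(9.0))   # gets stuck near the shallower valley at x = 8
```

The same greedy rule gives different answers from different starting points, which is exactly why the starting estimate matters in the speech problem.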
The way two speech inputs combine to form a spectrum of sound is complicated, but there is a general shortcut. Women's speech is louder at high frequencies, and men's speech is louder at low frequencies, which is a very useful rule for analyzing the spectrum. So we can make rational assumptions about who dominates a particular frequency band of the range for each frame (40 milliseconds, or 1/25th of a second). The authors further added a two-step analysis: use the current information to estimate each speaker's contribution by frequency and loudness, then update the beliefs about all the speakers. This additional step greatly reduces the possibility of getting “stuck” in the wrong probability “valley.”
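A sketch of that shortcut, often described as a masking or max approximation: in each narrow frequency band the mixture is roughly as loud as whichever speaker dominates there, so each band can be tentatively assigned to one speaker. The spectra below are invented numbers purely for illustration.

```python
# Per-band dominance sketch: the mixture is approximated by the element-wise
# maximum of the individual spectra, and each band is assigned to the louder
# speaker. The dB values are made up for illustration.
import numpy as np

bands = ["low", "low-mid", "mid", "high-mid", "high"]
speaker_a = np.array([60.0, 55.0, 40.0, 30.0, 20.0])   # louder at low bands
speaker_b = np.array([35.0, 40.0, 45.0, 50.0, 55.0])   # louder at high bands

mixture = np.maximum(speaker_a, speaker_b)              # approximate mixed spectrum
dominant = np.where(speaker_a >= speaker_b, "A", "B")   # who owns each band

for band, level, who in zip(bands, mixture, dominant):
    print(f"{band:>9}: {level:.0f} dB, dominated by speaker {who}")
```

Assigning bands this way gives the algorithm a sensible starting point for each speaker before the belief-updating passes begin.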
The authors have been able to use this advanced algorithm, based on speaker masking, to separate a single speaker from the babble of three others. They succeeded even when the speaker they sought to identify was not speaking the loudest.
The authors note that this analytic system is useful not only in audio analysis, but also has potential to sharpen brain-imaging analysis and to aid the development of intelligent robots. They aver that the “era of robust robotic audio interaction” has begun.
– summarized by the blog author from “Audio Alchemy: Getting Computers to Understand Overlapping Speech” by John R. Hershey, Peder A. Olsen, Steven J. Rennie, and Andy Aaron, Scientific American, April 12, 2011, online at http://www.scientificamerican.com/article.cfm?id=speech-getting-computers-understand-overlapping