Google's new AI is able to pick out voices in a crowd

Google works out a fascinating, slightly scary way for AI to isolate voices in a crowd

You may not know exactly where a friend is standing, but his or her voice has a pattern you can pick out immediately despite the noisy crowd around you. To teach a machine the same trick, the researchers extracted segments of speeches and lectures that had no background sound and a single person in view, then used them to generate videos of cocktail-party-style environments, with non-speech background noise taken from AudioSet.

The Google development team trained a neural network model to recognize individual people's voices.

According to the blog post, the researchers developed this model by gathering 100,000 videos of "lectures and talks" on YouTube, extracting almost 2,000 hours' worth of segments featuring unobstructed speech, then mixing that audio to create a "synthetic cocktail party" with artificial background noise added. Isolating a single speaker in this way would be particularly convenient for videos recorded in noisy environments, where the speaker can be hard to hear.
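The mixing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's actual pipeline: the function name `mix_cocktail_party`, the `noise_gain` value and the sine-wave stand-ins for real speech clips are all hypothetical.

```python
import numpy as np

def mix_cocktail_party(speech_a, speech_b, noise, noise_gain=0.3):
    """Sum two clean speech clips and scaled background noise into one mixture.

    noise_gain is an assumed illustrative value, not a documented setting.
    """
    # Trim everything to the shortest clip so the arrays line up sample by sample.
    n = min(len(speech_a), len(speech_b), len(noise))
    clean_a, clean_b = speech_a[:n], speech_b[:n]
    mixture = clean_a + clean_b + noise_gain * noise[:n]
    # The clean clips are returned too: they are what a model would learn to recover.
    return mixture, (clean_a, clean_b)

# Synthetic stand-ins for real recordings (hypothetical data).
sample_rate = 16000
t = np.arange(sample_rate * 3) / sample_rate
speech_a = np.sin(2 * np.pi * 220 * t)                      # 3-second "speaker A"
speech_b = np.sin(2 * np.pi * 330 * t[: sample_rate * 2])   # 2-second "speaker B"
noise = np.random.default_rng(0).normal(size=sample_rate * 3)

mixture, targets = mix_cocktail_party(speech_a, speech_b, noise)
print(mixture.shape)
```

Because the clean clips are known before they are mixed, every synthetic "party" comes with its own answer key, which is what makes this kind of data useful for training.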

In Google's test clip, the system separates the voices of two stand-up comedians from the crowd (and from each other) by recognising their faces and generating an audio track for each individual's speech. The model looks at the movements of a person's mouth and correlates them with the sounds produced as that person speaks.
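To make the idea of correlating mouth movement with sound concrete, here is a toy Python sketch. It is not how Google's neural network works internally; it simply assumes we already have separated audio tracks and a per-video-frame "mouth opening" value for each face, and pairs each track with the face whose mouth motion best follows the track's loudness. All names (`frame_energy`, `match_track_to_speaker`) and the synthetic data are hypothetical.

```python
import numpy as np

def frame_energy(audio, samples_per_frame):
    """Root-mean-square loudness of the audio for each video frame."""
    n_frames = len(audio) // samples_per_frame
    frames = audio[: n_frames * samples_per_frame].reshape(n_frames, samples_per_frame)
    return np.sqrt((frames ** 2).mean(axis=1))

def match_track_to_speaker(tracks, mouth_openings, samples_per_frame):
    """For each audio track, return the index of the best-correlated face."""
    matches = []
    for audio in tracks:
        energy = frame_energy(audio, samples_per_frame)
        # Pearson correlation between loudness and each face's mouth opening.
        scores = [np.corrcoef(energy, mouth[: len(energy)])[0, 1]
                  for mouth in mouth_openings]
        matches.append(int(np.argmax(scores)))
    return matches

# Toy data: 2 seconds of 16 kHz audio alongside 25 fps video (50 frames).
sample_rate, fps = 16000, 25
samples_per_frame = sample_rate // fps
mouth_a = np.r_[np.ones(25), np.zeros(25)]   # face A talks during the first second
mouth_b = np.r_[np.zeros(25), np.ones(25)]   # face B talks during the second
t = np.arange(sample_rate * 2) / sample_rate
audio_a = np.sin(2 * np.pi * 220 * t) * np.repeat(mouth_a, samples_per_frame)
audio_b = np.sin(2 * np.pi * 330 * t) * np.repeat(mouth_b, samples_per_frame)

# Tracks are passed in the "wrong" order to show that the matching recovers the pairing.
matches = match_track_to_speaker([audio_b, audio_a], [mouth_a, mouth_b], samples_per_frame)
print(matches)
```

A learned model can pick up far subtler cues than loudness, but the principle is the same: the picture tells the system which sounds belong to which face.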

"The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper), but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video," the researchers write.

The AI keeps tracking a person and their voice even when the face is obscured by a waving hand or a microphone. The experiment also suggests that new features built on this technology could soon reach users. China could easily implement something like this on a mass scale, given how it has already been using facial recognition technology to identify law-breakers in the country.