Google is pretty great at figuring out what a user is saying, but is it any good at knowing who's saying it? Just look at current smart speaker technology, which can be easily fooled.
Google might have a pretty simple solution, however. Its researchers have created a deep learning system that can single out individual voices. It does this by literally looking at people's faces while they're talking.
First, the researchers trained their system to recognize individual people speaking alone. Then they created virtual noise, adding other voices to make a fake crowd, to teach the artificial intelligence to separate the mixed audio into distinct tracks and recognize which voice belongs to which speaker.
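The training-data trick described above, summing clean single-speaker recordings into a synthetic "crowd" while keeping the originals as targets, can be sketched in a few lines. This is a minimal illustration of the general idea, not Google's actual pipeline; the function name, parameters, and sine-wave stand-ins for speech are all assumptions for the example.

```python
import numpy as np

def make_synthetic_mixture(clean_tracks, noise_level=0.0, seed=0):
    """Sum clean single-speaker tracks into one 'fake crowd' signal.

    The mixture becomes the model's input; the stacked clean tracks
    remain the per-speaker separation targets. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    length = min(len(t) for t in clean_tracks)          # align lengths
    targets = np.stack([t[:length] for t in clean_tracks])
    mixture = targets.sum(axis=0)                        # overlap the voices
    if noise_level > 0:
        mixture = mixture + noise_level * rng.standard_normal(length)
    return mixture, targets

# Two synthetic "speakers": pure tones stand in for real speech clips.
t = np.linspace(0, 1, 16000)
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = np.sin(2 * np.pi * 330 * t)
mixture, targets = make_synthetic_mixture([speaker_a, speaker_b])
```

A real system would pair each mixture with the speakers' face video as well, since the visual signal is what lets the model decide which voice to pull out.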
The results are astounding. As seen in the video below, the AI is able to separate the voices of two stand-up comedians even when their speech overlaps, and it does this just by looking at their faces. The trick works even if the comedians' faces are only partially visible, such as when one is partly blocked by a microphone.