Posted on 21 March 2013

Every Friday, iZotope hosts a "tech talk", a presentation on industry-related topics such as signal processing, product design, software engineering, the audio technology market, and more. Last month, we invited Nick Bryan (CCRMA, Stanford) and Gautham Mysore (Adobe Research) to give a talk about their recent work, entitled Source Separation in the Real World.

Audio source separation is hard. Really hard. Especially for real-world signals. Algorithmic advances such as nonnegative matrix factorization (NMF), probabilistic latent component analysis (PLCA), and sparse coding are great, but these algorithms are not powerful enough on their own to separate audio sources as much as we would like. How can we expect them to separate a fifteen-part symphony? We can't. Shoot, a jazz trio is difficult enough.
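
(For the unfamiliar: applied to audio, NMF typically factors the magnitude spectrogram into a small set of spectral basis vectors and their time-varying activations, and each source estimate is built from a subset of those components.) Here is a minimal sketch of that idea in plain NumPy; the rank, iteration count, and toy "spectrogram" are illustrative choices of mine, not anything from the talk.

```python
# Minimal KL-divergence NMF on a magnitude spectrogram (sketch, not a
# production implementation). NumPy only; all values are illustrative.
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-10):
    """Factor a nonnegative matrix V (freq x time) as W @ H using
    multiplicative updates that minimize KL divergence."""
    n_freq, n_time = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_freq, rank)) + eps   # spectral basis vectors
    H = rng.random((rank, n_time)) + eps   # per-frame activations
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H

# Toy usage: random nonnegative data stands in for |STFT(mixture)|.
V = np.abs(np.random.default_rng(1).standard_normal((257, 100)))
W, H = nmf(V, rank=4)

# One "source" is a subset of components, usually turned into a soft mask
# and applied to the mixture spectrogram (or the complex STFT).
source_1 = (W[:, :2] @ H[:2, :]) / (W @ H + 1e-10) * V
```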

What if we add domain-specific knowledge to the algorithm? For example, if we know that the input has musical qualities, maybe we can bias the algorithm by grouping harmonic or percussive content. This idea -- constrained source separation -- is a hot topic in the community.
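
As one concrete illustration of such a constraint, here is a rough sketch of harmonic/percussive splitting via median filtering of the spectrogram, in the spirit of Fitzgerald-style HPSS. It is not from the talk or the papers, and the filter length is an arbitrary choice of mine.

```python
# Sketch: bias separation toward harmonic vs. percussive content by
# median-filtering the magnitude spectrogram along time and frequency.
# NumPy/SciPy only; the kernel size is illustrative.
import numpy as np
from scipy.ndimage import median_filter

def harmonic_percussive_masks(S, kernel=17, eps=1e-10):
    """S is a magnitude spectrogram (freq x time). Harmonic energy is
    smooth across time; percussive energy is smooth across frequency."""
    harm = median_filter(S, size=(1, kernel))   # smooth along time
    perc = median_filter(S, size=(kernel, 1))   # smooth along frequency
    total = harm + perc + eps
    return harm / total, perc / total           # soft masks

S = np.abs(np.random.default_rng(2).standard_normal((257, 200)))
mask_h, mask_p = harmonic_percussive_masks(S)
harmonic_part, percussive_part = mask_h * S, mask_p * S
```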

Unfortunately, in many cases, domain-specific knowledge is still not enough. Some contexts cannot be predicted perfectly. For example, even if we know that the input contains musical content, "music" can mean many things. A constraint that works for, say, hip-hop may not work for rock.

That's when we can benefit from human interaction. However, human interaction is a design and interface challenge, not merely a signal processing challenge. What would a source separation editor look like? How can it maximize separation while minimizing user effort?

Nick Bryan and Gautham Mysore address this question in their recent work, soon to appear at IEEE ICASSP 2013 and ACM IUI 2013. In it, Nick and Gautham present a modification to the NMF/PLCA algorithm that incorporates user feedback. A user selects regions of an audio spectrogram that correspond to one source or another: imagine "painting" on the spectrogram to select, say, a cell phone ring, then painting again to select another source, say, a human voice. The algorithm revises its output (the separated sources), and the user selects additional regions to further improve the result. This process repeats until the user is satisfied, e.g., until the human voice and the cell phone ring are cleanly separated.
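
To make the workflow concrete, here is a deliberately simplified sketch of that annotate-and-refine loop. The actual papers modify NMF/PLCA to fold the annotations directly into the model, as described above; this sketch instead uses the painted masks to seed per-source NMF dictionaries before a joint refinement pass, so it captures the interaction pattern rather than the authors' exact algorithm. All function names, component counts, and masks are illustrative, and the `nmf()` helper repeats the earlier sketch for self-containment.

```python
# Simplified annotate-and-refine loop (sketch, not the authors' method).
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-10, W_init=None):
    """KL-divergence NMF with multiplicative updates; W can be pre-seeded."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank)) + eps if W_init is None else W_init.copy()
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    return W, H

def separate_with_annotations(V, annotations, n_components=8, eps=1e-10):
    """V: magnitude spectrogram (freq x time).
    annotations: one binary mask per source, 1 where the user painted it."""
    # 1. Learn a small spectral dictionary for each source from the
    #    regions the user painted for that source.
    W_parts = [nmf(V * mask + eps, n_components)[0] for mask in annotations]
    # 2. Refine jointly on the full mixture, starting from those bases.
    W, H = nmf(V, n_components * len(annotations), W_init=np.hstack(W_parts))
    # 3. Soft-mask the mixture with each source's share of the model.
    WH = W @ H + eps
    estimates = []
    for s in range(len(annotations)):
        cols = slice(s * n_components, (s + 1) * n_components)
        estimates.append((W[:, cols] @ H[cols, :]) / WH * V)
    return estimates  # the user listens, paints more, and the loop repeats

# Toy usage: two sources, with two crudely painted regions.
V = np.abs(np.random.default_rng(3).standard_normal((257, 120)))
ring_mask = np.zeros_like(V)
ring_mask[100:120, :] = 1.0    # "cell phone ring" region
voice_mask = np.zeros_like(V)
voice_mask[:80, 40:] = 1.0     # "human voice" region
ring_est, voice_est = separate_with_annotations(V, [ring_mask, voice_mask])
```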

The process is illustrated in the diagram below, reproduced with permission from the ACM IUI 2013 paper. A user annotates portions of the spectrogram that correspond to different sources, and the algorithm reacts accordingly.

Altogether, I like this line of work because it confronts a harsh reality that many academics hate to admit: computers can't do everything. That's okay; for some problems, a minimal amount of user intervention yields results far better than any fully automated solution could ever achieve.