AI Tool for Multimodal Voice Separation

I am looking for an AI tool to separate two speaking voices. Because one of the voices is very hard to understand, I am aiming for a multimodal approach in the hope of recovering what that person was saying. I have a video that shows only one speaker talking (this is the voice that matters to me and that is hard to understand in the audio) and, of course, the audio channel. An audio-only tool might do the job too, but as I said, the voice is really hard to understand.

I already tried using VisualVoice by Meta:
https://github.com/facebookresearch/VisualVoice

But I just can't get the code to work. There seem to be unresolvable version conflicts between the required modules, CUDA, and Python.

Please let me know if you know of an alternative or have managed to get VisualVoice working :)

submitted by /u/captain_nikolaus