How to remove music from the first audio using the second audio?

capslk wrote on 1/27/2026, 11:40 AM

Hello, everyone.

There are two audio tracks: the first one contains music, notification sounds, and microphone audio. The second one contains only the same music as the first track.

What needs to be done: remove only the music from the first audio track using the second audio track, while leaving the notification sounds and microphone recording on the first track.
How can this be done in Vegas Pro?
I know about auto-ducking, but that's not quite what I need.

Comments

john_dennis wrote on 1/27/2026, 12:23 PM

@capslk

You might be able to separate the audio by STEM separation.

Link: New Script: Demucs Stem Splitter for Vegas Pro (AI Audio Separation)

or this method:

Link: Shutter Encoder Now Has STEM Creation as an Add-In

Robert Johnston wrote on 1/27/2026, 2:34 PM

@capslk I don't know how well this will work, but maybe try Invert Phase on one of the audio events, for example the one that has all the sounds. When it is played back together with the event that has just the music, the music should subtract out, leaving the other sounds. Of course, the events have to be perfectly synced. You may even need to adjust the gain if you hear bleed-through.

john_dennis wrote on 1/28/2026, 2:14 AM

Here is how the Demucs STEM Splitter script did with your requirements.

capslk wrote on 1/28/2026, 2:30 AM

@Robert Johnston I just tried inverting the phase on the first track, but nothing happened. As I understand it, this function simply swaps the right and left channels.

capslk wrote on 1/28/2026, 2:38 AM

john_dennis wrote: "Here is how the Demucs STEM Splitter script did with your requirements."

Oh, I see that it has identified the notification sound as drums. Okay, I'll try downloading this script now and test it

DMT3 wrote on 1/28/2026, 7:42 AM

capslk wrote: "@Robert Johnston I just tried inverting the phase on the first track, but nothing happened. As I understand it, this function simply swaps the right and left channels."

No, it does not swap channels. It actually inverts the phase of the audio. If you zoom in on the waveforms and invert the phase, you will see that the waveforms flip. For this method to work, the audio on both tracks has to be synced EXACTLY. You can do this by looking at the waveforms and lining them up. This method may or may not work, though, depending on whether there is any drift between the two recordings and whether they are the exact same recording.
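To illustrate why the sync has to be exact, here is a minimal Python sketch with made-up test tones. The sample rate, the 220 Hz "music" and the 2 kHz "ding" are assumptions for the demo, not measurements from the actual files. Inverting just multiplies every sample by -1, so the music only cancels when both copies line up sample for sample.

```python
import math

def invert_phase(samples):
    """Phase (polarity) inversion: every sample is multiplied by -1."""
    return [-s for s in samples]

def mix(a, b):
    """Sum two equally long tracks sample by sample."""
    return [x + y for x, y in zip(a, b)]

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

RATE = 8000  # assumed sample rate for the toy signals

# Toy "music" (220 Hz) and a short toy "ding" (2 kHz, 50 ms long)
music = [0.5 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(RATE)]
ding = [0.3 * math.sin(2 * math.pi * 2000 * n / RATE) if 2000 <= n < 2400 else 0.0
        for n in range(RATE)]
full = mix(music, ding)  # stands in for the track with everything on it

# Perfectly synced: the inverted music cancels and only the ding remains
residue_synced = mix(full, invert_phase(music))

# Music-only track one sample late: the cancellation is incomplete
shifted = [0.0] + music[:-1]
residue_shifted = mix(full, invert_phase(shifted))

print("leftover music, synced:        ", peak(residue_synced[:2000]))
print("leftover music, 1 sample late: ", peak(residue_shifted[:2000]))
```

In this toy case even a single sample of offset leaves audible music behind, and the leftover grows with frequency, so high-pitched content is the first to bleed through.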

john_dennis wrote on 1/28/2026, 10:36 AM

The phase inversion method could work. Aside from the steps @DMT3 mentioned, you also have to balance the mix of the two tracks to get the cancelation that you want. Possibly, you might have to deal with stereo balance, too.
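As a rough sketch of the level-balancing step, assuming the two tracks are already in sync and the music in the full mix is simply a scaled copy of the music-only track, the gain can be estimated with a least-squares fit. The signals and the 0.7 gain below are invented for the demo, and `estimate_gain` is a hypothetical helper, not a Vegas function.

```python
import math

def estimate_gain(full, music):
    """Least-squares gain g that minimizes the energy of full - g * music."""
    num = sum(f * m for f, m in zip(full, music))
    den = sum(m * m for m in music)
    return num / den

def subtract(full, music, gain):
    """Same as mixing in a phase-inverted copy of `music` at level `gain`."""
    return [f - gain * m for f, m in zip(full, music)]

RATE = 8000  # assumed sample rate for the toy signals
music = [0.5 * math.sin(2 * math.pi * 220 * n / RATE) for n in range(RATE)]
ding = [0.3 * math.sin(2 * math.pi * 2000 * n / RATE) if 2000 <= n < 2400 else 0.0
        for n in range(RATE)]

# In the full mix the music sits at 70% of the level of the music-only track
full = [0.7 * m + d for m, d in zip(music, ding)]

gain = estimate_gain(full, music)
residue = subtract(full, music, gain)
print("estimated gain:", round(gain, 3))
```

The estimate is only as good as the assumption that the other sounds are uncorrelated with the music. For stereo material the same fit would have to be done per channel.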

DMT3 wrote on 1/28/2026, 10:38 AM

@john_dennis good demonstration.

rraud wrote on 1/28/2026, 11:25 AM

Hi @capslk, can you upload a few minutes of the audio files to a cloud drive for us to listen to and experiment with?
To extract music stems, voice, noise and such, I always use SpectraLayers Pro (SLP), but it ain't freeware. None of the free apps I have tried work as well (IMO), but they may be sufficient depending on the audio quality, or lack thereof.
BTW, SLP is included with the Sound Forge Pro Suite, though I am not sure that is still the case at this point.

capslk wrote on 1/28/2026, 12:30 PM

rraud wrote: "Hi @capslk, can you upload a few minutes of the audio files to a cloud drive for us to listen and experiment with?"

Yes, of course, here it is - https://drive.google.com/file/d/1js4EJNsz6jZeJXF4eRpMfMP2CDVVnQuE/view?usp=sharing

This five-minute excerpt features music and the most common notifications, which sound like “ding” and “double ding.”
Basically, my job is to cut out absolutely all sounds that resemble notifications. I am given 3-9 hours of audio, and I listen to it and cut it out manually in Vegas. But of course, I would be very happy if this process could be automated.
By the way, my former account on this forum was called @ryan-hall34

john_dennis wrote on 1/28/2026, 1:06 PM

@capslk In your original post, you mentioned a music-only track. If it exists, please include it with your sample file.

WAV is a better format if the post-processing is to be valid.

Howard-Vigorita wrote on 1/28/2026, 1:17 PM

john_dennis wrote: "@capslk In your original post, you mentioned a music-only track. If exists, please include with your sample file. WAV is better for post processing to be valid."

Flac is another good choice. But if you throw it into a zip for download, wav & flac will probably compress to a similar size.

capslk wrote on 1/28/2026, 3:28 PM

john_dennis wrote: "If exists, please include with your sample file."

There are only 3 songs in the five-minute excerpt. I found and downloaded the samples of these songs in FLAC format and put them in a ZIP archive: https://drive.google.com/file/d/1OyUJuH7-yNfUHKE17lW46_Y6KOhBg_fC/view?usp=sharing

john_dennis wrote on 1/28/2026, 5:00 PM

Working with un-synced audio from these songs is likely to be more effort than it's worth.

Howard-Vigorita wrote on 1/28/2026, 10:08 PM

@capslk Unmixed the 3 FLACs with SpectraLayers v11 and pulled the stems into a vp23 project. Got this:

https://drive.google.com/file/d/1vi1eTOm1DGljtgtUTVcGFlyLmF7YqM5o/view?usp=drive_link

Left the default stems checked in the unmix-song module and set the processing level to extreme. It occurs to me that approaching it this way wouldn't make less work unless all the sounds to be removed were isolated to one of the stems. But removing them from stems with volume envelopes would probably give the most transparent results.

rraud wrote on 1/29/2026, 12:58 PM

Below is a screenshot from SpectraLayers Pro. Sine-wave type frequencies are relatively easy to ID in a spectral graphic display. SLP's erase or clone tools can make the annoyance practically inaudible. I did not have time to try the automatic unmix tools. The SLP screenshot is from around 2:12 to 2:20. The fundamental 'beep' frequency is right around 2 kHz, with some clicking harmonics around 5.5 and 8 kHz. Prior to loading the file into SLP, I peak normalized it in Sound Forge to 0 dB because the overall level is very low.
As I stated in my above comment, manually editing the spectral graph is not that easy if one has not had practice, though experience with a pro photo editing app lessens the learning curve.
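Since the beep sits at a fairly fixed frequency, the places to cut could in principle be flagged automatically. Below is a minimal Python sketch of that idea using the Goertzel recurrence to measure how much of each short frame's energy is at one target frequency. The 2 kHz target comes from the measurement above. The sample rate, frame length, threshold, test signal and the `find_tone_frames` helper are all assumptions for the demo and would need tuning on real audio.

```python
import math

def goertzel_power(frame, rate, freq):
    """Squared magnitude of `frame` at `freq` Hz (Goertzel recurrence)."""
    coeff = 2 * math.cos(2 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def find_tone_frames(samples, rate, freq, frame_len, threshold):
    """Return start times (seconds) of frames dominated by a tone at `freq`."""
    hits = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        total = sum(x * x for x in frame) or 1e-12  # avoid dividing by zero
        # Share of the frame's energy at `freq`: about 1.0 for a pure tone
        ratio = goertzel_power(frame, rate, freq) / (total * frame_len / 2)
        if ratio > threshold:
            hits.append(start / rate)
    return hits

RATE = 8000  # assumed sample rate for the toy signal

# Toy signal: 220 Hz "music" with a 50 ms beep at 2 kHz starting at 0.25 s
signal = [0.5 * math.sin(2 * math.pi * 220 * n / RATE)
          + (0.3 * math.sin(2 * math.pi * 2000 * n / RATE) if 2000 <= n < 2400 else 0.0)
          for n in range(RATE)]

hits = find_tone_frames(signal, RATE, freq=2000, frame_len=400, threshold=0.1)
print("beep frames start at (s):", hits)
```

Real music also has energy around 2 kHz, so a detector like this would produce false hits and is better seen as a way to narrow down where to look than as a replacement for listening.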

capslk wrote on 1/30/2026, 9:59 AM

Well, to sum up:

1. Audio phase inversion does not work for me, because the sound waves of the original song and the audio recording do not match (I wonder if there are any plugins for aligning the phase between tracks?).

2. I tested the Demucs script, and it put the notification sounds into different categories: some notifications ended up in drums, some in vocals, some in the “other” category, etc. So no, this doesn't work either. It would be great if there were a model that recognized notification sounds and put them in a separate category.

3. SLP is also unlikely to work. It's easier to go through all the audio in VP and cut out what you need. Although if there are automatic tools for removing sounds in SLP, it might be worth looking into.
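On the alignment question in point 1: a constant time offset between two copies of the same recording can be estimated by cross-correlation, which is the basic idea behind alignment tools. The Python sketch below is a toy illustration with random noise standing in for the music; the `best_lag` helper, the 37-sample delay and the 0.6 level are invented for the demo. It only finds a fixed offset and will not fix drift, and it will not help if the downloaded song is a different master or encode than what was actually playing.

```python
import random

def best_lag(reference, recording, max_lag):
    """Lag (in samples) at which `recording` best matches `reference`.

    A positive result means the recording is delayed relative to the reference.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(reference):
            m = n + lag
            if 0 <= m < len(recording):
                score += r * recording[m]
        if score > best_score:
            best, best_score = lag, score
    return best

random.seed(0)  # fixed seed so the demo is repeatable

# Noise stands in for the music-only track
music = [random.uniform(-1.0, 1.0) for _ in range(2000)]

# The "recording" has the same material, quieter and 37 samples late
delay = 37
recording = [0.0] * delay + [0.6 * s for s in music[:-delay]]

lag = best_lag(music, recording, max_lag=100)
print("estimated offset in samples:", lag)
```

This brute-force loop is fine for a short excerpt. For hours of audio an FFT-based correlation would be the practical choice.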

As I said earlier, I'm currently doing everything manually in VP, and I recently started using a sound spectrogram that I generate through FFmpeg. This spectrogram method was suggested to me by @Howard-Vigorita 7 months ago. At the moment, my workspace in VP looks like this: