Speech to Text via Whisper openAI

bitman wrote on 11/11/2022, 9:40 AM

Some good news: I have been diving into an alternative to "Vegas Pro Speech to Text". Although it is a very fine feature and easy to use, it has some drawbacks. First of all, it is a Vegas 365-only feature (I hope this may change), so people without a subscription cannot use it. Like all AI-based stuff, it depends on the model and results may vary. It also lacks a way to tune for quality and language, and I suspect it favors English.

So here is an alternative: Whisper openAI.

I have created a simple Vegas script to call Whisper and convert speech to text. Just place the cursor over an event on the timeline and the script will create result files with text. In a future version I can extend this to create subtitles on the timeline from these result files. Feel free to add this yourself, or to add more of Whisper's capabilities such as quality, language and translation options. Refer to the document on Whisper at the bottom of this post.

Latest Update 18/12/2023:

"Whisper Speech To Text v7":

adds backward compatibility for scripting in Vegas 14, 15 and 16 (the old "Sony" Vegas UI plugin naming). Note: only tested in Vegas 21, not in 14, 15 or 16.

Here is the link to the latest Vegas script called "Whisper Speech To Text v7":

https://www.dropbox.com/scl/fi/bpj52eo7hjoqda78upi9u/Whisper-Speech-To-Text-v7.cs?rlkey=l0elg1wjl3qi4051uuuscwdws&dl=0

older scripts:

"Whisper Speech To Text v6":

sets the timeline timecode format to "Time" to prevent subtitle discrepancies on the timeline, and resets it to the user's original preference after the subtitle insert

Here is the link to the Vegas script called "Whisper Speech To Text v6":

https://www.dropbox.com/scl/fi/5t7pdc16rd3ab65ey2a3r/Whisper-Speech-To-Text-v6.cs?rlkey=y3ueerkpurf6o03amobf3xzkr&dl=0

"Whisper STT RAW v2" (variant):

same script as v6, except that it keeps the original sentence layout of the .srt file "as is", without automatically adding a newline (like a word wrap) after 9 words.

Here is the link to the Vegas script called "Whisper STT RAW v2":

https://www.dropbox.com/scl/fi/nagvbfttmeus4axu88aon/Whisper-STT-RAW-v2.cs?rlkey=ok9lx6bfs9sqso62gqgp9hqbf&dl=0

====================================================================================

The only caveat is that it takes quite a bit of effort to get Whisper installed: it depends on Python, Git, FFmpeg, etc., and on setting environment variables. So you need to install a bunch of supporting stuff before you can use Whisper, but it is doable. For this purpose I have put together a document on how to install and use Whisper (and the programs it depends on); it has all the links to get you up and running.

Here is the link to the document on Whisper openAI:

https://www.dropbox.com/s/dh62ripb58xth86/AI%20whisper.docx?dl=0

==================================================================================

update history

Update 18/12/2023: "Whisper Speech To Text v7"

adds backward compatibility for scripting in Vegas 14, 15 and 16 (the old "Sony" Vegas UI plugin naming). Note: only tested in Vegas 21, not in 14, 15 or 16.

Update 14/12/2023: "Whisper Speech To Text v6" & "Whisper STT RAW v2":

sets the timeline timecode format to "Time" to prevent subtitle discrepancies on the timeline, and resets it to the user's original preference after the subtitle insert

Update 13/12/2023: "Whisper Speech To Text v5":

Compatibility update: solves the subtitle insert failure some users have because Whisper names its output file differently on different installs:
for those where Whisper saves filename + media type extension + .srt
for those where Whisper saves filename + .srt (without the media type extension .wav, .mp4, etc.)

On request here is a variant of the "Whisper Speech To Text v5" script called "Whisper STT RAW v1":

same script as v5, except that it keeps the original sentence layout of the .srt file "as is", without automatically adding a newline (like a word wrap) after 9 words.

Update 11/12/2023: "Whisper Speech To Text v4":

Small update (but also a big improvement and bug fix for some users) to support speech to text when the audio media is not located on the same drive as the Vegas project.

Update 10/12/2023: "Whisper Speech To Text v3":

Very small update to support audio filenames with spaces in their names

Update 29/11/2022: "Whisper Speech To Text v2":

Major update: I made a new improved script with UI to select different transcode model options, a translate option, and a UI option to import the subtitles from the generated files to a new track

11 November 2022: Original script "Whisper Speech To Text":

Have fun!


Comments

Former user wrote on 3/4/2023, 9:03 PM

Subtitle Edit is ready to go for these implementations so whenever we get a fixed or functional new one it can be added

It's great software, I hope if they make the change to GPU processing for whisper, it doesn't get the same problems as the GPU whisper versions.

Maybe run it by RX10 first, separate out the person you need, and use whisper next.

Great idea! 👍

RogerS wrote on 3/5/2023, 1:22 AM

For a non-Python Whisper that runs on CPU or GPU, you can grab it here: https://github.com/Purfview/whisper-standalone-win

It's not working in SubtitleEdit at the moment but works from the command prompt (run cmd as admin). It doesn't seem to have repeated lines.

Save it somewhere, put ffmpeg.exe in the folder with whisper.exe, and change the command prompt folder to there ("cd C:\Whisper\" for example). Try this as a template (you can change the language, location and model type):

whisper.exe --device cuda --language en --model "base" "C:\Videos\video name.mp4"

Former user wrote on 3/14/2023, 1:53 AM

This is the whisper variant I'm using currently: https://github.com/Dadangdut33/Speech-Translate/releases/tag/1.1.0

It seems pretty good. In this example it created subtitles for a 3 min video using large dictionary in 1 minute (RTX 3080). It gets things almost perfect until about 2 minutes in, where the timing begins to be affected. I thought others interested in translation could use this as a barometer of sorts, and even download this video and compare the app version they're using. Russian has historically been difficult for whisper to do a good job at. If the whisper version you're using does a better job, let us know.

This uses GPU, maybe only Nvidia. It has no integration with any NLE.

Former user wrote on 3/27/2023, 11:31 PM

I tried the new version of StoryToolKit (Nvidia GPU only) by downloading the video in my last message and re-translating it. Top subs are the new translation. https://github.com/octimot/StoryToolkitAI/releases/tag/v0.17.16

It doesn't have the same timing problems seen with Speech-Translate, but on the negative side its formatting is not as good: instead of using multiple shorter sentences it seems to like to form paragraphs. Neither option is perfect, but StoryToolKit in standalone mode (for Vegas users) is possibly the better choice; you just need to break up the subtitle paragraphs manually where required.

If your whisper translator does a better job, please share

wwaag wrote on 9/5/2023, 1:43 PM

Just wrote a new Batch WhisperAI Speech to Text tool and created a new thread. Here's the link https://www.vegascreativesoftware.info/us/forum/happyotter-batchwhisperai-speech-to-text--142423/

joelsonforte.br wrote on 12/9/2023, 11:47 PM

@bitman

I was doing some tests with your script and noticed some points of improvement.

1. Transcription only works if the files are on the C:\ drive. If the files are on another drive, D:\ for example, transcription will not work.

2. The script does not work when the file name contains spaces. For example: the file name "Video 01.mp4" must be "Video01.mp4", "Video_01.mp4" or "Video-01.mp4" in order to be processed correctly.

3. After the transcription process finishes, an error occurs when trying to add the SRT file to the Timeline as Text Events.

This error occurs because the script is looking for a different file name than the file generated by Whisper. For example: Whisper generates an SRT output file called "Video_01.srt" but the script is looking for a file called "Video_01.mp4.srt" (the original file extension .mp4 is being added to the file name).

See my screen recording to understand better.

Can you please correct the script when you can, or let me know what needs to be changed to fix this.

Thanks!

bitman wrote on 12/10/2023, 9:09 AM

@joelsonforte.br I have adapted the script to support spaces in the audio filenames. It is in version 3, which you can download from the start of this post. It is just a one-line change, so you could also adapt the v2 script yourself (around line 148):

sw.WriteLine("whisper " + myFile + modelOption); //temp remove for speed testing rest of APP

to wrap the filename in quotes by adding + "\"" on both sides of it, so that a space in the name no longer cuts the argument off prematurely:

sw.WriteLine("whisper " + "\"" + myFile + "\"" + modelOption); //temp remove for speed testing rest of APP
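
The quoting fix above can be tried outside Vegas with a small stand-alone sketch. This is not the script's actual code: the class and method names are made up for illustration, and it assumes modelOption starts with a space (for example " --model base"), as the concatenation in the original line suggests.

```csharp
using System;

static class WhisperCommandSketch
{
    // Wraps a path in double quotes so the command line treats it as ONE argument,
    // even when the file name contains spaces.
    public static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Builds the line that would be written to the batch file.
    // Assumption: modelOption already begins with a space, e.g. " --model base".
    public static string Build(string mediaPath, string modelOption)
    {
        return "whisper " + Quote(mediaPath) + modelOption;
    }

    static void Main()
    {
        // Prints: whisper "D:\Clips\Video 01.mp4" --model base
        Console.WriteLine(Build(@"D:\Clips\Video 01.mp4", " --model base"));
    }
}
```

Without the quotes, the same call would produce whisper D:\Clips\Video 01.mp4 --model base, which the command prompt splits into two separate file arguments.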

joelsonforte.br wrote on 12/10/2023, 2:51 PM

Thanks @bitman. Item 2 is solved. Can you look at items 1 and 3 when you can?

1. Transcription only works if the files are on the C:\ drive. If the files are on another drive, D:\ for example, transcription will not work.

3. After the transcription process finishes, an error occurs when trying to add the SRT file to the Timeline as Text Events.

This error occurs because the script is looking for a different file name than the file generated by Whisper. For example: Whisper generates a srt output file called "Video_01.srt" but the script is looking for a file called "Video_01.mp4.srt" (The original file extension .mp4 is being added to the file name).

See my screen recording to understand better.

bitman wrote on 12/11/2023, 7:30 AM

@joelsonforte.br

Version 4 should fix your issues! See post start.

Latest Update 11/12/2023:

I made a small update (but also a big improvement and bug fix for some users @Joelson) to support speech to text when the drive location of the audio media is not located on the same drive as the Vegas project.

By the way, having the speech to text media and the Vegas project together on the same drive, even a drive other than C:, did work in the previous versions (I tested this, hence some confusion), but apparently it did not work when the Vegas project itself was on a different drive than the media...

joelsonforte.br wrote on 12/11/2023, 12:54 PM

@bitman Thank you for this additional fix.

Now only item 3 is missing.

3. After the transcription process finishes, an error occurs when trying to add the SRT file to the Timeline as Text Events.

This error occurs because the script is looking for a different file name than the file generated by Whisper. For example: Whisper generates a srt output file called "Video_01.srt" but the script is looking for a file called "Video_01.mp4.srt" (The original file extension .mp4 is being added to the file name).

See my screen recording to understand better.

It is strange that the script looks for the name of a file other than the file generated by Whisper to insert the SRT file as a text event in the timeline. This is why the error below occurs.

Whisper generates > File name + SRT Extension
The script searches > File name + File extension + SRT Extension

To fix this, the script needs to look for the correct file generated by Whisper. In this case: File name + SRT Extension.

I'm on the second day trying to find a solution for this but without success. 😂😂😂

bitman wrote on 12/11/2023, 2:29 PM

@joelsonforte.br Strange, but filename + file type extension + SRT extension is what the script expects. On my PC, Whisper generates exactly that, the script uses it, and it just works...

joelsonforte.br wrote on 12/11/2023, 2:57 PM

@bitman

It really is strange. See my screen recording. The file generated by Whisper does not have the extension of the original file. It's just the original File Name + .srt Extension.

The Script message shows that it is looking for File Name + File Extension + .srt Extension.

I never changed the Whisper settings. Maybe it's a language problem.

Do you know how I can change the script so it works for me in this situation? I've tried ChatGPT, GitHub, Stack Overflow and almost everywhere else on the internet, and I haven't figured out how to modify the script to make it work for me. 😂😂😂

bitman wrote on 12/11/2023, 3:19 PM

I will have a look tomorrow for a specific solution for you if possible, it is getting late in Belgium's timezone!

joelsonforte.br wrote on 12/11/2023, 7:39 PM

Thanks @bitman

I'll be anxiously waiting and hoping there's a solution. Your script is very good and very well written. It shows each step in detail and I'm learning a lot by reading its code.

bitman wrote on 12/12/2023, 7:47 AM

@joelsonforte.br I have version v5 of the script ready. It should work for both our machines:

for those where Whisper saves filename + media type extension + .srt
for those where Whisper saves filename + .srt (without the media type extension .wav, .mp4, etc.)

Not sure why Whisper behaves differently; maybe you have another version, or a different install of all the stuff that is needed to make Whisper work.

Anyway, the solution was to copy the .srt file that lacks the media type extension to an .srt file that has it, so that the rest of the script works. This is only done when the expected file does not exist; the name is rebuilt by stripping .srt and reconstructing full path + media type + .srt.
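
For readers who want to see the idea in isolation, here is a minimal sketch of such a fallback copy. It is not the v5 script's code: the helper name is hypothetical, and it assumes Whisper wrote its .srt file into the same folder as the media file.

```csharp
using System;
using System.IO;

static class SrtFallbackSketch
{
    // mediaPath is the full path of the media file, e.g. C:\Clips\Video_01.mp4.
    // Returns the name the rest of the script expects (media name + ".srt").
    // If only the shorter name exists (Video_01.srt), it is copied to the expected name.
    public static string EnsureSrtWithMediaExtension(string mediaPath)
    {
        string expected = mediaPath + ".srt";                        // Video_01.mp4.srt
        string shortName = Path.ChangeExtension(mediaPath, ".srt");  // Video_01.srt

        if (!File.Exists(expected) && File.Exists(shortName))
        {
            File.Copy(shortName, expected);
        }
        return expected;
    }

    static void Main()
    {
        // Demo in a temporary folder: only the short-named .srt exists at first.
        string dir = Path.Combine(Path.GetTempPath(), "srt-fallback-demo");
        Directory.CreateDirectory(dir);
        string media = Path.Combine(dir, "Video_01.mp4");
        File.WriteAllText(Path.ChangeExtension(media, ".srt"),
            "1\r\n00:00:00,000 --> 00:00:01,000\r\nHello\r\n");

        string srt = EnsureSrtWithMediaExtension(media);
        Console.WriteLine(srt + " exists: " + File.Exists(srt));
    }
}
```

Copying rather than renaming leaves Whisper's original output untouched, so both naming conventions end up on disk.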

joelsonforte.br wrote on 12/12/2023, 8:42 AM

@bitman Uhuuuuu!

It works perfectly fine now!!! You're one of those people who gets it right the first time. Congratulations! It looks great!

I have a small question: how do I disable the option that automatically splits long subtitles onto a new line after 9 words?

I know how to configure Whisper to use the native options --max_line_width and --max_line_count, so I won't need this option.

bitman wrote on 12/12/2023, 9:08 AM

@joelsonforte.br You owe me a beer!

Around line 649 in the v5 script (open it with the free Notepad++) you will see the following:

if (spaces == 9) //seems optimal for ENGLISH

You can replace 9 with a higher number; this will allow more spaces in the line of text (a crude method used to detect sentence length) before a newline is issued.
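
To make the effect of that number concrete, here is a small sketch of a space-counting word wrap. It is only an assumption about how such a counter could work, not a copy of the script's loop; the method name and the maxSpaces parameter are invented for the example.

```csharp
using System;
using System.Text;

static class WordWrapSketch
{
    // Counts spaces while copying the text; every maxSpaces-th space is
    // replaced by a newline and the counter starts again.
    public static string WrapAfterSpaces(string text, int maxSpaces)
    {
        var sb = new StringBuilder();
        int spaces = 0;
        foreach (char c in text)
        {
            if (c == ' ')
            {
                spaces++;
                if (spaces == maxSpaces)
                {
                    sb.Append('\n'); // break the line here instead of writing the space
                    spaces = 0;
                    continue;
                }
            }
            sb.Append(c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // With 9, the line breaks after the ninth word.
        Console.WriteLine(WrapAfterSpaces(
            "one two three four five six seven eight nine ten eleven", 9));
    }
}
```

A larger value gives longer lines, and a value higher than the number of spaces in any subtitle leaves the text unwrapped.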

joelsonforte.br wrote on 12/12/2023, 8:44 PM

@bitman See the screenshot below.

Track 01 holds the Text Events created using the native Vegas "Import Subtitles from File" option. (Vegas uses the original lengths from the srt file.)

Track 02 holds the Text Events created using the script. (Apparently, in some situations the script changes the length of the Text Events created from the srt file.)

How do I make the script create Text Events with the original lengths from the srt file, without modifications, so that each one has exactly the same duration as in the srt file? Is it possible?

jetdv wrote on 12/12/2023, 9:26 PM

@joelsonforte.br, One thing you probably need to do is make sure your timeline timecode format matches the SRT file (i.e. Time). See if that makes a difference. If it does, the script can change to that format and then change back to the current format at the end as we did with the other scripts you were working with.

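public RulerFormat OrgRulerFormat; // remembers the user's ruler format

            // before inserting the subtitles: switch the ruler to "Time"
            OrgRulerFormat = myVegas.Project.Ruler.Format;
            myVegas.Project.Ruler.Format = RulerFormat.Time;
            myVegas.UpdateUI();

                // after the insert: restore the user's format
                myVegas.Project.Ruler.Format = OrgRulerFormat;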
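
As background to the "Time" setting: cue times in an SRT file are plain hh:mm:ss,mmm time values rather than frame-based timecode. The sketch below is not part of any of the scripts in this thread, and its helper names are hypothetical; it only shows how such a timing line can be read with standard .NET calls.

```csharp
using System;
using System.Globalization;

static class SrtTimeSketch
{
    // Parses one SRT timestamp such as "00:01:02,500" (hours:minutes:seconds,milliseconds).
    public static TimeSpan ParseSrtTime(string s)
    {
        return TimeSpan.ParseExact(s.Trim(), @"hh\:mm\:ss\,fff", CultureInfo.InvariantCulture);
    }

    // Splits a cue timing line "start --> end" into its start and end times.
    public static void ParseCueLine(string line, out TimeSpan start, out TimeSpan end)
    {
        string[] parts = line.Split(new[] { "-->" }, StringSplitOptions.None);
        start = ParseSrtTime(parts[0]);
        end = ParseSrtTime(parts[1]);
    }

    static void Main()
    {
        TimeSpan start, end;
        ParseCueLine("00:00:01,500 --> 00:00:04,000", out start, out end);
        // The cue duration is simply end minus start.
        Console.WriteLine("start " + start + ", duration " + (end - start));
    }
}
```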

joelsonforte.br wrote on 12/13/2023, 5:41 AM

@jetdv

The srt file is correct. It is only when it is imported as text events by the script that the change occurs. If the same srt file is imported as text events directly by Vegas, everything is normal. I think this occurs because at the time @bitman wrote the script there was a lot of inconsistency in the duration times generated by Whisper, and he tried to correct this as best as possible, but today with updates this practically no longer happens.

I just need to know how to configure the script to import the srt file as text events with the original times. You know I'm kind of dumb with scripts and I get lost without the right guidance.

bitman wrote on 12/13/2023, 8:21 AM

@joelsonforte.br I have added a new script, "Whisper STT RAW v1" (see beginning of post). This is basically the same as the v5 script, but it omits the word wrap optimization after 9 words, and as such keeps the original srt layout.

joelsonforte.br wrote on 12/13/2023, 9:37 AM

Hi @bitman

I tested the Whisper STT RAW script.

I did a little test with Whisper's new "word-level" feature and this was the result. For some reason the script still imports the srt file into the timeline with times that differ from the original. See my screen recording below.

I'm sending the video I used in the test and also the generated files for you to check.

https://drive.google.com/file/d/1ocMZRCJ-kK85aVS2K_Z-Es6TxfiGWLaB/view?usp=sharing

joelsonforte.br wrote on 12/13/2023, 7:44 PM

@bitman

I have great news.

The problem was caused by the time format of the timeline. When I changed the timeline format to "Time" the problem was resolved.

Is it possible to modify the script so that it is not necessary to change the timeline format to Time?

If you want the Whisper STT RAW script it is not necessary. Because this works well in Whisper Speech To Text V5

jetdv wrote on 12/13/2023, 9:17 PM

Here's the changes needed to switch it to "Time" and then back to whatever it was:

https://www.vegascreativesoftware.info/us/forum/speech-to-text-via-whisper-openai--137928/?page=3#ca900398