Speech to Text via Whisper openAI

bitman wrote on 11/11/2022, 9:40 AM

Some good news: I have been diving into an alternative to "Vegas Pro Speech to Text". Although it is a very fine feature and easy to use, it has some drawbacks. First of all, it is a Vegas 365-only feature (I hope this may change), so people without a subscription will not be able to use it. Like all AI-based tools, it depends on the model, and results may vary. It also lacks a way to tune for quality and language. I suspect it is also more favorable to English.

So here is an alternative: Whisper openAI.

I have created a simple Vegas script that calls Whisper and converts speech to text. Just place the cursor over an event on the timeline and the script will create result files with the text. In a future version I may extend this to create subtitles on the timeline from these result files. Feel free to add this yourself, or to add more of Whisper's capabilities such as quality, language and translation options. Refer to the document on Whisper at the bottom of this post.
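For readers who want to try the same idea outside Vegas, here is a minimal Python sketch of assembling a whisper command line with the optional quality, language and translation settings mentioned above. The function names are hypothetical and this is not how the Vegas script itself is written; it only assumes that the `whisper` command is installed and on the PATH.

```python
import subprocess


def build_whisper_command(audio_path, model=None, language=None, translate=False):
    """Return the argument list for one whisper CLI call (hypothetical helper)."""
    cmd = ["whisper", audio_path]
    if model:
        cmd += ["--model", model]        # e.g. "base", "small", "large"
    if language:
        cmd += ["--language", language]  # e.g. "nl" for Dutch
    if translate:
        cmd += ["--task", "translate"]   # translate the speech to English
    return cmd


def run_whisper(audio_path, **options):
    """Run whisper and raise an error if it exits with a failure code."""
    # A list of arguments (not one long string) avoids manual quoting.
    return subprocess.run(build_whisper_command(audio_path, **options), check=True)
```

For example, `run_whisper("interview.wav", model="small", language="nl")` would start a transcription with the "small" model, provided whisper and its dependencies are installed.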

Latest Update 18/12/2023:

"Whisper Speech To Text v7":

Adds backward compatibility for scripting in Vegas 14, 15 and 16 (the old "Sony" Vegas UI plugin naming). Note: only tested in Vegas 21, not in 14, 15 or 16.

Here is the link to the latest Vegas script called "Whisper Speech To Text v7":

https://www.dropbox.com/scl/fi/bpj52eo7hjoqda78upi9u/Whisper-Speech-To-Text-v7.cs?rlkey=l0elg1wjl3qi4051uuuscwdws&dl=0

Older scripts:

"Whisper Speech To Text v6":

Sets the timeline timecode format to "Time" to prevent subtitle discrepancies on the timeline, and resets it to the user's original preference after the subtitle insert.

Here is the link to the Vegas script called "Whisper Speech To Text v6":

https://www.dropbox.com/scl/fi/5t7pdc16rd3ab65ey2a3r/Whisper-Speech-To-Text-v6.cs?rlkey=y3ueerkpurf6o03amobf3xzkr&dl=0
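SRT files store cue times as hours:minutes:seconds,milliseconds, which is presumably why v6 switches the timeline to the "Time" format before inserting subtitles. The Python sketch below shows how such timestamps can be converted to seconds. The helper names are hypothetical and the Vegas script may handle this differently.

```python
import re

# Matches an SRT timestamp such as 00:01:02,500
_SRT_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def srt_time_to_seconds(stamp):
    """Convert one SRT timestamp to seconds as a float."""
    match = _SRT_TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError("not an SRT timestamp: %r" % stamp)
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_cue_times(timing_line):
    """Split a timing line 'start --> end' into (start, end) in seconds."""
    start, end = timing_line.split("-->")
    return srt_time_to_seconds(start), srt_time_to_seconds(end)
```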

"Whisper STT RAW v2" (variant):

Same script as v6, except that it keeps the original sentence layout of the .srt file "as is", without automatically adding a newline (like a word wrap) after 9 words.

Here is the link to the Vegas script called "Whisper STT RAW v2":

https://www.dropbox.com/scl/fi/nagvbfttmeus4axu88aon/Whisper-STT-RAW-v2.cs?rlkey=ok9lx6bfs9sqso62gqgp9hqbf&dl=0
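The word wrap that the regular script applies and the RAW variant skips can be sketched in a few lines of Python. This is only an illustration of "a newline after every 9 words"; the function name is hypothetical and the actual script may differ in detail.

```python
def wrap_every_n_words(text, words_per_line=9):
    """Insert a newline after every `words_per_line` words."""
    words = text.split()
    lines = [
        " ".join(words[i:i + words_per_line])
        for i in range(0, len(words), words_per_line)
    ]
    return "\n".join(lines)
```

A subtitle of 9 words or fewer comes back on a single line, so short cues look the same in both variants.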

====================================================================================

The only caveat is that it takes quite a bit of effort to get Whisper installed: it depends on Python, Git, FFmpeg, etc., and on setting environment variables. So you need to install a bunch of supporting software before you can use Whisper, but it is doable. For this purpose, I have put together a document on how to install and use Whisper (and the programs it depends on); it has all the links to get you up and running.

Here is the link to the document on Whisper openAI:

https://www.dropbox.com/s/dh62ripb58xth86/AI%20whisper.docx?dl=0
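After following the install document, a quick way to check whether the tools can be found through the PATH environment variable is a small Python check like the one below. The list of command names is an assumption based on the dependencies named above; adjust it to your own setup.

```python
import shutil

# Command names assumed from the dependencies mentioned in this post.
REQUIRED_TOOLS = ("python", "git", "ffmpeg", "whisper")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `print(missing_tools())` from a command prompt prints an empty list when everything is reachable; any name it prints still needs to be installed or added to the PATH.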

==================================================================================

Update history

Update 18/12/2023: "Whisper Speech To Text v7"

Adds backward compatibility for scripting in Vegas 14, 15 and 16 (the old "Sony" Vegas UI plugin naming). Note: only tested in Vegas 21, not in 14, 15 or 16.

Update 14/12/2023: "Whisper Speech To Text v6" & "Whisper STT RAW v2":

Sets the timeline timecode format to "Time" to prevent subtitle discrepancies on the timeline, and resets it to the user's original preference after the subtitle insert.

Update 13/12/2023: "Whisper Speech To Text v5":

Compatibility update: solves the subtitle insert failure some users have because Whisper names its result file differently on different systems:
for some, whisper saves filename + media type extension + .srt
for others, whisper saves filename + .srt (without the media type extension .wav, .mp4, etc.)
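The two naming patterns above can be handled by simply trying both. Here is a minimal Python sketch of that idea, with hypothetical helper names; it is not taken from the Vegas script.

```python
from pathlib import Path


def srt_candidates(media_path):
    """Return both places where whisper may have written the .srt file."""
    media = Path(media_path)
    return [
        media.with_name(media.name + ".srt"),  # clip.mp4 -> clip.mp4.srt
        media.with_suffix(".srt"),             # clip.mp4 -> clip.srt
    ]


def find_srt(media_path):
    """Return the first candidate that exists on disk, or None."""
    for candidate in srt_candidates(media_path):
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` instead of opening a fixed file name lets the caller show a clear message when no subtitle file was produced at all.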

On request here is a variant of the "Whisper Speech To Text v5" script called "Whisper STT RAW v1":

Same script as v5, except that it keeps the original sentence layout of the .srt file "as is", without automatically adding a newline (like a word wrap) after 9 words.

Update 11/12/2023: "Whisper Speech To Text v4":

Small update, but also a big improvement and bug fix for some users: supports speech to text when the audio media is not located on the same drive as the Vegas project.

Update 10/12/2023: "Whisper Speech To Text v3":

Very small update to support audio filenames with spaces in their names
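Problems with spaces in file names and with media on another drive are commonly avoided by passing each argument separately and by telling whisper explicitly where to write its output. The Python sketch below illustrates this with whisper's `--output_dir` option. The helper name and the example path are hypothetical, and this is not necessarily how v3 and v4 solve it.

```python
from pathlib import PureWindowsPath


def whisper_args_for(audio_path):
    """Build whisper arguments that write the results next to the audio file."""
    audio = PureWindowsPath(audio_path)
    # Each list element is passed as one argument, so spaces need no quoting.
    return ["whisper", str(audio), "--output_dir", str(audio.parent)]
```

The resulting list can be handed directly to `subprocess.run`.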

Update 29/11/2022: "Whisper Speech To Text v2":

Major update: I made a new, improved script with a UI to select different transcription model options, a translate option, and an option to import the subtitles from the generated files onto a new track.

11 November 2022: Original script "Whisper Speech To Text":

Have fun!

Comments

Subtitler22 wrote on 12/17/2022, 7:56 AM

@bitman thank you so much for the Whisper instructions. I got it to work with a bit of struggle, but it was totally worth the effort. I am transcribing using cmd, and it seems to take a lot of resources, with high CPU usage. I had to restart a few times and start again. I will probably split long videos into small parts. Not using --model seems to work best. My PC has an Intel i5, so perhaps I need to upgrade to an i7 or i9.

bitman wrote on 12/19/2022, 4:21 AM

@Subtitler22 I am glad you like it. If I am not mistaken, using the default (= not using the --model argument) uses the "small" model, which is a step higher (= better, but slower) than the "base" model. The higher the model chosen, the slower the process and the more VRAM required, but the better the accuracy.

In my v2 script (Whisper Speech To Text v2) I use the model "small" as the "balanced" selection; I guess I could just have omitted --model in that case.

Subtitler22 wrote on 12/19/2022, 6:13 AM

@bitman I have tested a few languages with --model small and they seem to be accurate enough. With --model base they are just not acceptable. My problem now is that my computer, running an i5, is just not good enough to handle Whisper and Python. I need to restart and not use any other applications apart from cmd or PowerShell and the Task Manager to check resource usage.
Are there any recommendations for which CPU and GPU to use in a new computer build to cope with --model large, just in case I want to cover all possible options for Whisper?
How does Vegas Pro 20 365 Speech To Text (non-English languages) compare to Whisper with regard to accuracy? Perhaps it is cheaper to just pay for the subscription instead of building a new computer.

bitman wrote on 12/20/2022, 3:41 AM

@Subtitler22 It is definitely cheaper to subscribe to Vegas 365 than to build a new PC. On the other hand you will obviously enjoy working with Vegas and other applications more with better hardware.

On the subject of accuracy for foreign languages, I can say from my own experience with Dutch (the Flemish, Belgian variant of Dutch) that Whisper is far superior to what Vegas 365 offers (via Microsoft Azure). In fact, the automatic language detection on Vegas 365 often fails with an error popup after analysis. I have to specifically indicate Dutch (Belgian) for it to work on 365, and the result is worse than Whisper's default --model small.

The fact alone that you cannot tune the accuracy in the current Vegas 365 speech to text feature is often a deal breaker.

There may also be a privacy concern: if I am not mistaken, Vegas 365 uploads your speech to external servers (probably Microsoft's) so it can be processed in the cloud. The obvious benefit is that the heavy burden of transcription falls on their powerful hardware rather than on your own limited hardware. The downside is that it taxes your network and raises a privacy concern, as your speech is on their servers.

Whisper, on the other hand, downloads the model once and then processes everything locally, which obviously taxes your own system much more than Vegas 365, as you have already experienced.

Subtitler22 wrote on 12/20/2022, 5:20 AM

@bitman Thanks a lot for the detailed information. I realized yesterday that I was using my CPU (Intel i5) for the transcription and NOT the GPU (mine is an NVIDIA GT 1030 2GB), which is supposed to be faster.
To make the GPU do the work, I had to use PyTorch and add --device CUDA in order for it to work with the GPU.
With 2GB, I got it to work really fast on the tiny and base models but with the model small (needs 4GB) it didn't work. In order to use --model small I had to add --device cpu.

In short, instead of building a new computer, I will probably buy a GPU with at least 4GB. I can always use the new GPU on a new PC build in the future.

bitman wrote on 12/20/2022, 7:32 AM

@Subtitler22 I must thank you for the CUDA idea (using the GPU to speed things up). Until now I always ran whisper via the command line or the Whisper Vegas script without the extra argument "--device cuda". I vaguely recall reading something about acceleration, but I did not pursue it; I was happy that whisper worked "as is" in the first place. I am looking into it.

Subtitler22 wrote on 12/20/2022, 9:26 AM

@bitman credit goes to this YouTube video. I think I will buy a 12GB GPU to cover all options.

bitman wrote on 12/20/2022, 10:04 AM

@Subtitler22 Here is an update on the use of CUDA. Some observations:

The extra argument --device CUDA in capitals is wrong; you have to use --device cuda (lowercase).
You need to have PyTorch installed to make use of cuda.
If you have PyTorch installed, you do not need the argument --device cuda for whisper, as it will use PyTorch and cuda by default; this means I do not have to change the current script (v2) to enjoy the GPU acceleration.
If you have PyTorch installed and still want to use the CPU, you can use --device cpu.
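The observations above can be summarized in a small Python sketch that picks the value for --device: cuda when a CUDA-capable PyTorch is available, cpu otherwise, with an explicit request taking priority. The function is hypothetical; lowercasing the requested value is a convenience added here, since the command line itself expects lowercase.

```python
def choose_device(requested=None, cuda_available=None):
    """Pick the value to pass to whisper's --device argument."""
    if cuda_available is None:
        try:
            import torch  # only needed for the automatic check
            cuda_available = torch.cuda.is_available()
        except ImportError:
            cuda_available = False
    if requested is not None:
        device = requested.strip().lower()  # "CUDA" becomes "cuda"
        if device == "cuda" and not cuda_available:
            raise RuntimeError("cuda requested, but no CUDA-capable PyTorch was found")
        return device
    return "cuda" if cuda_available else "cpu"
```

Calling `choose_device("cpu")` forces the CPU even when a GPU is present, which matches the last observation in the list.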

After having installed PyTorch, the whisper acceleration with cuda is impressive:

I ran a quick test on my 18s Dutch audio sample on my PC, with PyTorch installed, using "--model large" (= "Best" in my script):

Without GPU acceleration (with argument --device cpu): 109 seconds

Default, with or without the argument --device cuda: 18 seconds (6x faster)

bitman wrote on 12/20/2022, 12:08 PM

@Former user @joelsonforte.br

I redid and updated the "kingfisher" benchmark, which you can find a bit earlier in this post, after installing PyTorch; the speed improvement is spectacular.

Subtitler22 wrote on 12/26/2022, 10:51 AM

@bitman @joelsonforte.br @Former user

I ran into a problem using whisper when there was a long section with no speech: when there was speech again, it just didn't transcribe it and kept repeating the last transcribed text.
The help output lists an option, --no_speech_threshold, which has a default value of 0.6.
After experimenting with lower values, down to 0.275, this seems to help get it back on track. It took longer to transcribe, but that was a small price to pay to get it working again.

If you get into a similar situation just add --no_speech_threshold 0.275 or any other values that might work for you.
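To notice this problem without reading the whole subtitle file, one could scan the generated .srt for the same text repeated many times in a row and, if that happens, rerun whisper with a lower --no_speech_threshold. Below is a minimal Python sketch of such a check. The helper names and the repeat limit of 3 are arbitrary assumptions, not part of whisper.

```python
def cue_texts(srt_text):
    """Return the text of each subtitle cue in an SRT document."""
    texts = []
    for block in srt_text.replace("\r\n", "\n").strip().split("\n\n"):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:  # the timing line; the text follows it
                texts.append(" ".join(part.strip() for part in lines[i + 1:]))
                break
    return texts


def longest_repeat(texts):
    """Return the length of the longest run of identical consecutive texts."""
    best = run = 0
    previous = None
    for text in texts:
        run = run + 1 if text == previous else 1
        best = max(best, run)
        previous = text
    return best


def looks_stuck(srt_text, max_repeats=3):
    """True if one text repeats more than `max_repeats` times in a row."""
    return longest_repeat(cue_texts(srt_text)) > max_repeats
```

A legitimate transcript can also contain repeated lines (a chorus, for example), so a positive result is only a hint to check that part of the audio.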

RogerS wrote on 2/6/2023, 7:56 PM

Would this new app be easier to integrate into Vegas than the current mix of files? It's called WhisperDesktop, has sourcecode available and here's a video of it in use:

I'm getting 30fps on a NVIDIA 1050 which is so fast.

Dave-Wallin-Eddy wrote on 2/15/2023, 11:19 PM

I had high hopes for WhisperDesktop, but I tried it on 3 different computers and it simply crashes when loading the "models". On the other hand, StoryToolkitAI is working fine. If code could be added to it to detect Vegas and not just Resolve, it would be great(er).

RogerS wrote on 2/15/2023, 11:57 PM

Interesting, perhaps load the models manually? I have it working on a GTX 1050 (mobile) and RTX 2080 (desktop). Const-Me has also now been integrated into the latest Subtitle Edit beta. https://github.com/SubtitleEdit/subtitleedit/releases

Subtitler22 wrote on 2/22/2023, 1:52 PM

I have been busy trying to figure out how to improve my use of Whisper AI. I thought that if I could get only the vocals from a video file, without any surrounding noise or music, this might help to produce better transcriptions. I found a useful feature in iZotope RX 10 Standard called Music Rebalance that can isolate the vocals from the other sounds. I did some tests with audio files and it definitely makes an improvement.

Dave-Wallin-Eddy wrote on 2/23/2023, 9:23 PM

@bitman

FYI... your v2 of the script seems to load fine. It opens the GUI and lets me select what I want, but it "ends" almost instantly. Where does your script save the srt file? It is not in the same directory as the video file. Maybe it is something I am (not) doing?

And this is the error:

---------------------------------------------------------
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.IO.FileNotFoundException: Could not find file 'I:\IS2013.mp4.srt'.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
   at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options, String msgPath, Boolean bFromProxy, Boolean useLongPath, Boolean checkHost)
   at System.IO.StreamReader..ctor(String path, Encoding encoding, Boolean detectEncodingFromByteOrderMarks, Int32 bufferSize, Boolean checkHost)
   at System.IO.StreamReader..ctor(String path, Encoding encoding)
   at EntryPoint.MakeLinkedList(Vegas myVegas, String myPathPlusFileName)
   at EntryPoint.FromVegas(Vegas vegas)
   --- End of inner exception stack trace ---
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
   at System.Reflection.RuntimeMethodInfo.UnsafeInvokeInternal(Object obj, Object[] parameters, Object[] arguments)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at ScriptPortal.Vegas.ScriptHost.ScriptManager.Run(Assembly asm, String className, String methodName)
   at ScriptPortal.Vegas.ScriptHost.RunScript(Boolean fCompileOnly)

fr0sty wrote on 2/23/2023, 9:43 PM

There's something in the works soon that will enable all users of VEGAS, sub or perpetual, to access this and other VEGAS Hub features.

bitman wrote on 2/24/2023, 5:48 AM

@Dave-Wallin-Eddy The srt file is saved in the same directory as the audio source; however, I was able to reproduce your issue when placing the source audio on a drive other than the C-drive (consequently the srt file fails to save). I see you have your video (or audio) source in the root of the "I" drive, which is different from a folder on the C: drive. I tested the script with audio sources in folders on the C-drive. Possibly the script, or the software installed to make whisper work, has issues when the source is not on the C-drive...

Try putting your source audio in a folder on the C-drive. Also, make sure you try again if it does not work the first time; it may need time to download the models first.

Dave-Wallin-Eddy wrote on 2/24/2023, 10:26 PM

Yes, after I posted this I did move the file to the C drive, but the exact same error occurred.

bitman wrote on 2/25/2023, 7:42 AM

@Dave-Wallin-Eddy Maybe a silly question on my part, but did you install all the rest that is needed for whisper to work? The Vegas whisper script is only the "hook" in Vegas to provide input for whisper (and if needed insert subtitles in Vegas).

You need to install a lot of extra stuff such as FFmpeg, Python, Git and whisper (via Git) itself. All is explained in the document (ref. link at the beginning of this post).

Even if you have installed all the executables, do not forget to add them to the PATH environment variable, so the system can find them when they are called from the folder that holds your audio.

Former user wrote on 3/3/2023, 10:58 PM

@bitman would you know if whisper can separate people, even if currently not implemented but the data is there?

Bob: What a nice day!

John: It sure is Bob!

Claire: What a day to be alive!

It is the only reason I use Premiere for captions and transcripts involving more than 1 person.

@RogerS I tried that version you're using. I found it to be very fast but gave the most errors, how have you found it and what model are you using now?

RogerS wrote on 3/3/2023, 11:23 PM

Hi @Former user I don't know about separating people. I now have 3 iterations of Whisper on my system in SubtitleEdit (OpenAI standalone, Const-me and CPP). You could ask on any of their GitHub pages.

I found the Const-me one useful for English with the medium or large model, and quick enough even on my laptop GPU. I recently did a 10-minute video whose subtitles I had been procrastinating on for years, and it was close to perfect right in SubtitleEdit.

For Japanese it messes up and repeats lines too much. Others reported the same on GitHub so I'm hopeful there's an update.

At the moment all these implementations seem to be in flux so I'm hopeful there will be bugfixes forthcoming.

Former user wrote on 3/3/2023, 11:40 PM

@RogerS I tried it with translation a number of times, I get the same. It stops translating and repeats the same line. Glad it's a known problem that will be addressed soon. That can also occur with the Resolve version, just not as frequently.

RogerS wrote on 3/4/2023, 3:33 AM

I don't know if it will be addressed soon, Const.me doesn't seem to be in active development and this bug was likely inherited from CPP, which will hopefully address it. I downloaded the third option though haven't really tested it.

Subtitle Edit is ready to go for these implementations so whenever we get a fixed or functional new one it can be added.

bitman wrote on 3/4/2023, 5:13 AM

@Former user Not that I recall. I do know that you can separate the voices of different people in iZotope RX 10 Advanced. Maybe run the audio through RX 10 first, separate out the person you need, and then use whisper...

section at 8:23 (text navigation and multi speaker detection):