HappyOtter BatchWhisperAI Speech to Text

wwaag wrote on 9/5/2023, 1:39 PM

I've just written a new free tool that I want to share, Batch WhisperAI. Here's the current dialog.

It stems largely from this thread https://www.vegascreativesoftware.info/us/forum/speech-to-text-via-whisper-openai--137928/?page=1

and a specific request by @bitman https://www.vegascreativesoftware.info/us/forum/speech-to-text-via-whisper-openai--137928/?page=1#ca863446 and also my need to quickly create a text file from lots of audio recordings that I had made earlier this year documenting my road trip to Alaska.

Here is a DropBox link where the app can be downloaded. The folder also includes the WhisperAI models and installation instructions. https://www.dropbox.com/sh/qpjskuyq6bw6i2f/AAAjKPuBid2ggK2gzh4z29Nka?dl=0 For those with an Nvidia card, the app automatically selects GPU rather than CPU processing. Note that GPU processing is available only for Nvidia cards, and you must download and install the required CUDA files. Here is a link to Whisper-faster, the app that actually does the speech-to-text processing. https://github.com/Purfview/whisper-standalone-win

And finally, here's a short demo I prepared using some video files I created some years ago. Note that the UHD demo is also available in the DropBox folder and can be downloaded for better viewing.

I will eventually add this tool to the HappyOtter Free Tools Library. Comments and suggestions for improvements are welcome. Enjoy.

ADDED: 2023-09-12

The Batch WhisperAI app has now been included in the latest version of HappyOtterScripts. For ease of installation, just run the setup file--no paid license is required. https://www.vegascreativesoftware.info/us/forum/happy-otter-scripts-for-vegas-pro--113922/?page=44#ca891774

 

Last changed by wwaag

AKA the HappyOtter at https://tools4vegas.com/. System 1: Intel i7-8700k with HD 630 graphics plus an Nvidia RTX4070 graphics card. System 2: Intel i7-3770k with HD 4000 graphics plus an AMD RX550 graphics card. System 3: Laptop. Dell Inspiron Plus 16. Intel i7-11800H, Intel Graphics. Current cameras include Panasonic FZ2500, GoPro Hero11 and Hero8 Black plus a myriad of smartPhone, pocket cameras, video cameras and film cameras going back to the original Nikon S.

Comments

John-Ivar wrote on 9/5/2023, 3:42 PM

Amazing work! This is very intriguing to me, I will check it out.

mark-y wrote on 9/5/2023, 5:18 PM

I got this error.

wwaag wrote on 9/5/2023, 5:26 PM

@mark-y

There is no zip file in the DropBox folder. Don't know where you got this file. Did you ask DB to create a zip perhaps? If so, don't. Given the large size of some of the files, it would take a very long time anyway.


mark-y wrote on 9/5/2023, 5:31 PM

It gave two links -- Dropbox and Download.

But I don't have Dropbox on this machine, so I'll install it and let you know. Looking forward to trying this!

mark-y wrote on 9/5/2023, 6:02 PM

@wwaag Sent you a PM

mark-y wrote on 9/6/2023, 9:44 AM

@wwaag Got it working and it's great! Thanks!

wwaag wrote on 9/6/2023, 9:07 PM

I've uploaded a new version of Batch WhisperAI. Aside from a few bug fixes, it now has the option to force use of CPU for processing for those with Nvidia cards. This would enable users to try it without having to download the additional Nvidia library files.

However, use of an Nvidia GPU really speeds up processing. On an older system (i7-3770K with a 1050 Ti), processing of a roughly 40-minute MKV file was reduced from 5:10 to 1:27. That is significantly faster than CPU processing on my i7-8750K, which has only Intel graphics and took 2:14.
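To put the timings above into ratios, here is a small Python sketch. It assumes the figures are minutes:seconds and that 5:10 is the CPU-only time on the same i7-3770K system; the variable names are only for illustration.

```python
def to_seconds(clock: str) -> int:
    """Convert an 'm:ss' string such as '5:10' to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)

# Timings quoted in the post, assumed to be minutes:seconds.
cpu_3770k = to_seconds("5:10")    # older system, assumed CPU only
gpu_1050ti = to_seconds("1:27")   # same system using the 1050 Ti
cpu_newer = to_seconds("2:14")    # newer Intel-only system, CPU only

gpu_speedup = cpu_3770k / gpu_1050ti
vs_newer_cpu = cpu_newer / gpu_1050ti

print(f"GPU vs CPU on the older system: {gpu_speedup:.2f}x")
print(f"GPU on the older system vs the newer CPU: {vs_newer_cpu:.2f}x")
```

Under those assumptions the GPU run is roughly 3.5 times faster than the CPU run on the same machine, and about 1.5 times faster than the newer CPU.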

 


wwaag wrote on 9/7/2023, 2:14 PM

You can now download the WhisperAI models directly from the HOS website. Here's the link. https://tools4vegas.com/download-whisperai-models/

To simplify installation, the next build of HappyOtterScripts (hopefully this weekend) will include the Batch WhisperAI app, base model, and the required Nvidia library files. You only need the FREE license to use it. Once installed, you can add additional models by downloading from the above links.

Last changed by wwaag on 9/7/2023, 2:23 PM, changed a total of 1 times.


mark-y wrote on 9/7/2023, 5:09 PM

The 'base' model you suggested is working just fine in my early tests. Is there any advantage to trying the larger ones?

wwaag wrote on 9/7/2023, 8:10 PM

@mark-y

Good question. The advantage of the larger models should be improved accuracy, although at the price of increased processing time. Just ran a 44-minute video I did some years ago that included a voiceover on my i7-8750K, CPU only. As expected, processing time increases with the size of the model. There are also wide differences in the number of subtitles that are created, with the larger models producing more. Here are the numbers (processing time, then number of subtitles).

Tiny: 1:26 - 253
Base: 2:12 - 243
Small: 6:02 - 362
Medium: 16:40 - 331
Large-V2: 30:19 - 382

If I were to actually add the subtitles, then the Large-V2 would be my choice although it would still be necessary to "proof" the text. One thing for sure, WhisperAI does not understand Welsh (nor does it claim to). But for just general applications, like creating text from voice memos on your iPhone, the base model would be adequate. Bottom line--if you're happy with the base model, stay with it.
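For anyone who wants to compare the models at a glance, here is a short Python sketch that expresses each timing as a multiple of the base model. The times and subtitle counts are copied from the list above; reading the times as minutes:seconds is an assumption, and the function names are just for illustration.

```python
# (model, processing time as m:ss, subtitle count) copied from the post.
RESULTS = [
    ("Tiny", "1:26", 253),
    ("Base", "2:12", 243),
    ("Small", "6:02", 362),
    ("Medium", "16:40", 331),
    ("Large-V2", "30:19", 382),
]

def to_seconds(clock):
    """Convert an 'm:ss' string to seconds (assumed format)."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)

def relative_to(results, baseline="Base"):
    """Return {model: processing time divided by the baseline model's time}."""
    base = next(to_seconds(t) for name, t, _ in results if name == baseline)
    return {name: to_seconds(t) / base for name, t, _ in results}

ratios = relative_to(RESULTS)
for name, clock, subtitles in RESULTS:
    print(f"{name:<9} {clock:>6}  {ratios[name]:5.1f}x base  {subtitles} subtitles")
```

On these numbers Large-V2 takes a little under 14 times as long as Base on CPU, which is worth weighing against how much proofing the smaller model's output would need.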

 


RogerS wrote on 9/7/2023, 8:16 PM

I use it for transcribing interviews and pretty much only use large as the computer's time is worth less than mine. Large does much better with proper nouns in English.

In Japanese overall quality for all the models is lower so I think you'd want to stick with larger ones.

With NVIDIA CUDA the processing times aren't so bad, even with a 10XX series card.

wwaag wrote on 9/7/2023, 8:40 PM

"Large does much better with proper nouns in English."

+1. E.g., "Chugach Mountains" in Large, which is correct, vs. "Chugak" in Base.


mark-y wrote on 9/7/2023, 8:58 PM

Thank you both! I just want to transcribe Zoom and Google Meet-ings in English. But my nephew has a large vocabulary. I'll keep Large in mind if words get fudged.

RogerS wrote on 9/7/2023, 9:15 PM

Even for people and place names you'll likely notice a difference. But as wwaag points out, it depends on the goal for the transcription: just notes for yourself, or something that needs to be accurate because it will be shared.

wwaag wrote on 9/7/2023, 9:31 PM

One anomaly that I've seen when using large models is the failure to punctuate correctly, break sentences, and capitalize correctly. Here's a very simple example of one of my Zoom recordings after a stop in Jade City, BC. Here's the short voice recording.

And the subtitles when using large-v2.

1
00:00:00,440 --> 00:00:08,460
just stopped at the Jade store in
Jade City for a look about all kinds of

2
00:00:08,460 --> 00:00:15,540
jewelry and stuff but boy it is
very very expensive so anyway I remember

3
00:00:15,540 --> 00:00:20,740
stopping here 20 years
ago so it hasn't changed much

Turns out that others have experienced the same problem. Here's a link to a discussion of the problem on the developer's GitHub. https://github.com/Purfview/whisper-standalone-win/discussions/45

Like the originator, I too found that going to a simpler model solved the problem, although not all the time.
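If you want to check a whole subtitle file for this problem rather than eyeballing it, a rough Python sketch like the one below could help. It is not part of Batch WhisperAI or Whisper-faster. It parses SRT text into cues and lists the cues that contain no punctuation at all. The sample is the text quoted above (cues numbered 1 to 3, as SRT expects), and the function names and the choice of punctuation characters are assumptions for illustration.

```python
import re

SAMPLE = """\
1
00:00:00,440 --> 00:00:08,460
just stopped at the Jade store in
Jade City for a look about all kinds of

2
00:00:08,460 --> 00:00:15,540
jewelry and stuff but boy it is
very very expensive so anyway I remember

3
00:00:15,540 --> 00:00:20,740
stopping here 20 years
ago so it hasn't changed much
"""

def parse_srt(text):
    """Return a list of (index, timing_line, cue_text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip blocks without an index, a timing line and text
        cues.append((int(lines[0]), lines[1], " ".join(lines[2:])))
    return cues

def has_punctuation(cue_text):
    """True if the cue contains any sentence punctuation (apostrophes ignored)."""
    return re.search(r"[.,!?;:]", cue_text) is not None

cues = parse_srt(SAMPLE)
unpunctuated = [index for index, _, text in cues if not has_punctuation(text)]
print(f"{len(unpunctuated)} of {len(cues)} cues have no punctuation: {unpunctuated}")
```

A file where most cues show up in that list is probably worth re-running with a different model before proofing by hand.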

Last changed by wwaag on 9/7/2023, 9:34 PM, changed a total of 1 times.


wwaag wrote on 9/12/2023, 2:56 PM

The Batch WhisperAI app has now been included in the latest version of HappyOtterScripts. For ease of installation, just run the setup file--no paid license is required. https://www.vegascreativesoftware.info/us/forum/happy-otter-scripts-for-vegas-pro--113922/?page=44#ca891774


AH-Bueno wrote on 10/2/2024, 2:50 PM

I have a large interview that my camera cut up into 4 GB files. I'd like to process the whole interview by dragging multiple files to Batch WhisperAI. I was wondering whether it would continue the timecode, as if the files were in a series on the timeline, if I check 'Join Text Files'. However, it appears to process only the first file, even though at the end it says 2 media files were processed. Only one is processed, and nothing is joined.
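As a workaround sketch for the timecode question (and not a description of how 'Join Text Files' works internally), subtitles from a later clip can be shifted by the combined length of the clips before it. The Python below offsets every timing line in an SRT text. The 12 min 30 s offset is a made-up clip length, and renumbering the cues after joining is left out.

```python
import re
from datetime import timedelta

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift_timestamp(match, offset):
    """Add offset to one 'HH:MM:SS,mmm' timestamp and format it again."""
    h, m, s, ms = (int(group) for group in match.groups())
    total = timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms) + offset
    total_ms = total // timedelta(milliseconds=1)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"

def shift_srt(text, offset):
    """Shift only the timing lines (those containing '-->') by offset."""
    shifted = []
    for line in text.splitlines():
        if "-->" in line:
            line = TIMESTAMP.sub(lambda m: shift_timestamp(m, offset), line)
        shifted.append(line)
    return "\n".join(shifted)

# Hypothetical: the first clip ran 12 min 30 s, so the second clip's
# subtitles start that much later on the timeline.
second_clip = "1\n00:00:00,440 --> 00:00:08,460\njust stopped at the Jade store"
print(shift_srt(second_clip, timedelta(minutes=12, seconds=30)))
```

The offset for each clip would be the summed duration of all clips before it, which you can read off the Vegas timeline.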

AH-Bueno wrote on 10/2/2024, 3:31 PM

I found that it works smoothest when I render with DebugMode FrameServer and feed that to WhisperAI. It works pretty fast.

wwaag wrote on 10/5/2024, 12:11 AM

@AH-Bueno

I presume that your 4 GB files include video and are not audio-only. If so, my suggestion is to add all your files to the Vegas timeline and render just the audio in WAV format. Then input the rendered WAV file into Batch WhisperAI. Doing it this way removes the need to input multiple files, although that feature does work in recent testing.

I'm curious. When using DMFS, do you simply input the signpost AVI file? I'll have to try it. Thanks for the info.


AH-Bueno wrote on 10/6/2024, 8:46 AM

Inputting multiple files didn't work for me; it only processed one. But DMFS worked surprisingly smoothly. I will keep working like that, as it saves the trouble of making an intermediate render.

With my RTX 4060 Ti it subtitled 45 minutes of interview (which might have been edited; it makes no difference) in a bit more than 3 minutes. I simply fed Batch WhisperAI the rendered .avi file, not the signpost.