Speech to Text via Whisper openAI

bitman wrote on 11/11/2022, 9:40 AM

Some good news: I have been diving into an alternative to "Vegas Pro Speech to Text". Although it is a very fine feature and easy to use, it has some drawbacks. First of all, it is a Vegas 365-only feature (I hope this may change), so anyone without a subscription cannot use it. Like all AI-based tools, it depends on the model, and results may vary. It also lacks a way to tune for quality and language, and I suspect it favors English.

So here is an alternative: Whisper openAI.

I have created a simple Vegas script that calls Whisper to convert speech to text. Just place the cursor over an event on the timeline and the script will create result files containing the text. In a future version I may extend this to create subtitles on the timeline from these result files. Feel free to add this yourself, or to add more of Whisper's capabilities such as quality, language and translation options. Refer to the document on Whisper at the bottom of this post.

Latest Update 18/12/2023:

"Whisper Speech To Text v7":

Adds backward compatibility with the old "Sony" Vegas UI plugin naming used for scripting in Vegas 14, 15 and 16 (note: only tested in Vegas 21, not in 14, 15 or 16)

Here is the link to the latest Vegas script called "Whisper Speech To Text v7":

https://www.dropbox.com/scl/fi/bpj52eo7hjoqda78upi9u/Whisper-Speech-To-Text-v7.cs?rlkey=l0elg1wjl3qi4051uuuscwdws&dl=0

older scripts:

"Whisper Speech To Text v6":

sets timeline timecode format to "Time" to prevent subtitle discrepancies on the timeline and reset back to the original user's preference after the subtitle insert

Here is the link to the Vegas script called "Whisper Speech To Text v6":

https://www.dropbox.com/scl/fi/5t7pdc16rd3ab65ey2a3r/Whisper-Speech-To-Text-v6.cs?rlkey=y3ueerkpurf6o03amobf3xzkr&dl=0

"Whisper STT RAW v2" (variant):

same script as v6 with the exception that it keeps the original sentence layout "as is" of the .srt file without automatically adding a newline (like a word wrap) after 9 words.

Here is the link to the Vegas script called "Whisper STT RAW v2":

https://www.dropbox.com/scl/fi/nagvbfttmeus4axu88aon/Whisper-STT-RAW-v2.cs?rlkey=ok9lx6bfs9sqso62gqgp9hqbf&dl=0

====================================================================================

The only caveat is that it takes quite a bit of effort to get Whisper installed: it depends on Python, Git, FFmpeg, etc., and on environment variables being set. So you need to install a bunch of supporting software before you can use Whisper, but it is doable. For this purpose, I have put together a document on how to install and use Whisper (and the programs it depends on); it has all the links to get you up and running.

Here is the link to the document on Whisper openAI:

https://www.dropbox.com/s/dh62ripb58xth86/AI%20whisper.docx?dl=0

==================================================================================

update history

Update 18/12/2023: "Whisper Speech To Text v7"

Adds backward compatibility with the old "Sony" Vegas UI plugin naming used for scripting in Vegas 14, 15 and 16 (note: only tested in Vegas 21, not in 14, 15 or 16)

Update 14/12/2023: "Whisper Speech To Text v6" & "Whisper STT RAW v2":

sets timeline timecode format to "Time" to prevent subtitle discrepancies on the timeline and reset back to the original user's preference after the subtitle insert

Update 13/12/2023: "Whisper Speech To Text v5":

Compatibility update: solves subtitle insert fail issue some users have because of different results from whisper:
for those that whisper saves filename + media type extension + .srt
for those that whisper saves filename + .srt (without media type extension .wav .mp4 etc...)
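The two naming variants above can be handled by trying both candidates in turn. The following is a minimal Python sketch of that idea, not the code from the v5 script (which is written in C#); the function name `find_srt` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def find_srt(media_path: str) -> Optional[Path]:
    """Return the .srt file Whisper wrote for media_path, or None if absent."""
    media = Path(media_path)
    candidates = [
        media.with_name(media.name + ".srt"),  # clip.wav -> clip.wav.srt
        media.with_suffix(".srt"),             # clip.wav -> clip.srt
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_srt` with the path of the media file returns whichever variant exists on disk, so the subtitle import does not need to know which naming scheme the installed Whisper version uses.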

On request here is a variant of the "Whisper Speech To Text v5" script called "Whisper STT RAW v1":

same script as v5 with the exception that it keeps the original sentence layout "as is" of the .srt file without automatically adding a newline (like a word wrap) after 9 words.

Update 11/12/2023: "Whisper Speech To Text v4":

Small update, but also a big improvement and bug fix for some users: supports speech to text when the audio media is not located on the same drive as the Vegas project.

Update 10/12/2023: "Whisper Speech To Text v3":

Very small update to support audio filenames with spaces in their names

Update 29/11/2022: "Whisper Speech To Text v2":

Major update: I made a new improved script with UI to select different transcode model options, a translate option, and a UI option to import the subtitles from the generated files to a new track

11 November 2022: Original script "Whisper Speech To Text":

Have fun!

Comments

Former user wrote on 11/11/2022, 9:33 PM

@bitman This is the Nvidia Resolve version that I've been testing; it will work standalone to generate subtitles for Vegas. Maybe it would be possible for you or @jetdv to add scripting for Vegas. Playback is at 2x speed. The file it's transcribing is approx. 4m30s, and it transcribes in 73 seconds. Could you provide a transcription benchmark for the version you're using?

And this is a translation test. I chose a computer-jargon-heavy video as I thought that would show problems. I've only watched it once, but the only problem I noticed is that "Kabylake" becomes "Capylake". Translation of a 6-minute video took 68 seconds. Not sure why the German-to-English translation was faster than English to English.

bitman wrote on 11/12/2022, 4:13 AM

@Former user Can you provide the above media files for download so I can benchmark?

As you have the resolve version of whisper, it may already be working for the Vegas script 'Whisper Speech to Text' I posted 'as is', you can try it out.

Former user wrote on 11/12/2022, 4:18 AM

@bitman https://github.com/octimot/StoryToolkitAI/releases/tag/v0.17.1

Your 3090 should eat this up

edit: oh, media files. I'm just interested in general in how well your version works. Apparently there are multiple versions, and the one I'm using is slow in comparison.

bitman wrote on 11/15/2022, 10:49 AM

@Former user I did some benchmarking on speech to text: first the standard 365 Vegas speech to text, then the Vegas script I posted but with extra arguments for the different models:

For example, to switch the script from the default multi-language model to the "Tiny English only" model, change

sw.WriteLine("whisper " + myFile);

to

sw.WriteLine("whisper " + myFile + " --model tiny.en");
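For anyone calling Whisper outside Vegas, the same command can be assembled as an argument list. This is a minimal Python sketch, not the script's own C# code: the function name `build_whisper_command` is hypothetical, and the use of Whisper's `--output_dir` option to keep the result files next to the media is an assumption about a convenient setup.

```python
import os
from pathlib import Path
from typing import List, Optional


def build_whisper_command(media_path: str, model: Optional[str] = None) -> List[str]:
    """Build the whisper command line as a list of arguments."""
    # An absolute path keeps the call independent of the current drive or folder.
    media = Path(os.path.abspath(media_path))
    command = ["whisper", str(media), "--output_dir", str(media.parent)]
    if model:
        # For example "tiny.en", "base.en" or "medium.en".
        command += ["--model", model]
    return command


print(build_whisper_command("my clip.wav", model="tiny.en"))
```

Passing the list to `subprocess.run(command, check=True)` would start Whisper. Because each argument is a separate list item, a filename containing spaces needs no extra quoting.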

bitman wrote on 11/15/2022, 10:52 AM

Here are the benchmark results (and accuracy scores):

(Update 20/12/2022: benchmark with GPU acceleration added after installing PyTorch)

Kingfisher.wav audio file converted to text: (.srt format)
==========================================
Vegas 365 Speech to text: 42 seconds

Whisper with model argument: suffix ".en" English only
Whisper Speech to text Vegas script (--model tiny.en): 17 seconds (6 seconds with cuda)
Whisper Speech to text Vegas script (--model base.en): 33 seconds (10 seconds with cuda)
Whisper Speech to text Vegas script (--model medium.en): 248 seconds (24 seconds with cuda)

Default model whisper (no arguments)
Whisper Speech to text Vegas script (*): 82 seconds (12 seconds with cuda)

Note (*): default model is multi-language "small"
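To put the CUDA gain in perspective, the speedup per model can be computed from the timings above. A small Python sketch, using only the numbers from this benchmark:

```python
# Timings in seconds copied from the benchmark above: (CPU, CUDA).
timings = {
    "tiny.en": (17, 6),
    "base.en": (33, 10),
    "medium.en": (248, 24),
    "small (default)": (82, 12),
}


def speedup(cpu_seconds: float, cuda_seconds: float) -> float:
    """How many times faster the CUDA run was than the CPU run."""
    return cpu_seconds / cuda_seconds


for model, (cpu, cuda) in timings.items():
    print(f"{model:16s} {speedup(cpu, cuda):.1f}x faster with CUDA")
```

This gives roughly 2.8x for tiny.en, 3.3x for base.en, 10.3x for medium.en and 6.8x for the default model, so on this one file the larger models benefited most from the GPU.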

bitman wrote on 11/15/2022, 10:59 AM

The kingfisher.wav audio file was generated via Vegas's own text to speech tool (English US, female voice Jenny) using a Wikipedia text about kingfishers. I used this generated wav file as the source for the speech to text benchmarks.

You can benchmark yourself, here is the original text:

Kingfishers or Alcedinidae are a family of small to medium-sized, brightly colored birds in the order Coraciiformes. They have a cosmopolitan distribution, with most species found in the tropical regions of Africa, Asia, and Oceania but also can be seen in Europe. They can be found in deep forests near calm ponds and small rivers. The family contains 114 species and is divided into three subfamilies and 19 genera. All kingfishers have large heads, long, sharp, pointed bills, short legs, and stubby tails. Most species have bright plumage with only small differences between the sexes. Most species are tropical in distribution, and a slight majority are found only in forests.

They consume a wide range of prey usually caught by swooping down from a perch. While kingfishers are usually thought to live near rivers and eat fish, many species live away from water and eat small invertebrates. Like other members of their order, they nest in cavities, usually tunnels dug into the natural or artificial banks in the ground. Some kingfishers nest in arboreal termite nests. A few species, principally insular forms, are threatened with extinction. In Britain, the word "kingfisher" normally refers to the common kingfisher.

bitman wrote on 11/15/2022, 11:32 AM

I noticed there is a bug in Vegas 365 native speech to text transcript: a complete sentence was omitted in the .srt file:

The family contains 114 species and is divided into three subfamilies and 19 genera.

In the kingfisher example, whisper was more accurate, did not have a missing sentence and was mostly faster (*)

(*) The speed depends on the model used, as seen in the benchmark. Note that Whisper needs to download the model into its cache first if the model has not been used before, which adds some seconds or minutes. Subsequent transcriptions using the same model are faster, as Whisper does not need to download it again.
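To see which models are already downloaded, and so will not cause that first-run delay, the cache folder can be listed. A minimal Python sketch; the default location `.cache/whisper` under the user's home folder and the `.pt` file extension are assumptions about the openai-whisper package, and `cached_models` is a hypothetical helper name.

```python
from pathlib import Path
from typing import List, Optional


def cached_models(cache_dir: Optional[Path] = None) -> List[str]:
    """List model names that already have a downloaded .pt file in cache_dir."""
    if cache_dir is None:
        # Assumption: default download folder of the openai-whisper package.
        cache_dir = Path.home() / ".cache" / "whisper"
    if not cache_dir.is_dir():
        return []
    # "tiny.en.pt" -> "tiny.en"
    return sorted(p.stem for p in cache_dir.glob("*.pt"))


print(cached_models())
```

An empty list means either nothing has been downloaded yet or the cache lives somewhere else on that machine.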

Former user wrote on 11/15/2022, 6:30 PM

Now that's a comprehensive benchmark! 😀👍

Would you have any idea what the lowest VRAM requirement is, and does that dictate which models can be used? The large model is 3 GB; I'm guessing it all needs to stay in VRAM.

I was interested in how fast/well the Vegas version worked. Wonder if it's just not as good as whisper or they compromised by choosing speed over accuracy, using a smaller AI model.

So I guess we need a windows exe version coupled with a Vegas script. The route you took is beyond most people. I was following step by step instruction for installing the python version and failed to get it to work, mainly because I was blindly following a guide but had no idea what anything did, I'm guessing something was left out, or incorrect.

bitman wrote on 11/16/2022, 2:44 AM

@Former user I feel your frustration! It takes a bit of effort to install all of it to make whisper for Vegas work, but if you manage, you can also use it stand alone in windows. I am using windows 11 (22H2). Best is to follow the whisper document (link in the original post) which contains all the install instructions (if you have not already followed it).

One of the main reasons something such as Python or FFmpeg does not work after install is usually that the environment variable path is not set for the application. This is typically something you sometimes must add manually (if it was not included in an install package). You cannot call an application from the Windows console if you are not in the same spot (= directory) as where the application is stored. That is why you need to add the path to the application in the environment variables, so you can start the application from the console from anywhere.
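A quick way to check whether each program can be found through the path environment variable is Python's standard `shutil.which`. This is a minimal sketch; the list of tool names is an assumption based on the dependencies mentioned in this thread, and `locate_tools` is a hypothetical helper name.

```python
import shutil
from typing import Dict, Iterable, Optional


def locate_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, location in locate_tools(["python", "git", "ffmpeg", "whisper"]).items():
    print(f"{tool:8s} {location or 'NOT FOUND on PATH'}")
```

Any tool reported as not found either is not installed or is installed in a folder that still needs to be added to the path.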

joelsonforte.br wrote on 11/16/2022, 4:56 AM

@bitman Thanks for the procedure and the script... Unfortunately I couldn't make it work here.

@Former user The latest version of Subtitle Edit has a BETA option that uses Whisper to convert speech to text. It is necessary to follow some steps, and install Whisper using CMD to activate the option, but if you follow the instructions provided in Subtitle Edit when trying to use this option, I believe everything will work fine. If you have difficulties, let me know and I'll make a screen recording showing you step by step.

The biggest advantage of using Whisper in Subtitle Edit is that you can use all the tools of the Subtitle Edit to edit the .srt file.

Former user wrote on 11/16/2022, 4:56 PM

@joelsonforte.br That's really well-written, polished software and a pleasure to use, and it's still beta. Installation is very simple: I love how it downloads and installs third-party modules itself, with no need to restart. This is most likely the answer currently for Vegas people who want automated AI-generated subs but don't want a Vegas subscription.

The biggest advantage of using Whisper in Subtitle Edit is that you can use all the tools of the Subtitle Edit to edit the .srt file.

Yeah, turning a non-integration negative into a positive: fix any errors in purpose-built software before export. A Whisper problem seems to be that when it encounters a lot of voices at one time, it can lose sync, but if you only imported a wav file you won't know it lost sync. A low-resolution video could be used instead to assess sync issues with its built-in video player, if the goal is to export a perfect .srt without the need to edit it within Vegas.

https://github.com/SubtitleEdit/subtitleedit/releases

Former user wrote on 11/18/2022, 7:36 PM

@joelsonforte.br Is it as it appears, you can't actually do direct translations via whisper in Subtitle Edit currently?

It looks like you have to use whisper to get AI generated subs in the original language, then use google translate to translate the text. There's 2 points of potential error instead of 1. I tried ticking the translate box in the whisper prompt but it says 'no text'

bitman wrote on 11/19/2022, 4:47 AM

@Former user @joelsonforte.br Translation in Vegas using my Vegas script "Whisper Speech To Text" works with automatic translation to English (if you add --task translate in the original script):

Below is the link to a video how the script was able to translate from German

https://www.dropbox.com/s/ms42ljnslel210p/German.mp4?dl=0

bitman wrote on 11/19/2022, 5:04 AM

@wwaag Maybe just a thought, you could add whisper to the happy otter toolset, seems like a perfect match to me! It certainly would benefit from easier installation and UI panels!

joelsonforte.br wrote on 11/19/2022, 9:03 AM

One of the main reasons something such as Python or FFmpeg does not work after install is usually that the environment variable path is not set for the application. This is typically something you sometimes must add manually (if it was not included in an install package). You cannot call an application from the Windows console if you are not in the same spot (= directory) as where the application is stored. That is why you need to add the path to the application in the environment variables, so you can start the application from the console from anywhere.

@bitman

I have good news. I was able to get your script to work here.

But I believe I have this problem with the environment variables you mentioned. I have two local disks (C and D). But the script only works correctly if the files are on local disk C.

Can you tell me how to fix this problem? Or send me a link showing step by step what I have to do.

I too tested the translation feature and it works fine. The video had audio in Portuguese and was translated into English. How do I choose other languages? For example, I want to translate a video that has the audio in English into Portuguese.

bitman wrote on 11/19/2022, 11:20 AM

@joelsonforte.br Good to hear! The translation (if used with the extra translate argument added) as far as I know, only goes one way: from foreign spoken language to English text...

There is a language argument you can add, but I would guess it mainly helps Whisper analyze faster. By default (= without model arguments) the model used is the multi-language 'small' model. This is the model I used in the original script. It will auto-detect the language and is a good compromise between speed and accuracy. In my document you can find the extra model arguments.

Pure English speech is probably better served than the multilanguage default by adding a model argument:

--model base.en or --model tiny.en as specific English-only model arguments
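The translate and language arguments discussed here can be combined on one command line. A minimal Python sketch of how the extra arguments could be put together; the helper name `translation_options` and the file name `interview.mp4` are hypothetical, while `--task translate` and `--language` are options of the Whisper command line.

```python
from typing import List, Optional


def translation_options(translate: bool = False,
                        language: Optional[str] = None) -> List[str]:
    """Return the extra whisper arguments for translation and spoken language."""
    options: List[str] = []
    if translate:
        # Translation only goes one way: foreign speech to English text.
        options += ["--task", "translate"]
    if language:
        # Spoken language of the audio, for example "de" for German.
        options += ["--language", language]
    return options


print(" ".join(["whisper", "interview.mp4"] + translation_options(translate=True, language="de")))
# prints: whisper interview.mp4 --task translate --language de
```

Leaving out `language` lets Whisper auto-detect the spoken language, as described above.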

bitman wrote on 11/19/2022, 11:36 AM

@joelsonforte.br with regard to environment variables: here are a few steps you can follow:

1) on your keyboard: press windows key + s

2) this will bring up the search window, next type: system variables into the search bar.

3) the System Properties panel opens, select the advanced tab (if not already shown)

4) click on Environment Variables

5) click on path (of user variables) then click on edit button, then on a new line you can type in the full path C:\ blabla of the application .exe location such as python, FFmpeg

6) you may need to restart your PC

Note: sometimes you need to add a path in the system variables as well.

joelsonforte.br wrote on 11/19/2022, 11:44 AM

@joelsonforte.br Is it as it appears, you can't actually do direct translations via whisper in Subtitle Edit currently?

It looks like you have to use whisper to get AI generated subs in the original language, then use google translate to translate the text. There's 2 points of potential error instead of 1. I tried ticking the translate box in the whisper prompt but it says 'no text'

@Former user Just use the Auto Translate option of Subtitle Edit.

The biggest advantage of this option is that text generated by Whisper can be translated into a wide variety of languages and not just English, which is apparently the only translation available in Whisper, as @bitman mentioned.

Former user wrote on 11/19/2022, 6:51 PM

@Former user @joelsonforte.br Translation in Vegas using my Vegas script "Whisper Speech To Text" works with automatic translation to English (if you add --task translate in the original script):

Below is the link to a video how the script was able to translate from German

https://www.dropbox.com/s/ms42ljnslel210p/German.mp4?dl=0

@bitman Very nice, but we still need a Windows executable version to marry with your script to help the majority of Vegas users. I tried your script with the Resolve Nvidia Windows exe, but there is no communication with it.

@todd-b Just use the Auto Translate option of Subtitle Edit.

The biggest advantage of this option is that text generated by Whisper can be translated into a wide variety of languages and not just English, which is apparently the only translation available in Whisper, as @bitman mentioned.

@joelsonforte.br But is there also a downside to using Whisper to transcribe to text in the original language and then Google Translate to convert that text to another language? I'd rather Google Translate not be part of the process, although in the few translations I've done using Subtitle Edit it hasn't shown any glaring issues. I would rather Whisper do it all, the thinking being that I'd only need to worry about Whisper inaccuracies, not Whisper + Google Translate inaccuracies. It's still beta though.

I would rather use subtitle edit as it makes fixing errors so easy, but this Resolve version in standalone mode, This is Russian to English.

bitman wrote on 11/21/2022, 3:57 AM

@Former user Indeed it is currently not the most convenient install procedure, certainly for the lesser computer savvy which I presume is the majority of Vegas users, but it is certainly doable for those who are already dabbling with scripting. That said, I am working on a new variant of the script with some option panels.

bitman wrote on 11/29/2022, 2:05 PM

Update 29/11/2022:

I finally got around to extending the original script with option panels and the capability to import the subtitles from the generated files to a new track.
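Importing subtitles from a generated .srt file comes down to reading each cue's start time, end time and text. The following is a minimal Python sketch of that parsing step only, not the script's own C# code; the sample cues are made up for illustration and the function names are hypothetical.

```python
import re
from typing import List, Tuple

# Matches an .srt timing line such as "00:00:04,500 --> 00:00:09,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3}) --> (\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(hours: str, minutes: str, seconds: str, millis: str) -> float:
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_srt(text: str) -> List[Tuple[float, float, str]]:
    """Return (start_seconds, end_seconds, subtitle_text) for every cue."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for index, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                parts = match.groups()
                start, end = to_seconds(*parts[:4]), to_seconds(*parts[4:])
                cues.append((start, end, "\n".join(lines[index + 1:])))
                break
    return cues


# Illustrative sample, not real Whisper output.
sample = """1
00:00:00,000 --> 00:00:04,500
Kingfishers are a family of birds.

2
00:00:04,500 --> 00:00:09,000
They have a cosmopolitan distribution.
"""

for start, end, text in parse_srt(sample):
    print(start, end, text)
```

Each tuple holds what is needed to place one subtitle on a track: where it starts, where it ends, and what it says.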

Links to the new script:

Here is the link to the new Vegas script called "Whisper Speech To Text v2":

https://www.dropbox.com/s/qxpaj4c8ybnlxgq/Whisper%20Speech%20To%20Text%20v2.cs?dl=0

release notes: Whisper Speech To Text (v2)
------------------------------------------------------------
This script will:

(1) convert speech to text on a selected event on the timeline,
and store .txt, .srt and .vtt result files
in the folder where the original media is stored.

(2) whisper transcode options for optimal speed versus accuracy
are selectable in UI (new!)

(3) a translate option (multilanguage speech to English) is also UI selectable (new!)

(4) add subtitles in new track based on .srt file just generated or
use an existing .srt file if it is already in folder (new!)

(5) long subtitles are automatically split on a newline after 9 words (new!)
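The split described in point (5) can be pictured as follows. This is a minimal Python sketch that assumes the rule is simply "start a new line after every 9 words"; the script's exact behavior may differ, and the function name `wrap_after_words` is hypothetical.

```python
def wrap_after_words(text: str, max_words: int = 9) -> str:
    """Insert a newline after every max_words words."""
    words = text.split()
    lines = [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
    return "\n".join(lines)


subtitle = ("All kingfishers have large heads, long, sharp, pointed bills, "
            "short legs, and stubby tails.")
print(wrap_after_words(subtitle))
# prints:
# All kingfishers have large heads, long, sharp, pointed bills,
# short legs, and stubby tails.
```

The "RAW" variants of the script skip this step and keep the line layout of the .srt file as Whisper wrote it.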

bitman wrote on 11/29/2022, 2:14 PM

joelsonforte.br wrote on 11/30/2022, 6:14 AM

@bitman

Thanks for the Script update. It's working correctly here. I imagine it must have been a lot of work to get this all working, but you got it right.

bitman wrote on 11/30/2022, 12:29 PM

@joelsonforte.br Thanks, I am glad you like it. I am not a software programmer per se, although as an engineer, I tested, debugged/patched telecom software for 17 years (in assembler no less) and an additional 15 years of Java, a bit of Python and some PowerShell to maintain and expand inhouse developed test automation software. At least I know how to test software after 3 decades, you can imagine the script was thoroughly tested!

For the script itself I did not know C# at all, but some googling and looking at other scripts can get you started if you have some experience in other programming languages.