Thanks for the recommendation. I downloaded the Windows .exe, but it just loads and exits without doing anything. The instructions don't explain where to put a model or the media to be transcribed, or how to trigger the transcription process without Resolve Studio.
Former user
wrote on 12/23/2022, 3:21 AM
@RogerS It downloads the model itself after you make your choice, but shutting down straight away doesn't sound right. I have used it a number of times for testing in standalone mode. I do have Resolve Studio, but I had thought that if it wasn't running, it was the same as not having it installed.
(The two versions that are supposed to show, depending on whether Resolve is loaded.)
Mine stops at the CUDA message and then quits after a few minutes. I have Resolve installed, but not Studio. I've never seen this GUI, just the text above it.
@bitman @Subtitler22 or anyone else using the bitman Vegas version of Whisper: if you have time, could you download the clip and add the subtitles your version creates? Don't modify anything; I'm interested in raw outputs. Yellow = DaVinci Resolve version (standalone mode), and white = Subtitle Edit. (Actually, you could paste your .srt file instead.)
Here is the result of the Vegas script with Whisper; I used a variant with --model large --task translate. This ran in 27 seconds on my PC. By default it uses my 24 GB GPU (since I installed PyTorch a few days ago).
Here is the copied .srt content:
1
00:00:00,000 --> 00:00:15,000
Our unit today has more than 50 targets, including work on NATO weapons.

2
00:00:16,000 --> 00:00:25,000
If we say that at the initial stage we did not understand what it was, then now we are already free to work on Kymars.

3
00:00:25,000 --> 00:00:35,000
We have passed a new program, and these so-called Kymars are now, as usual, aerodynamic targets.

4
00:00:35,000 --> 00:00:55,000
We are free to strike, observe and destroy without any problems.
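Since we're comparing raw .srt outputs, here is a minimal sketch for pulling the cues out of a pasted file. This is not part of any of the tools above; the regex assumes the comma-millisecond timestamps and single-line cue text shown in the paste:

```python
import re

# Matches the "HH:MM:SS,mmm --> HH:MM:SS,mmm" timing line plus the text after it.
_CUE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s+(.*)")

def parse_srt(text):
    # Return (start, end, text) tuples from a pasted .srt; single-line cues only.
    return [m.groups() for m in _CUE.finditer(text)]
```

Handy for diffing two transcriptions cue by cue instead of eyeballing them.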
Update:
Running the --model large variant a second or third time, the speed is now 19 s instead of 27 s for the first run. (I guess the large model had to be downloaded the first time and now resides in memory for consecutive runs, which speeds things up.)
Here is another result of the Vegas script with Whisper; I used a faster variant with --model medium --task translate. This ran in 12 seconds on my PC. By default it uses my 24 GB GPU (since I installed PyTorch a few days ago).
Here is the copied .srt content:
1
00:00:00,000 --> 00:00:15,000
Today our unit has more than 50 targets, including the work on NATO weapons in Himmars.

2
00:00:15,000 --> 00:00:25,000
If we say that at the initial stages we did not understand what it was, then now we are working freely on the work in Himmars.

3
00:00:25,000 --> 00:00:41,000
We have a new program, and these so-called Himmars now, as usual aerodynamic targets, we freely see, observe and destroy without problems.
If you have a GPU with at least 4 GB of memory, you can lower the transcription time by making Whisper transcribe on the GPU instead of the CPU, using PowerShell. I bought an 8 GB GPU and installed it today; it can transcribe a one-hour Japanese video in 30 minutes or less, depending on how much speech there is, using --model medium. You mentioned that one minute of audio takes you two minutes to process. Using a GPU, you can expect one minute of audio to process in 30 seconds or less.
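For anyone unsure whether the GPU will actually be picked up, here is a quick sketch of a check; it only assumes that PyTorch may or may not be installed:

```python
def cuda_available():
    # True only if PyTorch is installed and can see a CUDA GPU.
    # Without PyTorch, Whisper falls back to the CPU, as described above.
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False
```

If this returns False, installing the CUDA build of PyTorch is the missing step.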
I do have GPUs with more than 4 GB of RAM. When I read into Whisper's implementation with Subtitle Edit, it said it was CPU-only. I was wondering if there was a way to leverage the GPU; how do you do that? PowerShell looks like a command prompt, so does this not involve Subtitle Edit at all?
Whisper doesn't use the CPU very efficiently, only using 4 cores. I figured out I can run three instances at once to max out my CPU, and with the batch feature I could leave it running all day. As a bonus, it heated my room without the use of a heater (drawing about 200 watts for ~10 hours straight).
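The run-three-at-once idea can be sketched as a small batch runner; the `whisper` command, the flags, and the file names are placeholders for whatever you actually run:

```python
import subprocess

def batch_run(files, jobs=3, cmd=("whisper",), extra=("--model", "medium")):
    # Launch up to `jobs` transcriptions at once and wait for each batch
    # to finish before starting the next one.
    codes = []
    for i in range(0, len(files), jobs):
        procs = [subprocess.Popen([*cmd, f, *extra]) for f in files[i:i + jobs]]
        codes += [p.wait() for p in procs]
    return codes  # one exit code per input file
```

Point it at a folder's worth of files and it will keep three Whisper processes busy until the list is done.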
@RogerS I am not using Subtitle Edit and prefer to do the transcription work using cmd or PowerShell.
I installed Whisper AI according to @bitman's excellent PDF instructions. If you want to try that, first check whether you have Python installed and which version. For Whisper you MUST use version 3.10.7; I tried newer versions without success. During the Python install, consider using the custom install. This gives you the chance to install it in a subfolder on your C: drive instead of the default user location, which tends to be messy, at least for me.
Python installs pip as well. After the Python installation, check that both have been added to the PATH using cmd: check Python by typing py, then open cmd again and type pip. If you get a proper response, all is well; otherwise you have to check the path in the system environment variables.
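The same PATH check can be done from Python's standard library; a sketch, using the launcher names from the post above:

```python
import shutil

def on_path(tools=("py", "pip")):
    # Map each expected launcher to its resolved location on PATH,
    # or None if Windows/cmd would not find it.
    return {tool: shutil.which(tool) for tool in tools}
```

Any None in the result means that entry is missing from the system environment variables.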
Please note that you can install Whisper directly using pip instead of from within Python.
Install all other software and then test whisper using cmd.
If all goes well, this means Whisper has installed Torch as well. This works fine on the CPU, but to use the GPU you must uninstall the CPU-only Torch with pip3 uninstall torch and then install the CUDA build of PyTorch (pytorch.org has the exact command for your CUDA version).
Now you can use Whisper from PowerShell or cmd and it will use your GPU by default; if you want to force it to use the CPU, just add --device cpu to the syntax.
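The invocations described in this thread can be sketched as a small command builder; the file name and defaults here are examples, not taken from anyone's setup:

```python
def whisper_cmd(media, model="medium", task="transcribe", device=None):
    # Assemble the Whisper CLI call described above. Omit `device` to let
    # Whisper pick the GPU by default, or pass "cpu" to force the CPU.
    cmd = ["whisper", media, "--model", model, "--task", task]
    if device:
        cmd += ["--device", device]
    return cmd
```

For example, whisper_cmd("clip.mp4", model="large", task="translate") reproduces the translate runs earlier in the thread; hand the list to subprocess.run to execute it.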
Try transcribing a small audio file with Whisper using PowerShell run as administrator. This gives you the chance to see how Python behaves in the Task Manager.
I usually close all other applications except the Task Manager.
I must add that my motherboard is quite old, from 2013, with an Intel i5 CPU. Adding a new 8 GB GPU made it possible to transcribe efficiently and surprisingly fast even on an old mobo. My GPU is PCI-E 4, and I assume my mobo runs it at PCI-E 3 speeds, which halves the per-lane bandwidth. I might build a new PC with PCI-E 4 support for even faster transcription.
@RogerS Whisper via my Vegas script now runs on the GPU if you install PyTorch (just one more step 🤓). I have updated the install instructions to add PyTorch.
The Vegas script is really convenient to use (once everything is installed); it is faster and more accurate than 365. There is also no need for Resolve, and you can edit the script code yourself to your liking to add more Whisper parameters for fine-tuning. I estimate you can complete all the install steps in about 2 hours.
@bitman Thanks for that 👍. Yours shows the same unexpected behavior: the medium dictionary knows Russian better than the large dictionary. I've found a number of inconsistencies with the large multilingual dictionary. The large model was updated 2 weeks ago, but I'm unsure whether it's just a matter of downloading the large dictionary again to get v2, or whether you need a new version of Whisper. I haven't even skimmed the info yet; hopefully you're still using v1 and we can all look forward to much better results from the large dictionary.
As for speed, yeah, these versions are amazing. I noticed that although I installed Whisper on my F: drive, the dictionaries (copies?) seem to be kept in the boot-drive cache, e.g. large.pt and medium.pt. I have a large.bin on the F: drive, so possibly it gets decompressed from the 2 GB .bin to form the 3 GB .pt file on the boot drive. If it remains uncompressed on the boot drive, it should stay faster.
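If you want to see what is actually sitting in that boot-drive cache, here is a sketch; the default cache location (XDG_CACHE_HOME, else ~/.cache/whisper) is an assumption and may differ per install:

```python
import os
from pathlib import Path

def cached_models():
    # Whisper's default model-download cache (assumed location); returns the
    # names of any .pt model files found there, e.g. large.pt, medium.pt.
    cache = Path(os.environ.get("XDG_CACHE_HOME", Path.home() / ".cache")) / "whisper"
    return sorted(p.name for p in cache.glob("*.pt")) if cache.exists() else []
```

Running this before and after a transcription shows which model was downloaded and where the disk space went.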
Anyway I"ll look some more into large dictionary. Uses 9gb of 10gb VRAM which is bad news for Vegas users where it eats up lots of VRAM and doesn't let it go. With 24GB of Vram you have nothing to worry about 😺