How to Transcribe Podcasts with AI in Python
By Tinker Assist
In this blog, we will be using OpenAI's Whisper model to transcribe podcast audio.
Downloading the Audio
First, we'll need a tool to download podcast audio from YouTube. We'll use yt-dlp for this.
Go ahead and install yt-dlp from https://github.com/yt-dlp/yt-dlp/releases/. On Windows, I simply downloaded the .exe, placed it in Program Files, and added that path to my environment variables.
To test that the yt-dlp CLI is working, run the following command in your terminal:
yt-dlp --help
If you get a whole bunch of help text, you're good to go!
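If you'd rather skip the manual download, yt-dlp is also published on PyPI, so you can install it with pip and verify it the same way (either route works; the rest of this post only needs the yt-dlp command on your PATH):
pip3 install yt-dlp
yt-dlp --version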
Python Script
Now that we have yt-dlp installed, we'll start writing our Python script to download the audio and transcribe it.
Install Python Dependencies
PyTorch
First, let's install PyTorch for our machine. I recommend navigating to https://pytorch.org/ and selecting the right preferences for your machine (OS, package manager, compute platform). Using a device with a GPU is strongly recommended, as inference will be much faster.
For me, the command was:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
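Once that finishes, a quick sanity check confirms whether PyTorch can actually see a CUDA GPU; on a CPU-only machine this will simply print False:
import torch
print(torch.__version__)         # installed PyTorch version
print(torch.cuda.is_available()) # True if a CUDA GPU is visible to PyTorch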
Other Dependencies
Now, we can install the other dependencies: transformers (which provides the Whisper model) and pydub (which we'll use to trim the audio).
pip3 install transformers pydub
Writing the Script
Download Audio
We are going to use os.system to run the yt-dlp CLI and download our podcast to the current directory.
Here is what each of the flags we will use means:
* --quiet - reduce the terminal output from yt-dlp
* -o {name} - output the file as "{name}.<extension>"
* -x - extract audio only (this step requires FFmpeg to be installed)
* --audio-format "wav" - convert the audio to .wav format
* {link} - link to the YouTube video
import os
name = 'test'
link = 'YOUTUBE_LINK_HERE'
os.system(f'yt-dlp --quiet -o {name} -x --audio-format "wav" {link}')
When you run this, the audio from the provided link will show up as test.wav in your current directory.
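One caveat: os.system passes the whole string through your shell, so a URL containing characters like & or ? can get mangled. If that happens, subprocess.run with an argument list (reusing the same name and link variables) is a slightly more robust way to make the same call:
import subprocess
# same flags as above, but passed as a list so the shell never touches the URL
subprocess.run(
    ['yt-dlp', '--quiet', '-o', name, '-x', '--audio-format', 'wav', link],
    check=True,  # raise an error if yt-dlp exits with a non-zero status
)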
Now, we want to segment the audio so that we only transcribe a small portion of the podcast for this example.
We'll use the pydub library to accomplish this, creating a new .wav file from our original one. In this example, the trimmed file will be named test_0-60s.wav.
from pydub import AudioSegment
start_time = 0
end_time = 60
audio = AudioSegment.from_wav(name + '.wav')
if audio.duration_seconds > end_time:
    # pydub slices audio in milliseconds, hence the *1000
    audio = audio[start_time*1000:end_time*1000]
    trimmed_audio_name = f'{name}_{start_time}-{end_time}s'
    audio.export(trimmed_audio_name + '.wav', format="wav")
else:
    trimmed_audio_name = name
Now, we'll use Whisper to transcribe the audio.
First, let's determine which device we are going to use for inference. As mentioned, we prefer cuda, since inference on a GPU is much faster than on a CPU.
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Now, let's load the Whisper model from the Hugging Face transformers library.
from transformers import WhisperProcessor, WhisperForConditionalGeneration
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model_text = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
# don't force language/task tokens; let Whisper detect them from the audio
model_text.config.forced_decoder_ids = None
Now, we must resample our audio to a sampling rate our model can process; Whisper expects 16 kHz audio.
import torchaudio
speech_array, original_sampling_rate = torchaudio.load(trimmed_audio_name + '.wav')
# resample the audio to the 16 kHz rate our audio model expects
sampling_rate = 16000 # Hz
resampler = torchaudio.transforms.Resample(original_sampling_rate, sampling_rate)
speech = resampler(speech_array.mean(dim=0)).numpy() # mix down to mono, then resample
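As a quick sanity check, the length of the resampled array divided by the sampling rate should match the duration of the trimmed clip, roughly 60 seconds in this example:
print(f'{len(speech) / sampling_rate:.1f} seconds of audio at {sampling_rate} Hz')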
Now that our audio is at the correct sampling rate, let's run it through the model to transcribe it.
Since Whisper can only process a limited window of audio per forward pass (about 30 seconds), we will transcribe the audio in 10 second increments and stitch the transcribed chunks together. (This is certainly a crude solution since it can split words at chunk boundaries, but it works.)
transcription = '' # init transcription and we'll append to it on each loop
chunk_time = 10 # seconds
for i in range(0, len(speech), sampling_rate*chunk_time):
    last = min(i+sampling_rate*chunk_time, len(speech))
    input_features = processor(speech[i:last], sampling_rate=sampling_rate, return_tensors="pt").input_features
    predicted_ids = model_text.generate(input_features.to(device))
    transcription += processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0] + ' '
print(transcription)
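If you'd rather keep the result than just print it, writing the transcript to a text file next to the audio only takes a couple of lines:
# save the transcript next to the audio (the filename is just a suggestion)
with open(trimmed_audio_name + '.txt', 'w', encoding='utf-8') as f:
    f.write(transcription)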
That's it! Try running on the first minute or two of your favorite podcast and see how it works.
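As an aside, if the 10-second chunking above feels too crude, the transformers pipeline API can do the chunking and resampling for you. Here is a minimal sketch using the same whisper-small checkpoint and the trimmed .wav file:
import torch
from transformers import pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,  # let the pipeline split long audio into 30-second windows
    device=0 if torch.cuda.is_available() else -1,
)
result = pipe(trimmed_audio_name + '.wav')
print(result["text"])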
Full Script
import os
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio
from pydub import AudioSegment
name = 'test'
link = 'YOUTUBE_LINK_HERE'
start_time = 0
end_time = 60
# Download the audio from the link using yt-dlp
os.system(f'yt-dlp --quiet -o {name} -x --audio-format "wav" {link}')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cpu":
    msg = "cuda not detected as an available compute platform.\n This may take a long time."
    print(msg)
    #TODO add some time estimates for cpu vs gpu
audio = AudioSegment.from_wav(name + '.wav')
# segment the downloaded audio if we are not using the full podcast
if audio.duration_seconds > end_time:
    audio = audio[start_time*1000:end_time*1000]
    trimmed_audio_name = f'{name}_{start_time}-{end_time}s'
    audio.export(trimmed_audio_name + '.wav', format="wav")
else:
    trimmed_audio_name = name
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model_text = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
model_text.config.forced_decoder_ids = None
speech_array, original_sampling_rate = torchaudio.load(trimmed_audio_name + '.wav')
# resample the audio to a frequency our audio model can handle
sampling_rate = 16000 # Hz
resampler = torchaudio.transforms.Resample(original_sampling_rate, sampling_rate)
speech = resampler(speech_array.mean(dim=0)).numpy() # mix down to mono, then resample
transcription = '' # init transcription and we'll append to it on each loop
chunk_time = 10 # seconds
for i in range(0, len(speech), sampling_rate*chunk_time):
    last = min(i+sampling_rate*chunk_time, len(speech))
    input_features = processor(speech[i:last], sampling_rate=sampling_rate, return_tensors="pt").input_features
    predicted_ids = model_text.generate(input_features.to(device))
    transcription += processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0] + ' '
print(transcription)