How to Transcribe Podcasts with AI in Python
By Tinker Assist
In this blog, we will be using OpenAI's Whisper model to transcribe podcast audio.
Downloading the Audio
First, we'll need a tool to download podcast audio from YouTube. We'll use yt-dlp for this.
Go ahead and install yt-dlp from https://github.com/yt-dlp/yt-dlp/releases/. On Windows, I simply downloaded the .exe, placed it in Program Files, and added that path to my environment variables.
To test that the yt-dlp CLI is working, run the following command in your terminal:
yt-dlp --help
If you get a whole bunch of help text, you're good to go!
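If you'd rather skip the manual download, yt-dlp is also published on PyPI, so you can install it with pip and verify it the same way (either route works; the rest of this post only needs the yt-dlp command on your PATH):
pip3 install yt-dlp
yt-dlp --version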
Python Script
Now that we have yt-dlp installed, we'll start writing our Python script to download the audio and transcribe it.
Install Python Dependencies
PyTorch
First, let's install PyTorch for our machine. I recommend navigating to https://pytorch.org/ and selecting the right preferences for your machine (OS, package manager, compute platform). Using a device with a GPU is strongly recommended, as inference will be much faster.
For me, the command was:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
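Once that finishes, a quick sanity check confirms whether PyTorch can actually see a CUDA GPU; on a CPU-only machine this will simply print False:
import torch
print(torch.__version__)         # installed PyTorch version
print(torch.cuda.is_available()) # True if a CUDA GPU is visible to PyTorch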
Other Dependencies
Now, we can install the other dependencies: transformers (which provides the Whisper model) and pydub (which we'll use to trim the audio).
pip3 install transformers pydub
Writing the Script
Download Audio
We are going to use os.system to run the yt-dlp CLI and download our podcast to the current directory.
Here is what each of the flags we will use means:
* --quiet - reduce the terminal output from yt-dlp
* -o {name} - output the file as "{name}.<extension>"
* -x - extract audio only (this step requires FFmpeg to be installed)
* --audio-format "wav" - convert the audio to .wav format
* {link} - link to the YouTube video
import os
name = 'test'
link = 'YOUTUBE_LINK_HERE'
os.system(f'yt-dlp --quiet -o {name} -x --audio-format "wav" {link}')
When you run this, the audio from the provided link will show up as test.wav in your current directory.
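One caveat: os.system passes the whole string through your shell, so a URL containing characters like & or ? can get mangled. If that happens, subprocess.run with an argument list (reusing the same name and link variables) is a slightly more robust way to make the same call:
import subprocess
# same flags as above, but passed as a list so the shell never touches the URL
subprocess.run(
    ['yt-dlp', '--quiet', '-o', name, '-x', '--audio-format', 'wav', link],
    check=True,  # raise an error if yt-dlp exits with a non-zero status
)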
Now, we want to segment the audio so that we only transcribe a small portion of the podcast for this example.
We'll use the pydub library to accomplish this, creating a new .wav file from our original one. In this example, the trimmed file will be named test_0-60s.wav.
from pydub import AudioSegment
start_time = 0
end_time = 60
audio = AudioSegment.from_wav(name + '.wav')
if audio.duration_seconds > end_time:
    # pydub slices audio in milliseconds, hence the *1000
    audio = audio[start_time*1000:end_time*1000]
    trimmed_audio_name = f'{name}_{start_time}-{end_time}s'
    audio.export(trimmed_audio_name + '.wav', format="wav")
else:
    trimmed_audio_name = name
Now, we'll use Whisper to transcribe the audio.
First, let's determine which device we are going to use for inference. As mentioned, we prefer cuda, since inference on a GPU is much faster than on a CPU.
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
Now, let's load the Whisper model from the Hugging Face transformers library.
from transformers import WhisperProcessor, WhisperForConditionalGeneration
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model_text = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
# don't force language/task tokens; let Whisper detect them from the audio
model_text.config.forced_decoder_ids = None
Now, we must resample our audio to a sampling rate our model can process; Whisper expects 16 kHz audio.
import torchaudio
speech_array, original_sampling_rate = torchaudio.load(trimmed_audio_name + '.wav')
# resample the audio to the 16 kHz rate our audio model expects
sampling_rate = 16000 # Hz
resampler = torchaudio.transforms.Resample(original_sampling_rate, sampling_rate)
speech = resampler(speech_array.mean(dim=0)).numpy() # mix down to mono, then resample
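As a quick sanity check, the length of the resampled array divided by the sampling rate should match the duration of the trimmed clip, roughly 60 seconds in this example:
print(f'{len(speech) / sampling_rate:.1f} seconds of audio at {sampling_rate} Hz')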
Now that our audio is at the correct sampling rate, let's run it through the model to transcribe it.
Since Whisper can only process a limited window of audio per forward pass (about 30 seconds), we will transcribe the audio in 10 second increments and stitch the transcribed chunks together. (This is certainly a crude solution since it can split words at chunk boundaries, but it works.)
transcription = '' # init transcription and we'll append to it on each loop
chunk_time = 10 # seconds
for i in range(0, len(speech), sampling_rate*chunk_time):
    last = min(i+sampling_rate*chunk_time, len(speech))
    input_features = processor(speech[i:last], sampling_rate=sampling_rate, return_tensors="pt").input_features
    predicted_ids = model_text.generate(input_features.to(device))
    transcription += processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0] + ' '
print(transcription)
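If you'd rather keep the result than just print it, writing the transcript to a text file next to the audio only takes a couple of lines:
# save the transcript next to the audio (the filename is just a suggestion)
with open(trimmed_audio_name + '.txt', 'w', encoding='utf-8') as f:
    f.write(transcription)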
That's it! Try running on the first minute or two of your favorite podcast and see how it works.
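As an aside, if the 10-second chunking above feels too crude, the transformers pipeline API can do the chunking and resampling for you. Here is a minimal sketch using the same whisper-small checkpoint and the trimmed .wav file:
import torch
from transformers import pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,  # let the pipeline split long audio into 30-second windows
    device=0 if torch.cuda.is_available() else -1,
)
result = pipe(trimmed_audio_name + '.wav')
print(result["text"])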
Full Script
import os
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio
from pydub import AudioSegment
name = 'test'
link = 'YOUTUBE_LINK_HERE'
start_time = 0
end_time = 60
# Download the audio from the link using yt-dlp
os.system(f'yt-dlp --quiet -o {name} -x --audio-format "wav" {link}')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cpu":
    msg = "cuda not detected as an available compute platform.\n This may take a long time."
    print(msg)
    #TODO add some time estimates for cpu vs gpu
audio = AudioSegment.from_wav(name + '.wav')
# segment the downloaded audio if we are not using the full podcast
if audio.duration_seconds > end_time:
    audio = audio[start_time*1000:end_time*1000]
    trimmed_audio_name = f'{name}_{start_time}-{end_time}s'
    audio.export(trimmed_audio_name + '.wav', format="wav")
else:
    trimmed_audio_name = name
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model_text = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
model_text.config.forced_decoder_ids = None
speech_array, original_sampling_rate = torchaudio.load(trimmed_audio_name + '.wav')
# resample the audio to a frequency our audio model can handle
sampling_rate = 16000 # Hz
resampler = torchaudio.transforms.Resample(original_sampling_rate, sampling_rate)
speech = resampler(speech_array.mean(dim=0)).numpy() # mix down to mono, then resample
transcription = '' # init transcription and we'll append to it on each loop
chunk_time = 10 # seconds
for i in range(0, len(speech), sampling_rate*chunk_time):
    last = min(i+sampling_rate*chunk_time, len(speech))
    input_features = processor(speech[i:last], sampling_rate=sampling_rate, return_tensors="pt").input_features
    predicted_ids = model_text.generate(input_features.to(device))
    transcription += processor.batch_decode(predicted_ids, skip_special_tokens=True, normalize=True)[0] + ' '
print(transcription)