Aug 09, 2023
Fine-tune OpenAI's Whisper Automatic Speech Recognition (ASR) model
Written By:
Goran Katalinic
Whisper – the open source automatic speech recognition (ASR) model created by OpenAI – is incredibly powerful out of the box.
It is trained on 680,000 hours of labelled audio data, 117,000 hours of which cover 96 languages other than English, meaning that it can be applied to a wide range of applications with great results.
The vanilla version of Whisper is available to run for inference in a Paperspace Gradient Notebook, powered by 91ÊÓƵAPP IPUs.
There are also good reasons to fine-tune Whisper for a particular use case. This could include accounting for the complex and sometimes subtle differences in speech and vocabulary that come with a particular accent, dialect, or domain-specific terminology.
Some organisations may have large amounts of proprietary audio data that can be used in the fine-tuning process. For others, gathering the audio necessary for fine-tuning is not a trivial undertaking.
Thankfully, there are several open-source speech recognition datasets available, covering multiple languages. These range from large multilingual corpora, such as Common Voice, to smaller datasets covering many more languages and dialects, such as OpenSLR.
In our Paperspace Gradient Notebook, we demonstrate fine-tuning using the Spanish (SLR69) subset of OpenSLR.
Get started by running the Whisper Small Fine Tuning notebook on Paperspace.
For each code block below, you can simply click to run the block in Paperspace, making any modifications to code or parameters where relevant. We explain how to run the process in environments other than Paperspace Gradient Notebooks at the end of this blog.
# Install optimum-graphcore from source
!pip install git+https://github.com/huggingface/optimum-graphcore.git@v0.7.1 "soundfile" "librosa" "evaluate" "jiwer"
%pip install "graphcore-cloud-tools[logger] @ git+https://github.com/graphcore/graphcore-cloud-tools"
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger
import os
n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/whisper"
# Generic imports
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import evaluate
import numpy as np
import torch
from datasets import load_dataset, Audio, Dataset, DatasetDict
# IPU-specific imports
from optimum.graphcore import (
    IPUConfig,
    IPUSeq2SeqTrainer,
    IPUSeq2SeqTrainingArguments,
)
from optimum.graphcore.models.whisper import WhisperProcessorTorch
# HF-related imports
from transformers import WhisperForConditionalGeneration
The dataset we use here consists of crowdsourced recordings of Spanish speakers reading short sentences. 🤗 Datasets enables us to easily download the data and prepare the training and evaluation splits.
If the dataset you choose is gated on the 🤗 Hub, first ensure you have accepted its terms of use; once you have done so, you will have full access to the dataset and be able to download the data locally.
dataset = DatasetDict()
# Load the OpenSLR Spanish (SLR69) data and create an 80/20 train/eval split
split_dataset = Dataset.train_test_split(
    load_dataset("openslr", "SLR69", split="train", token=False), test_size=0.2, seed=0
)
dataset["train"] = split_dataset["train"]
dataset["eval"] = split_dataset["test"]
print(dataset)
The columns of interest are:
- audio: the raw audio samples
- sentence: the corresponding ground truth transcription
We drop the path column.
dataset = dataset.remove_columns(["path"])
Since Whisper was pre-trained on audio sampled at 16 kHz, we must ensure the dataset's audio samples are downsampled accordingly.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
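As a quick, optional sanity check, you can inspect a single example after the cast; the audio is decoded and resampled to 16 kHz when it is accessed.
# Optional check: the audio column is decoded and resampled to 16 kHz on access
example_audio = dataset["train"][0]["audio"]
print(example_audio["sampling_rate"])  # 16000
print(example_audio["array"].shape)    # number of samples in this clip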
We prepare the datasets by extracting features from the raw audio inputs and adding labels, which are simply the transcriptions with some basic processing applied.
The feature extraction is provided by the 🤗 Transformers WhisperFeatureExtractor. To decode generated tokens into text after running the model, we similarly require a tokenizer, WhisperTokenizer. Both of these are wrapped by an instance of WhisperProcessor; here we use the WhisperProcessorTorch variant from Optimum 91ÊÓƵAPP.
MODEL_NAME = "openai/whisper-small"
LANGUAGE = "spanish"
TASK = "transcribe"
MAX_LENGTH = 224
processor = WhisperProcessorTorch.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.max_length = MAX_LENGTH
processor.tokenizer.set_prefix_tokens(language=LANGUAGE, task=TASK)
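To get a feel for the processor, you can round-trip an arbitrary Spanish sentence through the tokenizer. This is purely illustrative and not part of the fine-tuning recipe.
# Illustrative: encode a sentence to token ids and decode it back
ids = processor.tokenizer(text="hola, ¿cómo estás?").input_ids
print(ids)
print(processor.tokenizer.decode(ids, skip_special_tokens=True))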
def prepare_dataset(batch, processor):
    # Compute log-Mel input features from the raw audio array
    inputs = processor.feature_extractor(
        raw_speech=batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
    )
    batch["input_features"] = inputs.input_features[0].astype(np.float16)

    # Tokenize the ground-truth transcription to produce the labels
    transcription = batch["sentence"]
    batch["labels"] = processor.tokenizer(text=transcription).input_ids
    return batch
columns_to_remove = dataset.column_names["train"]
dataset = dataset.map(
    lambda elem: prepare_dataset(elem, processor),
    remove_columns=columns_to_remove,
    num_proc=1,
)
train_dataset = dataset["train"]
eval_dataset = dataset["eval"]
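If you would like to verify the preprocessing, you can inspect one processed example. The shape below is what we expect for Whisper's log-Mel features (80 mel bins over 30 seconds of padded audio); treat this as an illustrative check rather than part of the recipe.
# Illustrative check of one processed example
print(np.array(train_dataset[0]["input_features"]).shape)  # expected: (80, 3000)
print(train_dataset[0]["labels"][:5])  # first few label token ids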
Lastly, we pre-process the labels by padding them with values that will be ignored during fine-tuning. This padding is to ensure tensors of static shape are provided to the model. We do this on the fly via the data collator below.
@dataclass
class DataCollatorSpeechSeq2SeqWithLabelProcessing:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = {}
        batch["input_features"] = torch.tensor([feature["input_features"] for feature in features])

        # Pad the label sequences to a static length and mask the padding with -100
        # so that it is ignored by the loss during fine-tuning
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(
            label_features, return_tensors="pt", padding="longest", pad_to_multiple_of=MAX_LENGTH
        )
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch
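To see what the collator produces, you can collate a couple of prepared examples by hand. This is purely an illustrative check: padded label positions should appear as -100 so that they are ignored by the loss.
# Illustrative check: collate two examples and inspect the padded labels
collator = DataCollatorSpeechSeq2SeqWithLabelProcessing(processor)
sample_batch = collator([train_dataset[0], train_dataset[1]])
print(sample_batch["input_features"].shape)  # (batch, 80, 3000)
print(sample_batch["labels"][0][-5:])        # trailing positions are -100 where padding was applied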
The performance of our fine-tuned model will be evaluated using word error rate (WER).
metric = evaluate.load("wer")
def compute_metrics(pred, tokenizer):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # replace -100 with the pad_token_id
    pred_ids = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    normalized_pred_str = [tokenizer._normalize(pred).strip() for pred in pred_str]
    normalized_label_str = [tokenizer._normalize(label).strip() for label in label_str]

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    normalized_wer = 100 * metric.compute(predictions=normalized_pred_str, references=normalized_label_str)
    return {"wer": wer, "normalized_wer": normalized_wer}
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)
model.config.max_length = MAX_LENGTH
model.generation_config.max_length = MAX_LENGTH
Ensure language-appropriate tokens, if any, are set for generation. We set them on both the config and the generation_config to ensure they are used correctly during generation.
model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language=LANGUAGE, task=TASK
)
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language=LANGUAGE, task=TASK
)
model.generation_config.suppress_tokens = []
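If you want to confirm the prompt that the decoder will be forced to start with, you can print the decoder prompt ids; for Spanish transcription these should include the language and task tokens.
# Optional check: the forced decoder prompt should contain the Spanish language
# token and the transcribe task token
print(processor.tokenizer.get_decoder_prompt_ids(language=LANGUAGE, task=TASK))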
The model can be directly fine-tuned on the IPU using the IPUSeq2SeqTrainer class.
The IPUConfig object specifies how the model will be pipelined across the IPUs.
For fine-tuning, we place the encoder on two IPUs and the decoder on two IPUs.
For inference, the encoder is placed on one IPU and the decoder on a different IPU.
replication_factor = n_ipu // 4
ipu_config = IPUConfig.from_dict(
    {
        "optimizer_state_offchip": True,
        "recompute_checkpoint_every_layer": True,
        "enable_half_partials": True,
        "executable_cache_dir": executable_cache_dir,
        "gradient_accumulation_steps": 16,
        "replication_factor": replication_factor,
        "layers_per_ipu": [5, 7, 5, 7],
        "matmul_proportion": [0.2, 0.2, 0.6, 0.6],
        "projection_serialization_factor": 5,
        "inference_replication_factor": 1,
        "inference_layers_per_ipu": [12, 12],
        "inference_parallelize_kwargs": {
            "use_cache": True,
            "use_encoder_output_buffer": True,
            "on_device_generation_steps": 16,
        },
    }
)
Lastly, we specify the arguments controlling the training process.
total_steps = 1000 // replication_factor
training_args = IPUSeq2SeqTrainingArguments(
    output_dir="./whisper-small-ipu-checkpoints",
    do_train=True,
    do_eval=True,
    predict_with_generate=True,
    learning_rate=1e-5 * replication_factor,
    warmup_steps=total_steps // 4,
    evaluation_strategy="steps",
    eval_steps=total_steps,
    max_steps=total_steps,
    save_strategy="steps",
    save_steps=total_steps,
    logging_steps=25,
    dataloader_num_workers=16,
    dataloader_drop_last=True,
)
Then, we just need to pass all of this, together with our datasets, to the IPUSeq2SeqTrainer class:
trainer = IPUSeq2SeqTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithLabelProcessing(processor),
    compute_metrics=lambda x: compute_metrics(x, processor.tokenizer),
    tokenizer=processor.feature_extractor,
)
To gauge the improvement in WER, we run an evaluation step before fine-tuning.
trainer.evaluate()
All that remains is to fine-tune the model! The fine-tuning process should take between 6 and 18 minutes, depending on how many replicas are used, and achieve a final WER of around 10%.
trainer.train()
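Once fine-tuning has finished, you may want to keep the resulting weights. Below is a minimal sketch using the standard 🤗 Trainer saving API; the output path is just a placeholder.
# Save the fine-tuned model and processor for later use (the path is a placeholder)
trainer.save_model("./whisper-small-es-finetuned")
processor.save_pretrained("./whisper-small-es-finetuned")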
To run the Whisper Small fine-tuning demo using IPU hardware other than in a Paperspace Gradient Notebook, you need to have the Poplar SDK enabled.
Refer to the getting started guide for your system for details on how to enable the Poplar SDK. Also refer to the guide on setting up Jupyter so that you can run this notebook on a remote IPU machine.
In this notebook, we demonstrated how to fine-tune Whisper for multi-lingual speech recognition and transcription on the IPU.
We used a single replica on a total of four IPUs. To reduce the fine-tuning time, more than one replica, and hence more IPUs, would be required. On Paperspace, you can use either an IPU Pod16 or a Bow Pod16, both with 16 IPUs. Please contact 91ÊÓƵAPP if you need assistance running on larger platforms.
Check out our other IPU-powered Jupyter Notebooks to see how IPUs perform on other tasks.
Have a question? Please contact us on our 91ÊÓƵAPP community channel.