Aug 10, 2022
Deep dive: vision transformers on Hugging Face Optimum 91ÊÓƵAPP
Written By:
Tim Santos (91ÊÓƵAPP), Julien Simon (Hugging Face)
This blog post will show how easy it is to fine-tune pre-trained Transformer models for your dataset using the Hugging Face Optimum library on 91ÊÓƵAPP Intelligence Processing Units (IPUs). As an example, we will show a step-by-step guide and provide a notebook that takes a large, widely-used chest X-ray dataset and trains a vision transformer (ViT) model.
In 2017 a group of Google AI researchers published a paper introducing the transformer model architecture. Characterised by a novel self-attention mechanism, transformers were proposed as a new and efficient group of models for language applications. Indeed, in the last five years, transformers have seen explosive popularity and are now accepted as the de facto standard for natural language processing (NLP).
Transformers for language are perhaps most notably represented by the rapidly evolving GPT and BERT model families. Both can run easily and efficiently on 91ÊÓƵAPP IPUs as part of the growing Hugging Face Optimum 91ÊÓƵAPP library.
An in-depth explainer about the transformer model architecture (with a focus on NLP) can be found here.
While transformers have seen initial success in language, they are extremely versatile and can be used for a range of other purposes including computer vision (CV), as we will cover in this blog post.
CV is an area where convolutional neural networks (CNNs) are without doubt the most popular architecture. However, the vision transformer (ViT) architecture, first introduced in a paper from Google Research, represents a breakthrough in image recognition and uses the same self-attention mechanism as BERT and GPT as its main component.
Whereas BERT and other transformer-based language processing models take a sentence (i.e., a list of words) as input, ViT models divide an input image into several small patches, equivalent to individual words in language processing. Each patch is linearly encoded by the transformer model into a vector representation that can be processed individually. This approach of splitting images into patches, or visual tokens, stands in contrast to the pixel arrays used by CNNs.
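To make this concrete, a 224x224 image split into 16x16 patches yields 196 visual tokens, each flattened into a vector of dimension 3 x 16 x 16 = 768. The following minimal sketch (plain PyTorch, with hypothetical variable names) illustrates the patch splitting step only, not the full ViT embedding:

```python
import torch

# A dummy batch containing one RGB image of 224x224 pixels (illustrative values)
image = torch.randn(1, 3, 224, 224)
patch_size = 16

# Cut the image into non-overlapping 16x16 patches along height and width
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Reshape into a sequence of flattened patches: (batch, num_patches, patch_dim)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)

print(patches.shape)  # torch.Size([1, 196, 768]): 196 "visual tokens" of dimension 768
```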
Thanks to pre-training, the ViT model learns an inner representation of images that can then be used to extract visual features useful for downstream tasks. For instance, you can train a classifier on a new dataset of labelled images by placing a linear layer on top of the pre-trained visual encoder. One typically places a linear layer on top of the [CLS] token, as the last hidden state of this token can be seen as a representation of an entire image.
Compared to CNNs, ViT models have displayed higher recognition accuracy at lower computational cost, and are applied to a range of tasks including image classification, object detection, and segmentation. In the healthcare domain alone, use cases include detection and classification for a wide range of diseases and imaging modalities.
91ÊÓƵAPP IPUs are particularly well-suited to ViT models due to their ability to parallelise training using a combination of data pipelining and model parallelism. Accelerating this massively parallel process is made possible through IPU’s MIMD architecture and its scale-out solution centred on the IPU-Fabric.
Pipeline parallelism increases the batch size that can be processed per instance of data parallelism, improves the access efficiency of the memory handled by each IPU, and reduces the communication time for parameter aggregation during data-parallel training.
Thanks to the addition of a range of pre-optimized transformer models to the open-source Hugging Face Optimum 91ÊÓƵAPP library, it’s incredibly easy to achieve a high degree of performance and efficiency when running and fine-tuning models such as ViT on IPUs.
Through Hugging Face Optimum, 91ÊÓƵAPP has released ready-to-use IPU-trained model checkpoints and configuration files to make it easy to train models with maximum efficiency. This is particularly helpful since ViT models generally require pre-training on a large amount of data. This integration lets you use the checkpoints released by the original authors themselves within the Hugging Face model hub, so you won’t have to train them yourself. By letting users plug and play any public dataset, Optimum shortens the overall development lifecycle of AI models and allows seamless integration with 91ÊÓƵAPP’s state-of-the-art hardware, giving a quicker time-to-value.
For this blog post, we will use a ViT model pre-trained on ImageNet-21k, based on the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al. As an example, we will show you the process of using Optimum to fine-tune ViT on the NIH Chest X-ray dataset.
As with all medical imaging tasks, radiologists spend many years learning to reliably and efficiently detect problems and make tentative diagnoses on the basis of X-ray images. To a large degree, this difficulty arises from the very minute differences and spatial limitations of the images, which is why computer-aided detection and diagnosis (CAD) techniques have shown such great potential for impact in improving clinician workflows and patient outcomes.
At the same time, developing any model for X-ray classification (ViT or otherwise) will entail its fair share of challenges.
As mentioned above, for the purpose of our demonstration using Hugging Face Optimum, we don’t need to train ViT from scratch. Instead, we will use model weights hosted on the Hugging Face model hub.
As an X-ray image can have multiple diseases, we will work with a multi-label classification model. The model in question uses the google/vit-base-patch16-224-in21k checkpoint. It has been converted from the original repository and pre-trained on 14 million images from ImageNet-21k. In order to parallelise and optimise the job for the IPU, a configuration has been made available through the 91ÊÓƵAPP organisation on the Hugging Face Hub as 91ÊÓƵAPP/vit-base-ipu.
If this is your first time using IPUs, read our introductory IPU documentation to learn the basic concepts. To run your own PyTorch model on the IPU, see our PyTorch tutorials, and learn how to use Optimum through our tutorial notebooks.
First, we need to download the National Institutes of Health (NIH) Clinical Center’s Chest X-ray dataset. This dataset contains 112,120 deidentified frontal-view X-rays from 30,805 patients over a period from 1992 to 2015. The dataset covers 14 common diseases, based on labels mined from the text of radiology reports using NLP techniques.
Here are the requirements to run this walkthrough:
The 91ÊÓƵAPP Tutorials repository contains the step-by-step tutorial notebook and Python script discussed in this guide. Clone the repository and launch the walkthrough.ipynb notebook found in ///vit_model_training/.
We’ve even made it easier by creating an HF Optimum Gradient runtime, so you can launch the getting started tutorial on free IPUs in Paperspace. Sign up and launch the runtime:
Download the /images directory. You can use the following bash command to extract the files:
for f in images*.tar.gz; do tar xfz "$f"; done
Next, download the Data_Entry_2017_v2020.csv file, which contains the labels. By default, the tutorial expects the /images folder and the .csv file to be in the same folder as the script being run.
Once your Jupyter environment has the datasets, you need to install and import the latest Hugging Face Optimum 91ÊÓƵAPP package and the other dependencies listed in requirements.txt:
%pip install -r requirements.txt
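Once installed, the classes used throughout the rest of this walkthrough can be imported. A minimal sketch of the imports (module paths as exposed by optimum-graphcore and transformers at the time of writing):

```python
import torch
from datasets import load_dataset
from transformers import AutoFeatureExtractor, ViTForImageClassification
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments
```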
The examinations contained in the Chest X-ray dataset consist of X-ray images (greyscale, 224x224 pixels) with corresponding metadata: Finding Labels, Follow-up #, Patient ID, Patient Age, Patient Gender, View Position, OriginalImage[Width Height] and OriginalImagePixelSpacing[x y].
Next, in the "Getting the dataset" section of the notebook, we define the locations of the downloaded images and of the file containing the labels:
We are going to train the 91ÊÓƵAPP Optimum ViT model to predict diseases (defined by the "Finding Labels" field) from the images. "Finding Labels" can contain any number of the 14 diseases, or a "No Finding" label, which indicates that no disease was detected. To be compatible with the Hugging Face library, the text labels need to be transformed into N-hot encoded arrays representing the multiple labels needed to classify each image. An N-hot encoded array represents the labels as a list of booleans: true if a label corresponds to the image and false if not.
First we identify the unique labels in the dataset.
Now we transform the labels into N-hot encoded arrays:
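A rough sketch of both steps, assuming the label CSV has been loaded into a pandas DataFrame named data (a hypothetical name; the notebook's own code may differ):

```python
import numpy as np
import pandas as pd

# Hypothetical path: the label file downloaded earlier
data = pd.read_csv("Data_Entry_2017_v2020.csv")

# Each "Finding Labels" entry is a '|'-separated list of diseases (or "No Finding")
unique_labels = sorted({label for row in data["Finding Labels"] for label in row.split("|")})
label_index = {label: i for i, label in enumerate(unique_labels)}

def to_n_hot(finding_labels: str) -> list:
    """Convert a '|'-separated label string into an N-hot boolean list."""
    encoded = np.zeros(len(unique_labels), dtype=bool)
    for label in finding_labels.split("|"):
        encoded[label_index[label]] = True
    return encoded.tolist()

data["labels"] = data["Finding Labels"].apply(to_n_hot)
```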
When loading data using the datasets.load_dataset function, labels can be provided either by having folders for each of the labels or by having a metadata.jsonl file (see the Hugging Face datasets documentation for both approaches). As the images in this dataset can have multiple labels, we have chosen to use a metadata.jsonl file. We write the image file names and their associated labels to this file.
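A minimal sketch of writing this file, assuming the DataFrame from the previous step and an images/ folder (hypothetical names, with "Image Index" assumed to be the file-name column of the CSV):

```python
import json

# One JSON record per image, so datasets.load_dataset("imagefolder", ...) can pick up the labels
with open("images/metadata.jsonl", "w") as f:
    for _, row in data.iterrows():
        record = {"file_name": row["Image Index"], "labels": row["labels"]}
        f.write(json.dumps(record) + "\n")
```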
We are now ready to create the PyTorch dataset and split it into training and validation sets. This step converts the dataset to the Arrow file format, which allows data to be loaded quickly during training and validation. Because the entire dataset is being loaded and pre-processed, it can take a few minutes.
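A sketch of this step, assuming the images and the metadata.jsonl file sit together in an images/ folder (hypothetical path) and using an illustrative 90/10 split:

```python
from datasets import load_dataset

# Load the image folder; metadata.jsonl provides the "labels" column
dataset = load_dataset("imagefolder", data_dir="images")

# Split into training and validation sets (illustrative 90/10 split)
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_dataset, val_dataset = split["train"], split["test"]
```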
We are going to import the ViT model from the checkpoint google/vit-base-patch16-224-in21k. The checkpoint is a standard model hosted by Hugging Face and is not managed by 91ÊÓƵAPP.
To fine-tune a pre-trained model, the new dataset must have the same properties as the original dataset used for pre-training. In Hugging Face, the original dataset information is provided in a config file loaded using the AutoFeatureExtractor. For this model, the X-ray images are resized to the correct resolution (224x224), converted from greyscale to RGB, and normalised across the RGB channels with a mean of (0.5, 0.5, 0.5) and a standard deviation of (0.5, 0.5, 0.5).
For the model to run efficiently, images need to be batched. To do this, we define the vit_data_collator function that returns batches of images and labels in a dictionary, following the default_data_collator pattern used in Transformers.
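A minimal sketch of such a collator:

```python
import torch

def vit_data_collator(batch):
    # Stack pixel values and labels from a list of samples into a single batch dictionary
    return {
        "pixel_values": torch.stack([example["pixel_values"] for example in batch]),
        "labels": torch.tensor([example["labels"] for example in batch], dtype=torch.float),
    }
```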
To examine the dataset, we display the first 10 rows of metadata.
Let's also plot some images from the validation set with their associated labels.
Our dataset is now ready to be used.
To train a model on the IPU we need to import it from the Hugging Face Hub and define a trainer using the IPUTrainer class. The IPUTrainer class takes the same arguments as the original Transformers Trainer and works in tandem with the IPUConfig object, which specifies the behaviour for compilation and execution on the IPU.
Now we import the ViT model from Hugging Face.
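A sketch of the import, configured for multi-label classification; the num_labels value assumes the unique_labels list built in the earlier sketch:

```python
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(unique_labels),             # 14 diseases plus "No Finding"
    problem_type="multi_label_classification",  # use a BCE-style loss over N-hot labels
)
```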
To use this model on the IPU we need to load the IPU configuration, IPUConfig, which gives control over all the parameters specific to 91ÊÓƵAPP IPUs (existing IPU configs are available on the Hugging Face Hub). We are going to use 91ÊÓƵAPP/vit-base-ipu.
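Loading it is a one-liner:

```python
from optimum.graphcore import IPUConfig

ipu_config = IPUConfig.from_pretrained("91ÊÓƵAPP/vit-base-ipu")
```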
Let's set our training hyperparameters using IPUTrainingArguments. This subclasses the Hugging Face TrainingArguments class, adding parameters specific to the IPU and its execution characteristics.
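A sketch with illustrative values; these are not the tuned hyperparameters from the tutorial:

```python
from optimum.graphcore import IPUTrainingArguments

training_args = IPUTrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=64,   # illustrative value
    num_train_epochs=3,
    learning_rate=3e-4,               # illustrative value
    warmup_ratio=0.25,                # 25% warm-up, as described below
    lr_scheduler_type="cosine",
    logging_steps=50,
    dataloader_drop_last=True,
)
```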
The performance of multi-label classification models can be assessed using the area under the ROC (receiver operating characteristic) curve (AUC_ROC). The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) of each class at different threshold values, and the AUC_ROC summarises it as a single number. This is a commonly used performance metric for multi-label classification tasks because it is insensitive to class imbalance and easy to interpret.
For this dataset, the AUC_ROC score represents the ability of the model to separate the different diseases. A score of 0.5 means the model separates positive and negative cases no better than chance, while a score of 1 means it can perfectly separate the diseases. This metric is not available as a built-in in Hugging Face Datasets, so we need to implement it ourselves. The Datasets package allows custom metric calculation through the load_metric() function. We define a compute_metrics function and expose it to the Transformers evaluation function, just like the other metrics supported through the Datasets package. The compute_metrics function takes the labels predicted by the ViT model and computes the area under the ROC curve. It receives an EvalPrediction object (a named tuple with predictions and label_ids fields) and has to return a dictionary mapping strings to floats.
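A sketch of such a function using scikit-learn's roc_auc_score, which is our own choice of implementation; the tutorial may compute the metric differently:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from transformers import EvalPrediction

def compute_metrics(eval_pred: EvalPrediction) -> dict:
    # Apply a sigmoid to the raw logits to obtain per-label probabilities
    probabilities = 1 / (1 + np.exp(-eval_pred.predictions))
    # Macro-average the AUC over all labels (assumes each label appears in the validation set)
    auc = roc_auc_score(eval_pred.label_ids, probabilities, average="macro")
    return {"roc_auc": auc}
```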
To train the model, we define a trainer using the IPUTrainer class, which takes care of compiling the model to run on IPUs and of performing training and evaluation. The IPUTrainer class works just like the Hugging Face Trainer class, but takes the additional ipu_config argument.
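Putting the pieces together, with names carried over from the earlier sketches:

```python
from optimum.graphcore import IPUTrainer

trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=vit_data_collator,
    compute_metrics=compute_metrics,
)
```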
To accelerate training we will load the last checkpoint if it exists.
Now we are ready to train.
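A sketch of both steps, using Transformers' get_last_checkpoint helper to resume if a previous run left a checkpoint in the output directory:

```python
import os
from transformers.trainer_utils import get_last_checkpoint

# Resume from the latest checkpoint in the output directory, if one exists
last_checkpoint = None
if os.path.isdir(training_args.output_dir):
    last_checkpoint = get_last_checkpoint(training_args.output_dir)

trainer.train(resume_from_checkpoint=last_checkpoint)
```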
Now that we have completed the training, we can format and plot the trainer output to evaluate the training behaviour.
We plot the training loss and the learning rate.
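One way to do this, sketched with pandas and matplotlib using the values logged in trainer.state.log_history (assumes logging was enabled during training):

```python
import pandas as pd
import matplotlib.pyplot as plt

# The trainer logs loss and learning rate every `logging_steps` steps
logs = pd.DataFrame(trainer.state.log_history)
train_logs = logs.dropna(subset=["loss"])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
train_logs.plot(x="step", y="loss", ax=ax1, title="Training loss", legend=False)
train_logs.plot(x="step", y="learning_rate", ax=ax2, title="Learning rate", legend=False)
plt.show()
```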
The loss curve shows a rapid reduction in the loss at the start of training before stabilising around 0.1, showing that the model is learning. The learning rate increases through the warm-up of 25% of the training period, before following a cosine decay.
Now that we have trained the model, we can evaluate its ability to predict the labels of unseen data using the validation dataset.
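Evaluation is a single call; the metric key below reflects the compute_metrics sketch above:

```python
metrics = trainer.evaluate()
print(metrics)  # includes "eval_roc_auc" computed by compute_metrics
```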
The metrics show the validation AUC_ROC score the tutorial achieves after 3 epochs.
There are several directions to explore to improve the accuracy of the model including longer training. The validation performance might also be improved through changing optimisers, learning rate, learning rate schedule, loss scaling, or using auto-loss scaling.
In this post, we have introduced ViT models and have provided a tutorial for training a Hugging Face Optimum model on the IPU using a local dataset.
The entire process outlined above can now be run end-to-end within minutes for free, thanks to 91ÊÓƵAPP’s new partnership with Paperspace. Launching today, the service will provide access to a selection of Hugging Face Optimum models powered by 91ÊÓƵAPP IPUs within Gradient—Paperspace’s web-based Jupyter notebooks.
If you’re interested in trying Hugging Face Optimum with IPUs on Paperspace Gradient, including ViT, BERT, RoBERTa and more, you can sign up now and find a getting started guide here.
This deep dive would not have been possible without extensive support, guidance, and insights from Eva Woodbridge, James Briggs, Jinchen Ge, Alexandre Payot, Thorin Farnsworth, and all others contributing from 91ÊÓƵAPP, as well as Jeff Boudier, Julien Simon, and Michael Benayoun from Hugging Face.