Fine-tuning Flan-T5 XXL

Faced with a vast and growing array of large language models (LLMs), users are having to ask themselves an important question - which one delivers the right balance of performance and efficiency?

Ultra large models such as ChatGPT are typically larger because they carry more world knowledge and offer enhanced performance in free text generation. This can be useful in AI Assistants or chatbots, but does not make them the most efficient solution for every task.

We can achieve state-of-the-art (SOTA) performance for a wide range of NLP applications using smaller, more cost-effective model such as Flan-T5.

We covered the benefits of Flan-T5 in its Large and XL guises when we launched inference notebooks on Paperspace.

Now, we’re pleased to introduce fine-tuning of Flan-T5 XXL (and XL) for the Graphcore IPU. By fine-tuning this 11bn parameter version of Flan-T5, developers and organizations can optimise performance for their specific NLP workload.

And because Flan-T5 XXL and its pretrained weights are open-source and freely available to download, it can be modified for commercial use without licensing restrictions.

Flan-T5 XXL, as well as its smaller 3bn parameter relative Flan-T5 XL, can be fine-tuned and run on any Graphcore system from IPU Pod₁₆ upwards, using Paperspace Gradient Notebooks.

We are also making available inference notebooks for both sizes of Flan-T5.

Flan-T5 XXL can be run on a minimum IPU-Pod₁₆, while Flan-T5 XL inference will run on an IPU-Pod4 (six hour free trial available from Paperspace).

Performance

is an encoder-decoder transformer model that reframes all NLP tasks into a text-to-text format. Compared to T5, Flan-T5 has been fine-tuned on more than 1000 additional tasks.

By looking at its performance on the Multi-task Language Understanding (MMLU) benchmark, we can see that it is competitive with much larger models.

T5 table (1)

Part of the MMLU leaderboard from

For a deeper analysis of Flan-T5 and its performance on various NLP tasks, check out our other blogs Flan-T5‌: sweet results with the smaller, more efficient LLM and Running Flan-T5 XL in inference in float16 for IPU – How we did it.

Language models are powerful because a huge variety of tasks can be formulated as text-to-text problems and thus adapted to fit the generative setup, where the model is asked to predict future tokens. For more details, check out the T5 paper , and the Flan-T5 paper .

In this blog post and we apply the idea from the T5 paper and fine-tune Flan-T5 on the task of textual entailment with the dataset. We will also see how you can easily adapt this example for custom fine-tuning of several downstream tasks.

Note: the notebook supports both Flan-T5 XXL and Flan-T5 XL, but the code snippets in this blog post refer to the XXL model.

Dataset

The MNLI dataset consists of pairs of sentences, a premise and a hypothesis. The task is to predict the relation between the premise and the hypothesis, which can be:

entailment: hypothesis follows from the premise,
contradiction: hypothesis contradicts the premise,
neutral: hypothesis and premise are unrelated.

The data splits for the MNLI dataset are the following:

Train split: 392’702 examples
Validation matched split: 9’815 examples
Validation mismatched split: 9’832 examples

The matched split is made of samples derived from the same sources as those in the training set, and samples in the mismatched split are derived from different sources to those in the training set, and therefore don't resemble the examples seen at training time. For validation, we're going to use the latter.

You can explore the on Hugging Face.

MNLI As mentioned, T5 has an encoder-decoder architecture, so it needs two input sequences: one for the encoder and one for the decoder. We form input prompts for the encoder with the format:


mnli hypothesis: {hypothesis} premise: {premise}

We provide the decoder with the corresponding label, shifted right and prepended with the <pad> token:


<pad>{label}

For example, an encoder sequence would be:


mnli hypothesis: Product and geography are what make cream skimming work. premise: Conceptually cream skimming has two basic dimensions - product and geography.

Similarly, an example decoder sequence would be:


<pad>neutral

The pad token acts as decoder_start_token_id for the T5 models.

Then, the encoder and decoder sequences are tokenized and padded to the model sequence length of 512.

Since the model is trained to predict the MNLI class, the labels are simply the decoder input sequence shifted by one token to the left, which means that the labels will simply be the MNLI class, without the pad token at the beginning.

Customise configuration

If we wish, we can customise some of the parameters for fine-tuning and validation.

For fine-tuning, we can change the number of training steps, the learning rate, the optimizer parameters, and parameters related to the periodic checkpointing.

If you have enough IPUs available, you can speed up training by using data parallelism. T5-XXL needs 16 IPUs, so if you have 64 IPUs you can set data parallelism to 4. Similarly, the XL variant needs 8 IPUs, so if you have 16 IPUs you can set data parallelism to 2.

For validation, you can control the maximum number of tokens generated by the model. You can reduce the sequence length to account for the maximum length of sentences encountered in the validation dataset, which might allow us to increase the batch size.

Create a T5 Trainer

We use Google’s Flan-T5 checkpoint available on Hugging Face. These weights are the starting point for our fine-tuning.


pretrained = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl")

We are ready to create the training session with the help of the T5Trainer class.


trainer = T5Trainer( 
    config, 
    pretrained, 
    dataset, 
    eval_dataset, 
    eval_config, 
    tokenizer, 
    accuracy_metric, 
    postprocess_mnli_predictions, 
)

Run fine-tuning

We can now run fine-tuning with the following line:


trainer.train()

Run validation

Finally, we validate our model on the validation_mismatched split of the MNLI dataset. The resulting model should achieve an accuracy of about 87% when fine-tuned for 500 steps.


trainer.evaluate()

Convert to Hugging Face checkpoint

You can save the fine-tuned weights so that they can be uploaded to Hugging Face.


finetuned = trainer.save_hf_checkpoint(hf_checkpoint_path, ckpt_path)

Run the model with Hugging Face pipeline

The same model can later be used with the standard Hugging Face pipeline on any hardware.


from transformers import pipeline 
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xxl") 
hf_model = T5ForConditionalGeneration.from_pretrained(hf_checkpoint_path) 
generator = pipeline("text2text-generation", model=hf_model, tokenizer=tokenizer) 
prompt = ( 
    "mnli hypothesis: Your contributions were of no help with our students' education. " 
    "premise: Your contribution helped make it possible for us to provide our students with a quality education." 
) 
out = generator(prompt)

Output:


[{'generated_text': ' contradiction'}]

Conclusion

Flan-T5 XXL is easy to fine-tune on IPUs on Paperspace and is applicable to a lot of NLP applications. The model is able to match the performance of larger models in various NLP tasks, at a fraction of the cost. Flan-T5 XXL can be further fine-tuned to achieve SOTA on a given application. Therefore, it stands in an optimal position with regards to the performance and cost trade-off of LLMs.

The runs on a Graphcore Pod₁₆, which is a low-cost cloud instance and is the best starting point for Flan-T5 exploration.