# Hugging Face trainer integration
This notebook shows how to integrate Stained Glass Core with the Hugging Face Trainer by finetuning a DistilBERT model on extractive question answering. It is adapted from the official Hugging Face question answering tutorial.
Before you begin, make sure you have all the necessary libraries installed:

%pip install transformers datasets
Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:
- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.
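For intuition, here is a minimal sketch of extractive question answering using the 🤗 Transformers pipeline API (the checkpoint name below is just one example of a question answering model):

from transformers import pipeline

# A question-answering pipeline extracts an answer span from the supplied context.
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and most populous city of France.",
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'Paris'}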
This guide will show you how to:
- Finetune DistilBERT on the SQuAD dataset for extractive question answering.
- Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures:

ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BLOOM, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, FlauBERT, FNet, Funnel Transformer, OpenAI GPT-2, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLMv2, LayoutLMv3, LED, LiLT, Longformer, LUKE, LXMERT, MarkupLM, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MVP, Nezha, Nyströmformer, OPT, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, Splinter, SqueezeBERT, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO
## Load SQuAD dataset
Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
from datasets import load_dataset
squad = load_dataset("squad", split="train[:10]")
Split the dataset's `train` split into a train and test set with the `train_test_split` method:
squad = squad.train_test_split(test_size=0.2)
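If you want to confirm the split sizes before continuing, you can print the resulting DatasetDict (a quick optional check):

# The 10 loaded examples are split 80/20 into "train" and "test" splits.
print(squad)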
Then take a look at an example:
squad["train"][0]
{'id': '5733bf84d058e614000b61c1', 'title': 'University_of_Notre_Dame', 'context': "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a liberal newspaper, Common Sense was published. Likewise, in 2003, when other students believed that the paper showed a liberal bias, the conservative paper Irish Rover went into production. Neither paper is published as often as The Observer; however, all three are distributed to all students. Finally, in Spring 2008 an undergraduate journal for political science research, Beyond Politics, made its debut.", 'question': 'In what year did the student paper Common Sense begin publication at Notre Dame?', 'answers': {'text': ['1987'], 'answer_start': [908]}}
There are several important fields here:

- `answers`: the starting character position of the answer within the context, and the answer text.
- `context`: background information from which the model needs to extract the answer.
- `question`: the question the model should answer.
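As a quick sanity check on the `answers` field, you can verify that the recorded character offset really points at the answer text inside the `context` (a small optional snippet):

example = squad["train"][0]
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
# The answer text appears verbatim at the recorded character offset in the context.
assert example["context"][start : start + len(text)] == text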
## Preprocess
The next step is to load a DistilBERT tokenizer to process the `question` and `context` fields:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-distilled-squad"
)
There are a few preprocessing steps particular to question answering tasks you should be aware of:

- Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
- Next, map the start and end character positions of the answer to the original `context` by setting `return_offsets_mapping=True`.
- With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offsets corresponds to the `question` and which corresponds to the `context` (see the short sketch after this list).
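To make the offset mapping and `sequence_ids` concrete, here is a small illustrative sketch on a made-up question/context pair (the strings are arbitrary examples, not from SQuAD):

encoded = tokenizer(
    "Who wrote the report?",
    "The report was written by the engineering team in 2021.",
    truncation="only_second",
    return_offsets_mapping=True,
)
# sequence_ids(): None for special tokens, 0 for question tokens, 1 for context tokens.
print(encoded.sequence_ids())
# offset_mapping: the (start_char, end_char) span of each token in its original string.
print(encoded["offset_mapping"])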
Here is how you can create a function to truncate and map the start and end tokens of the answer to the `context`:
def preprocess_function(examples):
    """Preprocess examples."""
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if (
            offset[context_start][0] > end_char
            or offset[context_end][1] < start_char
        ):
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
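Before mapping the function over the whole dataset, you can try it on a couple of examples and inspect the labeled token positions (an optional check; slicing a 🤗 Dataset returns a dict of lists, which is what the function expects):

sample = preprocess_function(squad["train"][:2])
# Token indices of the answer span for each of the two examples.
print(sample["start_positions"], sample["end_positions"])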
To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need:
tokenized_squad = squad.map(
    preprocess_function,
    batched=True,
    remove_columns=squad["train"].column_names,
)
Map: 0%| | 0/8 [00:00<?, ? examples/s]
Map: 0%| | 0/2 [00:00<?, ? examples/s]
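If you'd like to confirm what the processed dataset now contains, inspect its columns (optional):

# Only the tokenizer outputs and the answer span labels remain after remove_columns.
print(tokenized_squad["train"].column_names)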
Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply any additional preprocessing such as padding.
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()
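To see what the collator produces, you can assemble a small batch by hand (a quick illustrative check; the collator simply stacks the already-padded features into tensors):

features = [tokenized_squad["train"][i] for i in range(2)]
batch = data_collator(features)
# Each value is a tensor, e.g. input_ids with shape (batch_size, sequence_length).
print({name: tensor.shape for name, tensor in batch.items()})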
## Train
If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial in the 🤗 Transformers documentation.
You're ready to start training your model now! Load DistilBERT with AutoModelForQuestionAnswering:
from transformers import AutoModelForQuestionAnswering, TrainingArguments
model = AutoModelForQuestionAnswering.from_pretrained(
    "distilbert-base-uncased-distilled-squad"
)
###############################################
######### BEGIN STAINED GLASS CHANGES #########
###############################################
import stainedglass_core.utils.torch
from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
input_shape = (-1, tokenizer.model_max_length)
model = sg_model.NoisyTransformerModel(
    sg_noise_layer.CloakNoiseLayerOneShot,
    model,
    input_shape,
    target_layer="distilbert.embeddings",
    percent_to_mask=0.5,
    scale=(1e-3, 1.0),
)
# By default, both the base model and the noise layer are trainable.
# Here, the base model is frozen so that only the noise layer is trained:
stainedglass_core.utils.torch.freeze(model.base_model)
# To instead freeze the noise layer (and only train the base model), use:
# stainedglass_core.utils.torch.freeze(model.noise_layer)
###############################################
########## END STAINED GLASS CHANGES ##########
###############################################
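As a quick check that the freeze took effect, you can count the parameters that still require gradients. This uses only standard PyTorch and assumes the wrapped model behaves like a regular torch.nn.Module:

# After freezing the base model, only the noise layer's parameters should require gradients.
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"{len(trainable)} trainable parameter tensors remain")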
At this point, only three steps remain:

- Define your training hyperparameters in TrainingArguments. The only required parameter is `output_dir`, which specifies where to save your model. You can push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
- Call train() to finetune your model.
NUM_TRAIN_EPOCHS = 2
import stainedglass_core.huggingface
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    ###############################################
    ######### BEGIN STAINED GLASS CHANGES #########
    ###############################################
    # Weight decay interacts with Stained Glass Transform training, so it is disabled here.
    weight_decay=0.00,
    ###############################################
    ########## END STAINED GLASS CHANGES ##########
    ###############################################
    report_to="none",
)
###############################################
######### BEGIN STAINED GLASS CHANGES #########
###############################################
# Use the StainedGlassTrainer instead of the Trainer
trainer = stainedglass_core.huggingface.StainedGlassTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    alpha=0.1,  # Must specify an alpha value
)
###############################################
########## END STAINED GLASS CHANGES ##########
###############################################
trainer.train()
0%| | 0/2 [00:00<?, ?it/s]
0%| | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.1793955564498901, 'eval_runtime': 0.1548, 'eval_samples_per_second': 12.919, 'eval_steps_per_second': 6.459, 'epoch': 1.0}
0%| | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.18039071559906, 'eval_runtime': 0.1049, 'eval_samples_per_second': 19.059, 'eval_steps_per_second': 9.529, 'epoch': 2.0}
{'train_runtime': 2.4033, 'train_samples_per_second': 6.657, 'train_steps_per_second': 0.832, 'train_loss': 0.8599898815155029, 'epoch': 2.0}
TrainOutput(global_step=2, training_loss=0.8599898815155029, metrics={'train_runtime': 2.4033, 'train_samples_per_second': 6.657, 'train_steps_per_second': 0.832, 'train_loss': 0.8599898815155029, 'epoch': 2.0})