# Hugging Face trainer integration
This notebook shows how to integrate Stained Glass Core with the Hugging Face Trainer by finetuning a DistilBERT model on extractive question answering. It is adapted from the official Hugging Face question answering tutorial.
Before you begin, make sure you have all the necessary libraries installed:

%pip install transformers datasets
Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks:
- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.
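For intuition, here is a minimal sketch of extractive question answering using the 🤗 Transformers pipeline API (the checkpoint name below is just one example of a question answering model):

from transformers import pipeline

# A question-answering pipeline extracts an answer span from the supplied context.
qa = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and most populous city of France.",
)
print(result)  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'Paris'}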
This guide will show you how to:
- Finetune DistilBERT on the SQuAD dataset for extractive question answering.
- Use your finetuned model for inference.
The task illustrated in this tutorial is supported by the following model architectures:

ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BLOOM, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, FlauBERT, FNet, Funnel Transformer, OpenAI GPT-2, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLMv2, LayoutLMv3, LED, LiLT, Longformer, LUKE, LXMERT, MarkupLM, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MVP, Nezha, Nyströmformer, OPT, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, Splinter, SqueezeBERT, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO
## Load SQuAD dataset
Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
from datasets import load_dataset
squad = load_dataset("squad", split="train[:10]")
Split the dataset's `train` split into a train and test set with the `train_test_split` method:
squad = squad.train_test_split(test_size=0.2)
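If you want to confirm the split sizes before continuing, you can print the resulting DatasetDict (a quick optional check):

# The 10 loaded examples are split 80/20 into "train" and "test" splits.
print(squad)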
Then take a look at an example:
squad["train"][0]
{'id': '5733bf84d058e614000b61c1', 'title': 'University_of_Notre_Dame', 'context': "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a radio and television station, and several magazines and journals. Begun as a one-page journal in September 1876, the Scholastic magazine is issued twice monthly and claims to be the oldest continuous collegiate publication in the United States. The other magazine, The Juggler, is released twice a year and focuses on student literature and artwork. The Dome yearbook is published annually. The newspapers have varying publication interests, with The Observer published daily and mainly reporting university and other news, and staffed by students from both Notre Dame and Saint Mary's College. Unlike Scholastic and The Dome, The Observer is an independent publication and does not have a faculty advisor or any editorial oversight from the University. In 1987, when some students believed that The Observer began to show a conservative bias, a liberal newspaper, Common Sense was published. Likewise, in 2003, when other students believed that the paper showed a liberal bias, the conservative paper Irish Rover went into production. Neither paper is published as often as The Observer; however, all three are distributed to all students. Finally, in Spring 2008 an undergraduate journal for political science research, Beyond Politics, made its debut.", 'question': 'In what year did the student paper Common Sense begin publication at Notre Dame?', 'answers': {'text': ['1987'], 'answer_start': [908]}}
There are several important fields here:

- `answers`: the starting character position of the answer within the context, and the answer text.
- `context`: background information from which the model needs to extract the answer.
- `question`: the question the model should answer.
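As a quick sanity check on the `answers` field, you can verify that the recorded character offset really points at the answer text inside the `context` (a small optional snippet):

example = squad["train"][0]
start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
# The answer text appears verbatim at the recorded character offset in the context.
assert example["context"][start : start + len(text)] == text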
## Preprocess
The next step is to load a DistilBERT tokenizer to process the `question` and `context` fields:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert-base-uncased-distilled-squad"
)
There are a few preprocessing steps particular to question answering tasks you should be aware of:

- Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
- Next, map the start and end character positions of the answer to the original `context` by setting `return_offsets_mapping=True`.
- With the mapping in hand, you can find the start and end tokens of the answer. Use the `sequence_ids` method to find which part of the offsets corresponds to the `question` and which corresponds to the `context` (see the short sketch after this list).
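To make the offset mapping and `sequence_ids` concrete, here is a small illustrative sketch on a made-up question/context pair (the strings are arbitrary examples, not from SQuAD):

encoded = tokenizer(
    "Who wrote the report?",
    "The report was written by the engineering team in 2021.",
    truncation="only_second",
    return_offsets_mapping=True,
)
# sequence_ids(): None for special tokens, 0 for question tokens, 1 for context tokens.
print(encoded.sequence_ids())
# offset_mapping: the (start_char, end_char) span of each token in its original string.
print(encoded["offset_mapping"])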
Here is how you can create a function to truncate and map the start and end tokens of the answer to the `context`:
def preprocess_function(examples):
    """Preprocess examples."""
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if (
            offset[context_start][0] > end_char
            or offset[context_end][1] < start_char
        ):
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
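Before mapping the function over the whole dataset, you can try it on a couple of examples and inspect the labeled token positions (an optional check; slicing a 🤗 Dataset returns a dict of lists, which is what the function expects):

sample = preprocess_function(squad["train"][:2])
# Token indices of the answer span for each of the two examples.
print(sample["start_positions"], sample["end_positions"])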
To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need:
tokenized_squad = squad.map(
    preprocess_function,
    batched=True,
    remove_columns=squad["train"].column_names,
)
Map: 0%| | 0/8 [00:00<?, ? examples/s]
Map: 0%| | 0/2 [00:00<?, ? examples/s]
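If you'd like to confirm what the processed dataset now contains, inspect its columns (optional):

# Only the tokenizer outputs and the answer span labels remain after remove_columns.
print(tokenized_squad["train"].column_names)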
Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, the DefaultDataCollator does not apply any additional preprocessing such as padding.
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()
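To see what the collator produces, you can assemble a small batch by hand (a quick illustrative check; the collator simply stacks the already-padded features into tensors):

features = [tokenized_squad["train"][i] for i in range(2)]
batch = data_collator(features)
# Each value is a tensor, e.g. input_ids with shape (batch_size, sequence_length).
print({name: tensor.shape for name, tensor in batch.items()})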
## Train
If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial in the 🤗 Transformers documentation.
You're ready to start training your model now! Load DistilBERT with AutoModelForQuestionAnswering:
from transformers import AutoModelForQuestionAnswering, TrainingArguments
model = AutoModelForQuestionAnswering.from_pretrained(
    "distilbert-base-uncased-distilled-squad"
)
###############################################
######### BEGIN STAINED GLASS CHANGES #########
###############################################
import stainedglass_core.utils.torch
from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
input_shape = (-1, tokenizer.model_max_length)
model = sg_model.NoisyTransformerModel(
    sg_noise_layer.CloakNoiseLayerOneShot,
    model,
    input_shape,
    target_layer="distilbert.embeddings",
    percent_to_mask=0.5,
    scale=(1e-3, 1.0),
)
# By default, both the base model and the noise layer are trainable.
# Here, the base model is frozen so that only the noise layer is trained:
stainedglass_core.utils.torch.freeze(model.base_model)
# To instead freeze the noise layer (and only train the base model), use:
# stainedglass_core.utils.torch.freeze(model.noise_layer)
###############################################
########## END STAINED GLASS CHANGES ##########
###############################################
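As a quick check that the freeze took effect, you can count the parameters that still require gradients. This uses only standard PyTorch and assumes the wrapped model behaves like a regular torch.nn.Module:

# After freezing the base model, only the noise layer's parameters should require gradients.
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(f"{len(trainable)} trainable parameter tensors remain")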
At this point, only three steps remain:

- Define your training hyperparameters in TrainingArguments. The only required parameter is `output_dir`, which specifies where to save your model. You can push this model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model).
- Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
- Call train() to finetune your model.
NUM_TRAIN_EPOCHS = 2
import stainedglass_core.huggingface
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=NUM_TRAIN_EPOCHS,
    ###############################################
    ######### BEGIN STAINED GLASS CHANGES #########
    ###############################################
    # Weight decay interacts with Stained Glass Transform training, so it is disabled here.
    weight_decay=0.00,
    ###############################################
    ########## END STAINED GLASS CHANGES ##########
    ###############################################
    report_to="none",
)
###############################################
######### BEGIN STAINED GLASS CHANGES #########
###############################################
# Use the StainedGlassTrainer instead of the Trainer
trainer = stainedglass_core.huggingface.StainedGlassTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    alpha=0.1,  # Must specify an alpha value
)
###############################################
########## END STAINED GLASS CHANGES ##########
###############################################
trainer.train()
0%| | 0/2 [00:00<?, ?it/s]
0%| | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.1793955564498901, 'eval_runtime': 0.1548, 'eval_samples_per_second': 12.919, 'eval_steps_per_second': 6.459, 'epoch': 1.0}
0%| | 0/1 [00:00<?, ?it/s]
{'eval_loss': 1.18039071559906, 'eval_runtime': 0.1049, 'eval_samples_per_second': 19.059, 'eval_steps_per_second': 9.529, 'epoch': 2.0}
{'train_runtime': 2.4033, 'train_samples_per_second': 6.657, 'train_steps_per_second': 0.832, 'train_loss': 0.8599898815155029, 'epoch': 2.0}
TrainOutput(global_step=2, training_loss=0.8599898815155029, metrics={'train_runtime': 2.4033, 'train_samples_per_second': 6.657, 'train_steps_per_second': 0.832, 'train_loss': 0.8599898815155029, 'epoch': 2.0})