tokenization_utils

PromptType

Bases: str, Enum

The type of prompt to use.

Added in version 0.57.0.
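
The two prompt types referenced on this page are PromptType.INSTRUCTION (value 'instruction', the default throughout this module) and PromptType.CHAT (used by TokenizerWrapper.__call__ for sequences of schemas). As a minimal sketch only, a str-valued enum with these members could look like the following; the 'chat' value is an assumption inferred from the naming convention:

```python
import enum


class PromptType(str, enum.Enum):
    """The type of prompt to use."""

    # INSTRUCTION's value appears in the signatures on this page;
    # CHAT's value is assumed to follow the same lowercase convention.
    INSTRUCTION = "instruction"
    CHAT = "chat"
```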

TokenizerWrapper

Bases: Generic[TokenizerWrapperReturnT_co, SchemaT_contra]

Wraps a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.

Calling an instance of this class returns a dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable).

Examples:

Instantiating a pipeline for a Transformers model:

>>> import transformers
>>> model_id = "tests/resources/tokenizers/mini-Mistral-7B-Instruct-v0.2"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> config = transformers.AutoConfig.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_config(config)

Using a TokenizerWrapper for training a Stained Glass Transform model:

>>> noisy_tokenizer = TokenizerWrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
... )
>>> noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}

Using a TokenizerWrapper for testing a Stained Glass Transform model:

>>> noisy_tokenizer = TokenizerWrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
... )
>>> noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.

device property writable

device: device | None

The device to use for the generated tensors.

This device will respect any map_location specified in torch.load or when loading a StainedGlassTransformForText model.
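
Since the property is writable, the target device can be inspected and reassigned after construction. A minimal sketch, reusing the noisy_tokenizer instance from the examples above:

```python
import torch

# Read the current target device (None means outputs are not moved).
print(noisy_tokenizer.device)

# Reassign so subsequently generated tensors are created on the GPU.
noisy_tokenizer.device = torch.device("cuda:0")
```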

__call__

__call__(schema: SchemaT_contra) -> TokenizerWrapperReturnT_co

Apply the pipeline to a schema, generating the tokenized prompt, the noise mask, and a loss mask (if applicable).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| schema | SchemaT_contra | The schema to apply the pipeline to. This will be the typed dictionary Schema if the prompt type is PromptType.INSTRUCTION, or a sequence of schemas if the prompt type is PromptType.CHAT. | required |

Returns:

| Type | Description |
| --- | --- |
| TokenizerWrapperReturnT_co | A dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable). See the class and constructor documentation for more details on the contents of the output dictionary for given constructor arguments. |
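
For PromptType.CHAT, the parameter table above specifies a sequence of schemas rather than a single typed dictionary. A hedged sketch, assuming each turn reuses the same Schema keys shown in the class examples:

```python
chat_tokenizer = TokenizerWrapper(
    tokenizer=tokenizer,
    model_type=type(base_model),
    transform_all_tokens=False,
    ignore_prompt_loss=True,
    include_labels=False,
    prompt_type=PromptType.CHAT,
)

# A sequence of per-turn schemas; the exact per-turn keys are an assumption.
chat_tokenizer(
    [
        {
            "system_prompt": "You are a helpful assistant.",
            "context": "",
            "instruction": "Hi!",
            "response": "Hello!",
        },
        {
            "system_prompt": "",
            "context": "",
            "instruction": "How are you?",
            "response": "Great, thanks!",
        },
    ]
)
```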

__init__

__init__(tokenizer: PreTrainedTokenizerBase | None, model_type: type[PreTrainedModel] | PreTrainedModel | None, include_labels: bool, ignore_prompt_loss: bool, device: str | int | device | None = None, always_include_context: bool = False, tokenizer_mapper: TokenizerMapper | MistralMultiturnTransformLayerMapper | None = None, transform_all_tokens: bool = False, prompt_type: PromptType = PromptType.INSTRUCTION) -> None

Wrap a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase \| None | The tokenizer to use. Can be None only if tokenizer_mapper is provided. | required |
| model_type | type[PreTrainedModel] \| PreTrainedModel \| None | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. Can be None only if tokenizer_mapper is provided. | required |
| include_labels | bool | Whether to include labels in the output. | required |
| ignore_prompt_loss | bool | Whether to ignore the prompt when generating the loss mask. Only relevant if include_labels is False. | required |
| device | str \| int \| device \| None | The device to cast the output tensors to. If None, the output tensors will not be cast to any device. | None |
| always_include_context | bool | Whether to always include the context start tokens in the tokenization output. If False, the context start tokens will only be included if the context is non-empty. | False |
| tokenizer_mapper | TokenizerMapper \| MistralMultiturnTransformLayerMapper \| None | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of TokenizerMapper with a compatible signature. | None |
| transform_all_tokens | bool | Whether to transform all tokens in the prompt. | False |
| prompt_type | PromptType | The type of prompt to use. | PromptType.INSTRUCTION |

Note

If include_labels is False, the resulting pipeline will return a dictionary also containing the key loss_mask. This key will contain a list of True and False values, where True indicates tokens that should be considered when calculating the distillation loss. This is useful for only considering the tokens that had Stained Glass Transform applied. If ignore_prompt_loss is True, the prompt will also not be considered when generating the loss mask, even if it had Stained Glass Transform applied to it.
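
To illustrate how the loss_mask might be consumed downstream, here is a minimal sketch that uses boolean indexing to keep only the positions that participate in the distillation loss; the logits tensor is hypothetical:

```python
import torch

outputs = noisy_tokenizer(
    {
        "context": "This is a context.",
        "instruction": "Please respond to the following message.",
        "response": "Hello!",
        "system_prompt": "You are a helpful assistant.",
    }
)

# loss_mask is True for tokens that count toward the distillation loss.
loss_mask = torch.as_tensor(outputs["loss_mask"], dtype=torch.bool)

# Hypothetical per-token logits of shape (sequence_length, vocab_size);
# boolean indexing keeps only the masked positions.
logits = torch.randn(loss_mask.shape[0], 32000)
masked_logits = logits[loss_mask]
```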

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the model is not supported. |
| ValueError | If neither tokenizer_mapper nor both tokenizer and model_type are provided. |
| ValueError | If tokenizer_mapper is provided at the same time as either tokenizer or model_type. |
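
The last two conditions make tokenizer_mapper mutually exclusive with the tokenizer/model_type pair. A minimal sketch of calls that would fail (my_mapper is a hypothetical TokenizerMapper instance):

```python
# Neither a mapper nor the tokenizer/model_type pair: raises ValueError.
TokenizerWrapper(
    tokenizer=None,
    model_type=None,
    include_labels=False,
    ignore_prompt_loss=True,
)

# A mapper alongside a tokenizer: also raises ValueError.
TokenizerWrapper(
    tokenizer=tokenizer,
    model_type=None,
    include_labels=False,
    ignore_prompt_loss=True,
    tokenizer_mapper=my_mapper,  # hypothetical TokenizerMapper instance
)
```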

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.

noise_mask_tokenizer_wrapper

noise_mask_tokenizer_wrapper(tokenizer: PreTrainedTokenizerBase, model_type: type[PreTrainedModel] | PreTrainedModel, template_function: Callable[SchemaMapperP, SchemaT_contra], include_labels: bool, include_noise_mask: bool, ignore_prompt_loss: bool, return_tensors: bool = False, device: str | int | device | None = None, always_include_context: bool = False, tokenizer_mapper: TokenizerMapper | None = None, transform_all_tokens: bool = False, prompt_type: PromptType = PromptType.INSTRUCTION) -> Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TestMapper.TestInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[list[int]]] | Callable[SchemaMapperP, universal.TestMapper.TestInput[list[int]]] | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[list[int]]] | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[list[int]]]

Wrap a tokenizer with a pipeline that not only tokenizes a prompt, but also generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase | The tokenizer to use. | required |
| model_type | type[PreTrainedModel] \| PreTrainedModel | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. | required |
| template_function | Callable[SchemaMapperP, SchemaT_contra] | The callable to use to map the dataset element to an intermediate schema. The callable can have any signature, but must return a Schema. | required |
| include_labels | bool | Whether to include labels in the output. | required |
| include_noise_mask | bool | Whether to include a noise mask. The noise mask is 1 for tokens that should be changed by the Stained Glass Transform, and 0 otherwise. | required |
| ignore_prompt_loss | bool | Whether to ignore the prompt when generating the model loss mask. Only relevant if both include_labels and include_noise_mask are True. | required |
| return_tensors | bool | Whether to return the output as tensors or lists. | False |
| device | str \| int \| device \| None | The device to cast the output tensors to. If None, the output tensors will not be cast to any device. | None |
| always_include_context | bool | Whether to always include the context start tokens in the tokenization output. If False, the context start tokens will only be included if the context is non-empty. | False |
| tokenizer_mapper | TokenizerMapper \| None | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of TokenizerMapper with a compatible signature. | None |
| transform_all_tokens | bool | Whether to transform all tokens in the prompt. | False |
| prompt_type | PromptType | The type of prompt to use. | PromptType.INSTRUCTION |

Note

If include_labels==False and include_noise_mask==True, the resulting pipeline will return a dictionary also containing the key loss_mask. This key will contain a list of 1s and 0s, where 1s indicate tokens that should be considered when calculating the model loss. This is useful for only considering the tokens that had Stained Glass Transform applied to them. If ignore_prompt_loss is True, the prompt will also not be considered when generating the loss mask, even if it had Stained Glass Transform applied to it.

Returns:

| Type | Description |
| --- | --- |
| Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TestMapper.TestInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TrainMapper.TrainInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[list[int]]] \| Callable[SchemaMapperP, universal.TestMapper.TestInput[list[int]]] \| Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[list[int]]] \| Callable[SchemaMapperP, universal.TrainMapper.TrainInput[list[int]]] | A functional pipeline that tokenizes a prompt and generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses. The input(s) of the pipeline will be the input(s) of the template_function. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the model is not supported. |

Examples:

Instantiating a pipeline for a Llama2 model:

>>> import transformers
>>> model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(model_id)

Template functions can take in any args/kwargs, but must return a Schema. See the docs for Schema for more details.

>>> import json
>>> def json_string_template_function(
...     user_message: str,
... ) -> universal.Schema:
...     '''Take a user message containing the json template below, and return an instruction schema.
...
...     ```
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the context.",
...         "response": "This is a response."
...     }
...     ```
...     '''
...     loaded_json = json.loads(user_message)
...     return {
...         "system_prompt": "You are a helpful assistant. Be courteous and helpful.",
...         "context": loaded_json["context"],
...         "instruction": loaded_json["instruction"],
...         "response": loaded_json["response"],
...     }
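
Applied directly, the template function produces the intermediate schema that the pipeline consumes:

>>> json_string_template_function(
...     '{"context": "This is a context.", "instruction": "Please respond to the context.", "response": "This is a response."}'
... )
{'system_prompt': 'You are a helpful assistant. Be courteous and helpful.', 'context': 'This is a context.', 'instruction': 'Please respond to the context.', 'response': 'This is a response.'}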

Using a Tokenizer Wrapper for training a Stained Glass Transform model:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=True,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}

Using a Tokenizer Wrapper for testing a Stained Glass Transform model:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
...     include_noise_mask=True,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}

Using a Tokenizer Wrapper for training a Stained Glass Transform model without a noise mask (or, less commonly, for tuning a base model without Stained Glass Transform):

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=False,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...)}

Using a Tokenizer Wrapper for testing a Stained Glass Transform model without a noise mask (or, less commonly, for tuning a base model without Stained Glass Transform):

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
...     include_noise_mask=False,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...)}

Using a Tokenizer Wrapper for training a Stained Glass Transform model, but returning lists instead of tensors. This can be useful when using a Hugging Face data collator that expects lists:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=True,
...     return_tensors=False,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': [...], 'noise_mask': [...], 'loss_mask': [...]}
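
Because the list-returning variant pairs naturally with Hugging Face data collators, here is a hedged sketch of mapping it over a datasets.Dataset; the dataset and its prompt column are hypothetical:

```python
import datasets

# Hypothetical single-column dataset of JSON prompt strings.
dataset = datasets.Dataset.from_dict(
    {
        "prompt": [
            '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
        ]
    }
)

# Each row gains input_ids, noise_mask, and loss_mask columns.
tokenized = dataset.map(lambda row: noisy_tokenizer(row["prompt"]))
```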

Added in version 0.57.0.

Changed in version 0.72.0: Added `transform_all_tokens` parameter.

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Changed in version 0.77.0: Added parameter `prompt_type` to support chat template.

Deprecated since version 0.96.0. Use the `TokenizerWrapper` class instead, manually casting the input to an appropriate Schema.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.
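
As a migration sketch under the deprecation note above, the same training pipeline can be assembled from the TokenizerWrapper class documented on this page by applying the template function by hand:

```python
noisy_tokenizer = TokenizerWrapper(
    tokenizer=tokenizer,
    model_type=type(base_model),
    transform_all_tokens=False,
    ignore_prompt_loss=True,
    include_labels=False,
)

# Manually cast the raw input to a Schema, then tokenize.
schema = json_string_template_function(
    '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
)
outputs = noisy_tokenizer(schema)
# -> {'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}
```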