tokenization_utils

PromptType

Bases: str, Enum

The type of prompt to use.

Added in version 0.57.0.
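
The two prompt types referenced on this page are PromptType.INSTRUCTION (value 'instruction', the default throughout this module) and PromptType.CHAT (used by TokenizerWrapper.__call__ for sequences of schemas). As a minimal sketch only, a str-valued enum with these members could look like the following; the 'chat' value is an assumption inferred from the naming convention:

```python
import enum


class PromptType(str, enum.Enum):
    """The type of prompt to use."""

    # INSTRUCTION's value appears in the signatures on this page;
    # CHAT's value is assumed to follow the same lowercase convention.
    INSTRUCTION = "instruction"
    CHAT = "chat"
```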

TokenizerWrapper

Bases: Generic[TokenizerWrapperReturnT_co, SchemaT_contra]

Wraps a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.

Calling an instance of this class returns a dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable).

Examples:

Instantiating a pipeline for a Transformers model:

>>> import transformers
>>> model_id = "tests/resources/tokenizers/mini-Mistral-7B-Instruct-v0.2"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> config = transformers.AutoConfig.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_config(config)

Using a TokenizerWrapper for training a Stained Glass Transform model:

>>> noisy_tokenizer = TokenizerWrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
... )
>>> noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}

Using a TokenizerWrapper for testing a Stained Glass Transform model:

>>> noisy_tokenizer = TokenizerWrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
... )
>>> noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.

device property writable

device: device | None

The device to use for the generated tensors.

This device will respect any map_location specified in torch.load or when loading a StainedGlassTransformForText model.
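
Since the property is writable, the target device can be inspected and reassigned after construction. A minimal sketch, reusing the noisy_tokenizer instance from the examples above:

```python
import torch

# Read the current target device (None means outputs are not moved).
print(noisy_tokenizer.device)

# Reassign so subsequently generated tensors are created on the GPU.
noisy_tokenizer.device = torch.device("cuda:0")
```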

__call__

__call__(schema: SchemaT_contra) -> TokenizerWrapperReturnT_co

Apply the pipeline to a schema, generating the tokenized prompt, the noise mask, and a loss mask (if applicable).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| schema | SchemaT_contra | The schema to apply the pipeline to. This will be the typed dictionary Schema if the prompt type is PromptType.INSTRUCTION, or a sequence of schemas if the prompt type is PromptType.CHAT. | required |

Returns:

| Type | Description |
| --- | --- |
| TokenizerWrapperReturnT_co | A dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable). See the class and constructor documentation for more details on the contents of the output dictionary for given constructor arguments. |
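
For PromptType.CHAT, the parameter table above specifies a sequence of schemas rather than a single typed dictionary. A hedged sketch, assuming each turn reuses the same Schema keys shown in the class examples:

```python
chat_tokenizer = TokenizerWrapper(
    tokenizer=tokenizer,
    model_type=type(base_model),
    transform_all_tokens=False,
    ignore_prompt_loss=True,
    include_labels=False,
    prompt_type=PromptType.CHAT,
)

# A sequence of per-turn schemas; the exact per-turn keys are an assumption.
chat_tokenizer(
    [
        {
            "system_prompt": "You are a helpful assistant.",
            "context": "",
            "instruction": "Hi!",
            "response": "Hello!",
        },
        {
            "system_prompt": "",
            "context": "",
            "instruction": "How are you?",
            "response": "Great, thanks!",
        },
    ]
)
```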

__init__

__init__(tokenizer: PreTrainedTokenizerBase | None, model_type: type[PreTrainedModel] | PreTrainedModel | None, include_labels: bool, ignore_prompt_loss: bool, device: str | int | device | None = None, always_include_context: bool = False, tokenizer_mapper: TokenizerMapper | MistralMultiturnTransformLayerMapper | None = None, transform_all_tokens: bool = False, prompt_type: PromptType = PromptType.INSTRUCTION) -> None

Wrap a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase \| None | The tokenizer to use. Can be None only if tokenizer_mapper is provided. | required |
| model_type | type[PreTrainedModel] \| PreTrainedModel \| None | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. Can be None only if tokenizer_mapper is provided. | required |
| include_labels | bool | Whether to include labels in the output. | required |
| ignore_prompt_loss | bool | Whether to ignore the prompt when generating the loss mask. Only relevant if include_labels is False. | required |
| device | str \| int \| device \| None | The device to cast the output tensors to. If None, the output tensors will not be cast to any device. | None |
| always_include_context | bool | Whether to always include the context start tokens in the tokenization output. If False, the context start tokens will only be included if the context is non-empty. | False |
| tokenizer_mapper | TokenizerMapper \| MistralMultiturnTransformLayerMapper \| None | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of TokenizerMapper with a compatible signature. | None |
| transform_all_tokens | bool | Whether to transform all tokens in the prompt. | False |
| prompt_type | PromptType | The type of prompt to use. | PromptType.INSTRUCTION |

Note

If include_labels is False, the resulting pipeline will return a dictionary also containing the key loss_mask. This key will contain a list of True and False values, where True indicates tokens that should be considered when calculating the distillation loss. This is useful for only considering the tokens that had Stained Glass Transform applied. If ignore_prompt_loss is True, the prompt will also not be considered when generating the loss mask, even if it had Stained Glass Transform applied to it.
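
To illustrate how the loss_mask might be consumed downstream, here is a minimal sketch that uses boolean indexing to keep only the positions that participate in the distillation loss; the logits tensor is hypothetical:

```python
import torch

outputs = noisy_tokenizer(
    {
        "context": "This is a context.",
        "instruction": "Please respond to the following message.",
        "response": "Hello!",
        "system_prompt": "You are a helpful assistant.",
    }
)

# loss_mask is True for tokens that count toward the distillation loss.
loss_mask = torch.as_tensor(outputs["loss_mask"], dtype=torch.bool)

# Hypothetical per-token logits of shape (sequence_length, vocab_size);
# boolean indexing keeps only the masked positions.
logits = torch.randn(loss_mask.shape[0], 32000)
masked_logits = logits[loss_mask]
```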

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the model is not supported. |
| ValueError | If neither tokenizer_mapper nor both tokenizer and model_type are provided. |
| ValueError | If tokenizer_mapper is provided at the same time as either tokenizer or model_type. |
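
The last two conditions make tokenizer_mapper mutually exclusive with the tokenizer/model_type pair. A minimal sketch of calls that would fail (my_mapper is a hypothetical TokenizerMapper instance):

```python
# Neither a mapper nor the tokenizer/model_type pair: raises ValueError.
TokenizerWrapper(
    tokenizer=None,
    model_type=None,
    include_labels=False,
    ignore_prompt_loss=True,
)

# A mapper alongside a tokenizer: also raises ValueError.
TokenizerWrapper(
    tokenizer=tokenizer,
    model_type=None,
    include_labels=False,
    ignore_prompt_loss=True,
    tokenizer_mapper=my_mapper,  # hypothetical TokenizerMapper instance
)
```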

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.

noise_mask_tokenizer_wrapper

noise_mask_tokenizer_wrapper(tokenizer: PreTrainedTokenizerBase, model_type: type[PreTrainedModel] | PreTrainedModel, template_function: Callable[SchemaMapperP, SchemaT_contra], include_labels: bool, include_noise_mask: bool, ignore_prompt_loss: bool, return_tensors: bool = False, device: str | int | device | None = None, always_include_context: bool = False, tokenizer_mapper: TokenizerMapper | None = None, transform_all_tokens: bool = False, prompt_type: PromptType = PromptType.INSTRUCTION) -> Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TestMapper.TestInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[torch.Tensor]] | Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[list[int]]] | Callable[SchemaMapperP, universal.TestMapper.TestInput[list[int]]] | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[list[int]]] | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[list[int]]]

Wrap a tokenizer with a pipeline that not only tokenizes a prompt, but also generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase | The tokenizer to use. | required |
| model_type | type[PreTrainedModel] \| PreTrainedModel | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. | required |
| template_function | Callable[SchemaMapperP, SchemaT_contra] | The callable to use to map the dataset element to an intermediate schema. The callable can have any signature, but must return a Schema. | required |
| include_labels | bool | Whether to include labels in the output. | required |
| include_noise_mask | bool | Whether to include a noise mask. The noise mask is 1 for tokens that should be changed by the Stained Glass Transform, and 0 otherwise. | required |
| ignore_prompt_loss | bool | Whether to ignore the prompt when generating the model loss mask. Only relevant if both include_labels and include_noise_mask are True. | required |
| return_tensors | bool | Whether to return the output as tensors or lists. | False |
| device | str \| int \| device \| None | The device to cast the output tensors to. If None, the output tensors will not be cast to any device. | None |
| always_include_context | bool | Whether to always include the context start tokens in the tokenization output. If False, the context start tokens will only be included if the context is non-empty. | False |
| tokenizer_mapper | TokenizerMapper \| None | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of TokenizerMapper with a compatible signature. | None |
| transform_all_tokens | bool | Whether to transform all tokens in the prompt. | False |
| prompt_type | PromptType | The type of prompt to use. | PromptType.INSTRUCTION |

Note

If include_labels==False and include_noise_mask==True, the resulting pipeline will return a dictionary also containing the key loss_mask. This key will contain a list of 1s and 0s, where 1s indicate tokens that should be considered when calculating the model loss. This is useful for only considering the tokens that had Stained Glass Transform applied to them. If ignore_prompt_loss is True, the prompt will also not be considered when generating the loss mask, even if it had Stained Glass Transform applied to it.

Returns:

| Type | Description |
| --- | --- |
| Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TestMapper.TestInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TrainMapper.TrainInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[list[int]]] \| Callable[SchemaMapperP, universal.TestMapper.TestInput[list[int]]] \| Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[list[int]]] \| Callable[SchemaMapperP, universal.TrainMapper.TrainInput[list[int]]] | A functional pipeline that tokenizes a prompt and generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses. The input(s) of the pipeline will be the input(s) of the template_function. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the model is not supported. |

Examples:

Instantiating a pipeline for a Llama2 model:

>>> import transformers
>>> model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(model_id)

Template functions can take in any args/kwargs, but must return a Schema. See the docs for Schema for more details.

>>> import json
>>> def json_string_template_function(
...     user_message: str,
... ) -> universal.Schema:
...     '''Take a user message containing the json template below, and return an instruction schema.
...
...     ```
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the context.",
...         "response": "This is a response."
...     }
...     ```
...     '''
...     loaded_json = json.loads(user_message)
...     return {
...         "system_prompt": "You are a helpful assistant. Be courteous and helpful.",
...         "context": loaded_json["context"],
...         "instruction": loaded_json["instruction"],
...         "response": loaded_json["response"],
...     }
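
Applied directly, the template function produces the intermediate schema that the pipeline consumes:

>>> json_string_template_function(
...     '{"context": "This is a context.", "instruction": "Please respond to the context.", "response": "This is a response."}'
... )
{'system_prompt': 'You are a helpful assistant. Be courteous and helpful.', 'context': 'This is a context.', 'instruction': 'Please respond to the context.', 'response': 'This is a response.'}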

Using a Tokenizer Wrapper for training a Stained Glass Transform model:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=True,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}

Using a Tokenizer Wrapper for testing a Stained Glass Transform model:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
...     include_noise_mask=True,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}

Using a Tokenizer Wrapper for training a Stained Glass Transform model without a noise mask (or, less commonly, for tuning a base model without Stained Glass Transform):

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=False,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...)}

Using a Tokenizer Wrapper for testing a Stained Glass Transform model without a noise mask (or, less commonly, for tuning a base model without Stained Glass Transform):

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
...     include_noise_mask=False,
...     return_tensors=True,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...)}

Using a Tokenizer Wrapper for training a Stained Glass Transform model, but returning lists instead of tensors. This can be useful when using a Hugging Face data collator that expects lists:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=True,
...     return_tensors=False,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': [...], 'noise_mask': [...], 'loss_mask': [...]}
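
Because the list-returning variant pairs naturally with Hugging Face data collators, here is a hedged sketch of mapping it over a datasets.Dataset; the dataset and its prompt column are hypothetical:

```python
import datasets

# Hypothetical single-column dataset of JSON prompt strings.
dataset = datasets.Dataset.from_dict(
    {
        "prompt": [
            '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
        ]
    }
)

# Each row gains input_ids, noise_mask, and loss_mask columns.
tokenized = dataset.map(lambda row: noisy_tokenizer(row["prompt"]))
```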

Added in version 0.57.0.

Changed in version 0.72.0: Added `transform_all_tokens` parameter.

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Changed in version 0.77.0: Added parameter `prompt_type` to support chat template.

Deprecated since version 0.96.0. Use the `TokenizerWrapper` class instead, manually casting the input to an appropriate Schema.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.
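
As a migration sketch under the deprecation note above, the same training pipeline can be assembled from the TokenizerWrapper class documented on this page by applying the template function by hand:

```python
noisy_tokenizer = TokenizerWrapper(
    tokenizer=tokenizer,
    model_type=type(base_model),
    transform_all_tokens=False,
    ignore_prompt_loss=True,
    include_labels=False,
)

# Manually cast the raw input to a Schema, then tokenize.
schema = json_string_template_function(
    '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
)
outputs = noisy_tokenizer(schema)
# -> {'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}
```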