
tokenizer_wrapper

Classes:

| Name | Description |
| --- | --- |
| `PromptType` | The type of prompt to use. |

Functions:

| Name | Description |
| --- | --- |
| `noise_mask_tokenizer_wrapper` | Wrap a tokenizer with a pipeline that not only tokenizes a prompt, but also generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses. |

PromptType

Bases: str, Enum

The type of prompt to use.
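
Because the enum derives from str, members can be constructed from, or compared against, their string values. A minimal sketch (the 'instruction' value is the one shown in the defaults on this page; other member values are assumed to follow the same pattern):

>>> PromptType("instruction") is PromptType.INSTRUCTION
True
>>> PromptType.INSTRUCTION == "instruction"
True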

Added in version 0.57.0.

TokenizerWrapper

Bases: Generic[TokenizerWrapperReturnT_co, SchemaT_contra]

Wraps a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.

Calling this class will return a dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable).

Examples:

Instantiating a pipeline for a Transformers model:

>>> model_id = "tests/resources/tokenizers/mini-Mistral-7B-Instruct-v0.2"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> config = transformers.AutoConfig.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_config(config)

Using a TokenizerWrapper for training a Stained Glass Transform model:

>>> noisy_tokenizer = TokenizerWrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}

Using a TokenizerWrapper for testing a Stained Glass Transform model:

>>> noisy_tokenizer = TokenizerWrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.

Methods:

| Name | Description |
| --- | --- |
| `__call__` | Apply the pipeline to a schema, generating the tokenized prompt, the noise mask, and a loss mask (if applicable). |
| `__init__` | Wrap a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses. |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `device` | `device \| None` | The device to use for the generated tensors. |

device (property, writable)

device: device | None

The device to use for the generated tensors.

This device will respect any `map_location` specified in `torch.load` or when loading a `StainedGlassTransformForText` model.
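
Because the property is writable, the target device can be changed after construction. A minimal sketch (assuming the `noisy_tokenizer` instance from the examples above and that `torch` is imported):

>>> noisy_tokenizer.device = torch.device("cpu")
>>> noisy_tokenizer.device
device(type='cpu')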

__call__

__call__(
    schema: SchemaT_contra,
) -> TokenizerWrapperReturnT_co

Apply the pipeline to a schema, generating the tokenized prompt, the noise mask, and a loss mask (if applicable).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `schema` | `SchemaT_contra` | The schema to apply the pipeline to. This will be the typed dictionary Schema if the prompt type is `PromptType.INSTRUCTION` or `PromptType.PRETRAIN`, or a sequence of schemas if the prompt type is `PromptType.CHAT`. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `TokenizerWrapperReturnT_co` | A dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable). See the class and constructor documentation for more details on the contents of the output dictionary for given constructor arguments. |
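
One common way to consume the output during training is to restrict a per-token loss to the masked tokens. A minimal sketch (assuming the training configuration above with include_labels=False, so that loss_mask is present; the per-token losses here are random placeholders):

>>> output = noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
>>> per_token_loss = torch.rand_like(output["input_ids"], dtype=torch.float32)
>>> loss = (per_token_loss * output["loss_mask"]).sum() / output["loss_mask"].sum()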

__init__

Wrap a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenizer` | `PreTrainedTokenizerBase \| None` | The tokenizer to use. Can be `None` only if `tokenizer_mapper` is provided. | *required* |
| `model_type` | `type[PreTrainedModel] \| PreTrainedModel \| None` | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. Can be `None` only if `tokenizer_mapper` is provided. | *required* |
| `include_labels` | `bool` | Whether to include labels in the output. | *required* |
| `ignore_prompt_loss` | `bool` | Whether to ignore the prompt when generating the loss mask. Only relevant if `include_labels` is `False`. | *required* |
| `device` | `str \| int \| device \| None` | The device to cast the output tensors to. If `None`, the output tensors will not be cast to any device. | `None` |
| `always_include_context` | `bool` | Whether to always include the context start tokens in the tokenization output. If `False`, the context start tokens will only be included if the context is non-empty. | `False` |
| `tokenizer_mapper` | `TokenizerMapper \| MistralMultiturnTransformLayerMapper \| None` | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of `TokenizerMapper` with a compatible signature. | `None` |
| `transform_all_tokens` | `bool` | Whether to transform all tokens in the prompt. | `False` |
| `prompt_type` | `PromptType` | The type of prompt to use. | `PromptType.INSTRUCTION` |
Note

If `include_labels` is `False`, the resulting pipeline will return a dictionary also containing the key `loss_mask`. This key will contain a list of `True` and `False` values, where `True` indicates tokens that should be considered when calculating the distillation loss. This is useful for only considering the tokens that had the Stained Glass Transform applied. If `ignore_prompt_loss` is `True`, the prompt will also not be considered when generating the loss mask, even if it had the Stained Glass Transform applied to it.

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the model is not supported. |
| `ValueError` | If neither `tokenizer_mapper` nor both `tokenizer` and `model_type` are provided. |
| `ValueError` | If `tokenizer_mapper` is provided at the same time as either `tokenizer` or `model_type`. |
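
For instance, omitting both ways of specifying the mapping raises the second error. A sketch only; the exact exception message may differ:

>>> TokenizerWrapper(
...     tokenizer=None,
...     model_type=None,
...     include_labels=False,
...     ignore_prompt_loss=True,
... )
Traceback (most recent call last):
    ...
ValueError: ...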

Changed in version 0.128.0: Added support for the pretrain prompt type.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.

noise_mask_tokenizer_wrapper

noise_mask_tokenizer_wrapper(
    tokenizer: PreTrainedTokenizerBase,
    model_type: type[PreTrainedModel] | PreTrainedModel,
    template_function: Callable[SchemaMapperP, SchemaT_contra],
    include_labels: bool,
    include_noise_mask: bool,
    ignore_prompt_loss: bool,
    return_tensors: bool = False,
    device: str | int | device | None = None,
    always_include_context: bool = False,
    tokenizer_mapper: TokenizerMapper | None = None,
    transform_all_tokens: bool = False,
    prompt_type: PromptType = PromptType.INSTRUCTION,
) -> (
    Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TestMapper.TestInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[list[int]]]
    | Callable[SchemaMapperP, universal.TestMapper.TestInput[list[int]]]
    | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[list[int]]]
    | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[list[int]]]
)

Wrap a tokenizer with a pipeline that not only tokenizes a prompt, but also generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenizer` | `PreTrainedTokenizerBase` | The tokenizer to use. | *required* |
| `model_type` | `type[PreTrainedModel] \| PreTrainedModel` | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. | *required* |
| `template_function` | `Callable[SchemaMapperP, SchemaT_contra]` | The callable to use to map the dataset element to an intermediate schema. The callable can have any signature, but must return a Schema. | *required* |
| `include_labels` | `bool` | Whether to include labels in the output. | *required* |
| `include_noise_mask` | `bool` | Whether to include a noise mask. The noise mask is 1 for tokens that should be changed by the Stained Glass Transform, and 0 otherwise. | *required* |
| `ignore_prompt_loss` | `bool` | Whether to ignore the prompt when generating the model loss mask. Only relevant if both `include_labels` and `include_noise_mask` are `True`. | *required* |
| `return_tensors` | `bool` | Whether to return the output as tensors or lists. | `False` |
| `device` | `str \| int \| device \| None` | The device to cast the output tensors to. If `None`, the output tensors will not be cast to any device. | `None` |
| `always_include_context` | `bool` | Whether to always include the context start tokens in the tokenization output. If `False`, the context start tokens will only be included if the context is non-empty. | `False` |
| `tokenizer_mapper` | `TokenizerMapper \| None` | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of `TokenizerMapper` with a compatible signature. | `None` |
| `transform_all_tokens` | `bool` | Whether to transform all tokens in the prompt. | `False` |
| `prompt_type` | `PromptType` | The type of prompt to use. | `PromptType.INSTRUCTION` |
Note

If `include_labels` is `False` and `include_noise_mask` is `True`, the resulting pipeline will return a dictionary also containing the key `loss_mask`. This key will contain a list of 1s and 0s, where 1s indicate tokens that should be considered when calculating the model loss. This is useful for only considering the tokens that had the Stained Glass Transform applied to them. If `ignore_prompt_loss` is `True`, the prompt will also not be considered when generating the loss mask, even if it had the Stained Glass Transform applied to it.

Returns:

| Type | Description |
| --- | --- |
| The `Callable` union shown in the signature above | A functional pipeline that tokenizes a prompt and generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses. The input(s) of the pipeline will be the input(s) of the `template_function`. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the model is not supported. |

Examples:

Instantiating a pipeline for a Llama2 model:

>>> model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(model_id)

Template functions can take in any args/kwargs, but must return a Schema. See the docs for Schema for more details.

>>> import json
>>> def json_string_template_function(
...     user_message: str,
... ) -> universal.Schema:
...     '''Take a user message containing the json template below, and return an instruction schema.
...
...     ```
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the context.",
...         "response": "This is a response."
...     }
...     ```
...     '''
...     loaded_json = json.loads(user_message)
...     return {
...         "system_prompt": "You are a helpful assistant. Be courteous and helpful.",
...         "context": loaded_json["context"],
...         "instruction": loaded_json["instruction"],
...         "response": loaded_json["response"],
...     }

Using a tokenizer wrapper for training a Stained Glass Transform model:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=True,
...     return_tensors=True,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}

Using a tokenizer wrapper for testing a Stained Glass Transform model:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
...     include_noise_mask=True,
...     return_tensors=True,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}

Using a tokenizer wrapper for training a Stained Glass Transform model without a noise mask (or for tuning a base model without the Stained Glass Transform, less common):

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=False,
...     return_tensors=True,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...)}

Using a tokenizer wrapper for testing a Stained Glass Transform model without a noise mask (or for tuning a base model without the Stained Glass Transform, less common):

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=True,
...     include_noise_mask=False,
...     return_tensors=True,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...)}

Using a tokenizer wrapper for training a Stained Glass Transform model, but returning lists instead of tensors. This can be useful when using a Hugging Face data collator that expects lists:

>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     template_function=json_string_template_function,
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     include_noise_mask=True,
...     return_tensors=False,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': [...], 'noise_mask': [...], 'loss_mask': [...]}
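
Because each call returns a plain dictionary of lists in this configuration, the wrapper composes naturally with datasets.Dataset.map. A hedged sketch (the `datasets` import and the "text" column name are assumptions for illustration, not part of this API):

>>> import datasets
>>> dataset = datasets.Dataset.from_dict(
...     {
...         "text": [
...             '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
...         ]
...     }
... )
>>> tokenized = dataset.map(lambda row: noisy_tokenizer(row["text"]))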

Added in version 0.57.0.

Changed in version 0.72.0: Added `transform_all_tokens` parameter.

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Changed in version 0.77.0: Added parameter `prompt_type` to support chat template.

Deprecated since version 0.96.0. Use the `TokenizerWrapper` class instead, manually casting the input to an appropriate Schema.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.
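
Per the deprecation note above, a minimal migration sketch composes a template function with the `TokenizerWrapper` class directly (reusing `json_string_template_function`, `tokenizer`, and `base_model` from the examples above):

>>> wrapper = TokenizerWrapper(
...     tokenizer=tokenizer,
...     model_type=type(base_model),
...     transform_all_tokens=False,
...     ignore_prompt_loss=True,
...     include_labels=False,
...     prompt_type=PromptType.INSTRUCTION,
... )
>>> def migrated_pipeline(user_message: str):
...     return wrapper(json_string_template_function(user_message))
>>> migrated_pipeline(
...     '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}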