tokenizer_wrapper
Classes:

Name | Description |
---|---|
`PromptType` | The type of prompt to use. |

Functions:

Name | Description |
---|---|
`noise_mask_tokenizer_wrapper` | Wrap a tokenizer with a pipeline that not only tokenizes a prompt, but also generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses. |
TokenizerWrapper
Bases: Generic[TokenizerWrapperReturnT_co, SchemaT_contra]
Wraps a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.
Calling this class will return a dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable).
Examples:
Instantiating a pipeline for a Transformers model:
>>> model_id = "tests/resources/tokenizers/mini-Mistral-7B-Instruct-v0.2"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> config = transformers.AutoConfig.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_config(config)
Using a `TokenizerWrapper` for training a Stained Glass Transform model:
>>> noisy_tokenizer = TokenizerWrapper(
... tokenizer=tokenizer,
... model_type=type(base_model),
... transform_all_tokens=False,
... ignore_prompt_loss=True,
... include_labels=False,
... prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
... {
... "context": "This is a context.",
... "instruction": "Please respond to the following message.",
... "response": "Hello!",
... "system_prompt": "You are a helpful assistant.",
... }
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}
Using a `TokenizerWrapper` for testing a Stained Glass Transform model:
>>> noisy_tokenizer = TokenizerWrapper(
... tokenizer=tokenizer,
... model_type=type(base_model),
... transform_all_tokens=False,
... ignore_prompt_loss=True,
... include_labels=True,
... prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
... {
... "context": "This is a context.",
... "instruction": "Please respond to the following message.",
... "response": "Hello!",
... "system_prompt": "You are a helpful assistant.",
... }
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}
Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.
Methods:

Name | Description |
---|---|
`__call__` | Apply the pipeline to a schema, generating the tokenized prompt, the noise mask, and a loss mask (if applicable). |
`__init__` | Wrap a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses. |
Attributes:

Name | Type | Description |
---|---|---|
`device` | `device \| None` | The device to use for the generated tensors. |
device (property, writable)
device: device | None
The device to use for the generated tensors.
This device will respect any `map_location` specified in `torch.load` or when loading a `StainedGlassTransformForText` model.
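For example, a minimal sketch of moving subsequent outputs to a GPU (this assumes a CUDA device is available and reuses the `noisy_tokenizer` instance from the examples above):

>>> import torch
>>> noisy_tokenizer.device = torch.device("cuda:0")  # the property is writable; later outputs are created on this device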
__call__
__call__(schema: SchemaT_contra) -> TokenizerWrapperReturnT_co
Apply the pipeline to a schema, generating the tokenized prompt, the noise mask, and a loss mask (if applicable).
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`schema` | `SchemaT_contra` | The schema to apply the pipeline to. This will be the typed dictionary given by `SchemaT_contra`. | required |
Returns:

Type | Description |
---|---|
`TokenizerWrapperReturnT_co` | A dictionary containing the tokenized prompt, noise mask, and loss mask (if applicable). See the class and constructor documentation for more details on the contents of the output dictionary for given constructor arguments. |
__init__
__init__(
    tokenizer: PreTrainedTokenizerBase | None,
    model_type: type[PreTrainedModel] | PreTrainedModel | None,
    include_labels: bool,
    ignore_prompt_loss: bool,
    device: str | int | device | None = None,
    always_include_context: bool = False,
    tokenizer_mapper: TokenizerMapper | MistralMultiturnTransformLayerMapper | None = None,
    transform_all_tokens: bool = False,
    prompt_type: PromptType = PromptType.INSTRUCTION,
) -> None
Wrap a tokenizer with a pipeline that tokenizes a prompt and generates token masks for applying the Stained Glass Transform and/or calculating losses.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`tokenizer` | `PreTrainedTokenizerBase \| None` | The tokenizer to use. Can be `None`. | required |
`model_type` | `type[PreTrainedModel] \| PreTrainedModel \| None` | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. Can be `None`. | required |
`include_labels` | `bool` | Whether to include labels in the output. | required |
`ignore_prompt_loss` | `bool` | Whether to ignore the prompt when generating the loss mask. Only relevant if `include_labels` is `False`. | required |
`device` | `str \| int \| device \| None` | The device to cast the output tensors to. If `None`, the output tensors will not be cast to any device. | `None` |
`always_include_context` | `bool` | Whether to always include the context start tokens in the tokenization output. | `False` |
`tokenizer_mapper` | `TokenizerMapper \| MistralMultiturnTransformLayerMapper \| None` | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of `TokenizerMapper`. | `None` |
`transform_all_tokens` | `bool` | Whether to transform all tokens in the prompt. | `False` |
`prompt_type` | `PromptType` | The type of prompt to use. | `PromptType.INSTRUCTION` |
Note

If `include_labels` is `False`, the resulting pipeline will return a dictionary also containing the key `loss_mask`. This key will contain a list of `True` and `False` values, where `True` indicates tokens that should be considered when calculating the distillation loss. This is useful for only considering the tokens that had the Stained Glass Transform applied. If `ignore_prompt_loss` is `True`, the prompt will also not be considered when generating the loss mask, even if it had the Stained Glass Transform applied to it.
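For illustration, a minimal sketch of gating a per-token loss with the returned `loss_mask`, using the training-configured `noisy_tokenizer` from the examples above (`include_labels=False`); the per-token loss values here are random placeholders, not something this class produces:

>>> import torch
>>> out = noisy_tokenizer(
...     {
...         "context": "This is a context.",
...         "instruction": "Please respond to the following message.",
...         "response": "Hello!",
...         "system_prompt": "You are a helpful assistant.",
...     }
... )
>>> per_token_loss = torch.rand(out["input_ids"].shape)  # placeholder per-token losses
>>> loss = (per_token_loss * out["loss_mask"]).sum() / out["loss_mask"].sum()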
Raises:

Type | Description |
---|---|
`ValueError` | If the model is not supported. |
`ValueError` | If neither … |
`ValueError` | If … |
Changed in version v0.128.0: Added support for the pretrain prompt type.
Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.
noise_mask_tokenizer_wrapper
noise_mask_tokenizer_wrapper(
    tokenizer: PreTrainedTokenizerBase,
    model_type: type[PreTrainedModel] | PreTrainedModel,
    template_function: Callable[SchemaMapperP, SchemaT_contra],
    include_labels: bool,
    include_noise_mask: bool,
    ignore_prompt_loss: bool,
    return_tensors: bool = False,
    device: str | int | device | None = None,
    always_include_context: bool = False,
    tokenizer_mapper: TokenizerMapper | None = None,
    transform_all_tokens: bool = False,
    prompt_type: PromptType = PromptType.INSTRUCTION,
) -> (
    Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TestMapper.TestInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[torch.Tensor]]
    | Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[list[int]]]
    | Callable[SchemaMapperP, universal.TestMapper.TestInput[list[int]]]
    | Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[list[int]]]
    | Callable[SchemaMapperP, universal.TrainMapper.TrainInput[list[int]]]
)
Wrap a tokenizer with a pipeline that not only tokenizes a prompt, but also generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
`tokenizer` | `PreTrainedTokenizerBase` | The tokenizer to use. | required |
`model_type` | `type[PreTrainedModel] \| PreTrainedModel` | Type of the model to use for inferring the correct schema mapper. If a model instance is provided, the type of the model will be inferred from the instance. | required |
`template_function` | `Callable[SchemaMapperP, SchemaT_contra]` | The callable to use to map the dataset element to an intermediate schema. The callable can have any signature, but must return a `Schema`. | required |
`include_labels` | `bool` | Whether to include labels in the output. | required |
`include_noise_mask` | `bool` | Whether to include a noise mask. The noise mask is 1 for tokens that should be changed by the Stained Glass Transform, and 0 otherwise. | required |
`ignore_prompt_loss` | `bool` | Whether to ignore the prompt when generating the model loss mask. Only relevant if both `include_labels` is `False` and `include_noise_mask` is `True`. | required |
`return_tensors` | `bool` | Whether to return the output as tensors or lists. | `False` |
`device` | `str \| int \| device \| None` | The device to cast the output tensors to. If `None`, the output tensors will not be cast to any device. | `None` |
`always_include_context` | `bool` | Whether to always include the context start tokens in the tokenization output. | `False` |
`tokenizer_mapper` | `TokenizerMapper \| None` | The tokenizer mapper to use. If not provided, the pipeline will infer the tokenizer mapper from the model. If specified, it must be a subclass of `TokenizerMapper`. | `None` |
`transform_all_tokens` | `bool` | Whether to transform all tokens in the prompt. | `False` |
`prompt_type` | `PromptType` | The type of prompt to use. | `PromptType.INSTRUCTION` |
Note

If `include_labels == False` and `include_noise_mask == True`, the resulting pipeline will return a dictionary also containing the key `loss_mask`. This key will contain a list of 1s and 0s, where 1s indicate tokens that should be considered when calculating the model loss. This is useful for only considering the tokens that had the Stained Glass Transform applied to them. If `ignore_prompt_loss` is `True`, the prompt will also not be considered when generating the loss mask, even if it had the Stained Glass Transform applied to it.
Returns:

Type | Description |
---|---|
`Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TestMapper.TestInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TrainMapper.TrainInput[torch.Tensor]] \| Callable[SchemaMapperP, universal.TransformLayerTestMapper.TransformLayerTestInput[list[int]]] \| Callable[SchemaMapperP, universal.TestMapper.TestInput[list[int]]] \| Callable[SchemaMapperP, universal.TransformLayerTrainMapper.TransformLayerTrainInput[list[int]]] \| Callable[SchemaMapperP, universal.TrainMapper.TrainInput[list[int]]]` | A functional pipeline that tokenizes a prompt and generates masks for tokens that should be masked while applying the Stained Glass Transform and/or while calculating model losses. The input(s) of the pipeline will be the input(s) of the `template_function`. |
Raises:

Type | Description |
---|---|
`ValueError` | If the model is not supported. |
Examples:
Instantiating a pipeline for a Llama2 model:
>>> model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(model_id)
Template functions can take in any args/kwargs, but must return a `Schema`. See the docs for `Schema` for more details.
>>> import json
>>> def json_string_template_function(
... user_message: str,
... ) -> universal.Schema:
... '''Take a user message containing the json template below, and return an instruction schema.
...
... ```
... {
... "context": "This is a context.",
... "instruction": "Please respond to the context.",
... "response": "This is a response."
... }
... ```
... '''
... loaded_json = json.loads(user_message)
... return {
... "system_prompt": "You are a helpful assistant. Be courteous and helpful.",
... "context": loaded_json["context"],
... "instruction": loaded_json["instruction"],
... "response": loaded_json["response"],
... }
Using a Tokenizer Wrapper for training a Stained Glass Transform model:
>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
... tokenizer=tokenizer,
... model_type=type(base_model),
... template_function=json_string_template_function,
... transform_all_tokens=False,
... ignore_prompt_loss=True,
... include_labels=False,
... include_noise_mask=True,
... return_tensors=True,
... prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
... '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'noise_mask': tensor(...), 'loss_mask': tensor(...)}
Using a Tokenizer Wrapper for testing a Stained Glass Transform model:
>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
... tokenizer=tokenizer,
... model_type=type(base_model),
... template_function=json_string_template_function,
... transform_all_tokens=False,
... ignore_prompt_loss=True,
... include_labels=True,
... include_noise_mask=True,
... return_tensors=True,
... prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
... '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...), 'noise_mask': tensor(...)}
Using a Tokenizer Wrapper for training a Stained Glass Transform model without a noise mask (or, less commonly, for tuning a base model without the Stained Glass Transform):
>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
... tokenizer=tokenizer,
... model_type=type(base_model),
... template_function=json_string_template_function,
... transform_all_tokens=False,
... ignore_prompt_loss=True,
... include_labels=False,
... include_noise_mask=False,
... return_tensors=True,
... prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
... '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...)}
Using a Tokenizer Wrapper for testing a Stained Glass Transform model without a noise mask (or, less commonly, for tuning a base model without the Stained Glass Transform):
>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
... tokenizer=tokenizer,
... model_type=type(base_model),
... template_function=json_string_template_function,
... transform_all_tokens=False,
... ignore_prompt_loss=True,
... include_labels=True,
... include_noise_mask=False,
... return_tensors=True,
... prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
... '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': tensor(...), 'labels': tensor(...)}
Using a Tokenizer Wrapper for training a Stained Glass Transform model, but returning lists instead of tensors. This can be useful when using a Hugging Face data collator that expects lists:
>>> noisy_tokenizer = noise_mask_tokenizer_wrapper(
... tokenizer=tokenizer,
... model_type=type(base_model),
... template_function=json_string_template_function,
... transform_all_tokens=False,
... ignore_prompt_loss=True,
... include_labels=False,
... include_noise_mask=True,
... return_tensors=False,
... prompt_type=PromptType.INSTRUCTION,
... )
>>> noisy_tokenizer(
... '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
... )
{'input_ids': [...], 'noise_mask': [...], 'loss_mask': [...]}
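As a sketch of that pattern (this assumes the Hugging Face `datasets` library; the dataset and its `payload` column are hypothetical):

>>> import datasets
>>> ds = datasets.Dataset.from_dict(
...     {
...         "payload": [
...             '{"context": "This is a context.", "instruction": "Please respond to the following message.", "response": "Hello!"}'
...         ]
...     }
... )
>>> tokenized_ds = ds.map(lambda example: noisy_tokenizer(example["payload"]))

The resulting list-valued columns (`input_ids`, `noise_mask`, `loss_mask`) can then be padded and batched by a data collator that handles those keys.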
Added in version 0.57.0.
Changed in version 0.72.0: Added `transform_all_tokens` parameter.
Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.
Changed in version 0.77.0: Added parameter `prompt_type` to support chat template.
Deprecated since version 0.96.0. Use the `TokenizerWrapper` class instead, manually casting the input to an appropriate `Schema`.
Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the `TokenizerWrapper`.