noise_tokenizer

Classes:

| Name | Description |
| --- | --- |
| NoiseEncoding | A dictionary that contains the tokenized input IDs, noise mask, attention mask, and loss mask. |
| NoiseTokenizer | Augments tokenized text data with special tokens and masks for noise injection and loss computation. |
| TokenizerKwargs | Keyword arguments passed to apply_chat_template as tokenizer_kwargs. |

NoiseEncoding

Bases: TypedDict

A dictionary that contains the tokenized input IDs, noise mask, attention mask, and loss mask.
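
The mapping's fields match the keys listed in the Returns section of apply_chat_template below. As a rough sketch (the concrete value types are an assumption; depending on tokenizer settings they could be tensors or lists of token IDs):

```python
from typing import TypedDict

import torch


class NoiseEncoding(TypedDict, total=False):
    """Sketch of the documented fields; the tensor value types are assumed, not confirmed."""

    input_ids: torch.Tensor  # Tokenized input IDs.
    noise_mask: torch.Tensor  # Marks positions where Stained Glass Transform noise is injected.
    attention_mask: torch.Tensor  # Included when return_attention_mask=True.
    loss_mask: torch.Tensor  # Included when return_loss_mask=True.
```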

NoiseTokenizer

Augments tokenized text data with special tokens and masks for noise injection and loss computation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase | A Hugging Face tokenizer instance that will be used for tokenization and encoding. | required |

Added in version v0.139.0. A simplified API to tokenize messages for Stained Glass Transform.
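
For example, a NoiseTokenizer might be constructed by wrapping a Hugging Face tokenizer. This sketch assumes the import path from the module name above, and the model checkpoint is only illustrative:

```python
from transformers import AutoTokenizer

from noise_tokenizer import NoiseTokenizer  # Import path assumed from the module name above.

hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
noise_tokenizer = NoiseTokenizer(tokenizer=hf_tokenizer)
```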

Methods:

| Name | Description |
| --- | --- |
| apply_chat_template | Apply a chat template to a conversation or a batch of conversations, augmenting the input with special tokens and generating masks for noise and loss computation. |
apply_chat_template

```python
apply_chat_template(
    conversation: list[dict[str, str]]
    | list[list[dict[str, str]]],
    chat_template: str | None = None,
    add_generation_prompt: bool = False,
    continue_final_message: bool = False,
    padding: Literal["longest", "max_length", True] = True,
    truncation: bool = False,
    max_length: int | None = None,
    ignore_prompt_loss: bool = False,
    transform_all_tokens: bool = False,
    return_attention_mask: bool = False,
    return_loss_mask: bool = True,
    tokenizer_kwargs: TokenizerKwargs | None = None,
) -> NoiseEncoding
```

Apply a chat template to a conversation or a batch of conversations, augmenting the input with special tokens and generating masks for noise and loss computation.

Warning

Messages that have trailing/leading whitespace, or that are equal to the empty string, are not guaranteed to produce identical token_ids as the underlying tokenizer. This is because some tokenizers (e.g. LlamaTokenizer) strip the message when applying the chat template.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| conversation | list[dict[str, str]] \| list[list[dict[str, str]]] | A list of dicts with "role" and "content" keys, representing the chat history so far. | required |
| chat_template | str \| None | A Jinja template to use for this conversion. It is usually not necessary to pass anything to this argument, as the model's template is used by default. | None |
| add_generation_prompt | bool | If this is set, a prompt with the token(s) that indicate the start of an assistant message will be appended to the formatted output. This is useful when you want to generate a response from the model. Note that this argument is passed to the chat template, so it must be supported in the template for it to have any effect. | False |
| continue_final_message | bool | If this is set, the chat will be formatted so that the final message in the chat is open-ended, without any EOS tokens. The model will continue this message rather than starting a new one, allowing you to "prefill" part of the model's response. Cannot be used at the same time as add_generation_prompt. | False |
| padding | Literal["longest", "max_length", True] | Strategy to pad the returned sequences. True or "longest": pad to the longest sequence in the batch (or apply no padding if only a single sequence is provided). "max_length": pad to the length given by the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided. | True |
| truncation | bool | Whether to truncate sequences at the maximum length. If max_length is not specified, the tokenizer's model_max_length attribute is used as the default maximum. | False |
| max_length | int \| None | Maximum length (in tokens) to use for padding or truncation. | None |
| ignore_prompt_loss | bool | Whether to ignore loss computation for the prompt, i.e. all messages except the last one. | False |
| transform_all_tokens | bool | Whether to also apply Stained Glass Transform to special tokens. | False |
| return_attention_mask | bool | Whether to return the attention mask. | False |
| return_loss_mask | bool | Whether to return the loss mask. | True |
| tokenizer_kwargs | TokenizerKwargs \| None | Additional kwargs to pass to the tokenizer. | None |

Returns:

| Type | Description |
| --- | --- |
| NoiseEncoding | A dictionary containing input_ids, noise_mask, attention_mask, and loss_mask. |
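
A minimal usage sketch, reusing the noise_tokenizer instance constructed above (the conversation content is made up for illustration):

```python
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

encoding = noise_tokenizer.apply_chat_template(
    conversation,
    ignore_prompt_loss=True,  # Compute loss only on the final (assistant) message.
    return_attention_mask=True,
)

# The returned mapping exposes the keys documented in the Returns table above.
print(sorted(encoding.keys()))  # ['attention_mask', 'input_ids', 'loss_mask', 'noise_mask']
```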

TokenizerKwargs

Bases: TypedDict

Keyword arguments passed to apply_chat_template as tokenizer_kwargs.
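
As an illustration, extra tokenizer options can be forwarded through this dictionary. The keys accepted by TokenizerKwargs are not enumerated on this page, so add_special_tokens below is only a plausible example, not a confirmed option:

```python
encoding = noise_tokenizer.apply_chat_template(
    conversation,
    tokenizer_kwargs={"add_special_tokens": False},  # Hypothetical key; check TokenizerKwargs for the supported set.
)
```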