noise_tokenizer

Classes:

| Name | Description |
| --- | --- |
| NoiseEncoding | A dictionary that contains the tokenized input IDs, noise mask, attention mask, and loss mask. |
| NoiseTokenizer | Augments tokenized text data with special tokens and masks for noise injection and loss computation. |
| TokenizerKwargs | Keyword arguments passed to apply_chat_template as tokenizer_kwargs. |

NoiseEncoding

Bases: TypedDict

A dictionary that contains the tokenized input IDs, noise mask, attention mask, and loss mask.
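
The mapping's fields match the keys listed in the Returns section of apply_chat_template below. As a rough sketch (the concrete value types are an assumption; depending on tokenizer settings they could be tensors or lists of token IDs):

```python
from typing import TypedDict

import torch


class NoiseEncoding(TypedDict, total=False):
    """Sketch of the documented fields; the tensor value types are assumed, not confirmed."""

    input_ids: torch.Tensor  # Tokenized input IDs.
    noise_mask: torch.Tensor  # Marks positions where Stained Glass Transform noise is injected.
    attention_mask: torch.Tensor  # Included when return_attention_mask=True.
    loss_mask: torch.Tensor  # Included when return_loss_mask=True.
```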

NoiseTokenizer

Augments tokenized text data with special tokens and masks for noise injection and loss computation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase | A Hugging Face tokenizer instance that will be used for tokenization and encoding. | required |

Added in version v0.139.0. A simplified API to tokenize messages for Stained Glass Transform.
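
For example, a NoiseTokenizer might be constructed by wrapping a Hugging Face tokenizer. This sketch assumes the import path from the module name above, and the model checkpoint is only illustrative:

```python
from transformers import AutoTokenizer

from noise_tokenizer import NoiseTokenizer  # Import path assumed from the module name above.

hf_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
noise_tokenizer = NoiseTokenizer(tokenizer=hf_tokenizer)
```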

Methods:

| Name | Description |
| --- | --- |
| apply_chat_template | Apply a chat template to a conversation or a batch of conversations, augmenting the input with special tokens and generating masks for noise and loss computation. |
apply_chat_template

```python
apply_chat_template(
    conversation: list[dict[str, str]]
    | list[list[dict[str, str]]],
    chat_template: str | None = None,
    add_generation_prompt: bool = False,
    continue_final_message: bool = False,
    padding: Literal["longest", "max_length", True] = True,
    truncation: bool = False,
    max_length: int | None = None,
    ignore_prompt_loss: bool = False,
    transform_all_tokens: bool = False,
    return_attention_mask: bool = False,
    return_loss_mask: bool = True,
    tokenizer_kwargs: TokenizerKwargs | None = None,
) -> NoiseEncoding
```

Apply a chat template to a conversation or a batch of conversations, augmenting the input with special tokens and generating masks for noise and loss computation.

Warning

Messages that have trailing/leading whitespace, or that are equal to the empty string, are not guaranteed to produce identical token_ids as the underlying tokenizer. This is because some tokenizers (e.g. LlamaTokenizer) strip the message when applying the chat template.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| conversation | list[dict[str, str]] \| list[list[dict[str, str]]] | A list of dicts with "role" and "content" keys, representing the chat history so far. | required |
| chat_template | str \| None | A Jinja template to use for this conversion. It is usually not necessary to pass anything to this argument, as the model's template is used by default. | None |
| add_generation_prompt | bool | If this is set, a prompt with the token(s) that indicate the start of an assistant message will be appended to the formatted output. This is useful when you want to generate a response from the model. Note that this argument is passed to the chat template, so it must be supported in the template for it to have any effect. | False |
| continue_final_message | bool | If this is set, the chat will be formatted so that the final message in the chat is open-ended, without any EOS tokens. The model will continue this message rather than starting a new one, allowing you to "prefill" part of the model's response. Cannot be used at the same time as add_generation_prompt. | False |
| padding | Literal["longest", "max_length", True] | Strategy to pad the returned sequences. True or "longest": pad to the longest sequence in the batch (or apply no padding if only a single sequence is provided). "max_length": pad to the length given by the max_length argument, or to the maximum acceptable input length for the model if that argument is not provided. | True |
| truncation | bool | Whether to truncate sequences at the maximum length. If max_length is not specified, the tokenizer's model_max_length attribute is used as the default maximum. | False |
| max_length | int \| None | Maximum length (in tokens) to use for padding or truncation. | None |
| ignore_prompt_loss | bool | Whether to ignore loss computation for the prompt, i.e. all messages except the last one. | False |
| transform_all_tokens | bool | Whether to also apply Stained Glass Transform to special tokens. | False |
| return_attention_mask | bool | Whether to return the attention mask. | False |
| return_loss_mask | bool | Whether to return the loss mask. | True |
| tokenizer_kwargs | TokenizerKwargs \| None | Additional kwargs to pass to the tokenizer. | None |

Returns:

| Type | Description |
| --- | --- |
| NoiseEncoding | A dictionary containing input_ids, noise_mask, attention_mask, and loss_mask. |
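
A minimal usage sketch, reusing the noise_tokenizer instance constructed above (the conversation content is made up for illustration):

```python
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

encoding = noise_tokenizer.apply_chat_template(
    conversation,
    ignore_prompt_loss=True,  # Compute loss only on the final (assistant) message.
    return_attention_mask=True,
)

# The returned mapping exposes the keys documented in the Returns table above.
print(sorted(encoding.keys()))  # ['attention_mask', 'input_ids', 'loss_mask', 'noise_mask']
```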

TokenizerKwargs

Bases: TypedDict

Keyword arguments passed to apply_chat_template as tokenizer_kwargs.
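
As an illustration, extra tokenizer options can be forwarded through this dictionary. The keys accepted by TokenizerKwargs are not enumerated on this page, so add_special_tokens below is only a plausible example, not a confirmed option:

```python
encoding = noise_tokenizer.apply_chat_template(
    conversation,
    tokenizer_kwargs={"add_special_tokens": False},  # Hypothetical key; check TokenizerKwargs for the supported set.
)
```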