noise_tokenizer
Classes:

Name | Description
---|---
`NoiseEncoding` | A dictionary that contains the tokenized input IDs, noise mask, attention mask, and loss mask.
`NoiseTokenizer` | Augments tokenized text data with special tokens and masks for noise injection and loss computation.
`TokenizerKwargs` | Keyword arguments passed to `apply_chat_template` as `tokenizer_kwargs`.
NoiseEncoding
Bases: TypedDict
A dictionary that contains the tokenized input IDs, noise mask, attention mask, and loss mask.
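The exact key names of `NoiseEncoding` are not shown on this page. As a hedged sketch, assuming the four fields map to keys named `input_ids`, `noise_mask`, `attention_mask`, and `loss_mask`, its shape could be modeled as:

```python
from typing import List, TypedDict


class NoiseEncodingSketch(TypedDict):
    """Illustrative stand-in for NoiseEncoding; the key names are assumptions."""

    input_ids: List[int]       # tokenized input IDs
    noise_mask: List[int]      # 1 where noise (Stained Glass Transform) is applied
    attention_mask: List[int]  # standard attention mask
    loss_mask: List[int]       # 1 where a token contributes to the loss


# Example instance (token IDs are invented):
encoding: NoiseEncodingSketch = {
    "input_ids": [101, 2023, 102],
    "noise_mask": [0, 1, 0],
    "attention_mask": [1, 1, 1],
    "loss_mask": [0, 1, 1],
}
```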
NoiseTokenizer
Augments tokenized text data with special tokens and masks for noise injection and loss computation.
Parameters:

Name | Type | Description | Default
---|---|---|---
`tokenizer` | `PreTrainedTokenizerBase` | A Hugging Face tokenizer instance that will be used for tokenization and encoding. | *required*
Added in version v0.139.0. A simplified API to tokenize messages for Stained Glass Transform.
Methods:

Name | Description
---|---
`apply_chat_template` | Apply a chat template to a conversation or a batch of conversations, augmenting the input with special tokens and generating masks for noise and loss computation.
apply_chat_template
```python
apply_chat_template(
    conversation: list[dict[str, str]]
    | list[list[dict[str, str]]],
    chat_template: str | None = None,
    add_generation_prompt: bool = False,
    continue_final_message: bool = False,
    padding: Literal["longest", "max_length", True] = True,
    truncation: bool = False,
    max_length: int | None = None,
    ignore_prompt_loss: bool = False,
    transform_all_tokens: bool = False,
    return_attention_mask: bool = False,
    return_loss_mask: bool = True,
    tokenizer_kwargs: TokenizerKwargs | None = None,
) -> NoiseEncoding
```
Apply a chat template to a conversation or a batch of conversations, augmenting the input with special tokens and generating masks for noise and loss computation.
Warning

Messages that have trailing/leading whitespace, or that are equal to the empty string, are not guaranteed to produce identical `token_ids` as the underlying tokenizer, because some tokenizers (e.g. `LlamaTokenizer`) strip the message when applying the chat template.
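For reference, the `conversation` argument follows the standard chat format: a list of role/content dicts for a single conversation, or a list of such lists for a batch. A minimal sketch (message contents are invented):

```python
# One conversation: a list of {"role": ..., "content": ...} dicts.
conversation = [
    {"role": "user", "content": "Summarize the quarterly report."},
    {"role": "assistant", "content": "The report shows steady revenue growth."},
]

# A batch of conversations: a list of conversation lists.
batch = [
    conversation,
    [{"role": "user", "content": "Hello!"}],
]
```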
Parameters:

Name | Type | Description | Default
---|---|---|---
`conversation` | `list[dict[str, str]] \| list[list[dict[str, str]]]` | A list of dicts with "role" and "content" keys, representing the chat history so far. | *required*
`chat_template` | `str \| None` | A Jinja template to use for this conversion. It is usually not necessary to pass anything to this argument, as the tokenizer's default chat template is used when this is `None`. | `None`
`add_generation_prompt` | `bool` | If this is set, a prompt with the token(s) that indicate the start of an assistant message will be appended to the formatted output. This is useful when you want to generate a response from the model. Note that this argument will be passed to the chat template, and so it must be supported in the template for this argument to have any effect. | `False`
`continue_final_message` | `bool` | If this is set, the chat will be formatted so that the final message in the chat is open-ended, without any EOS tokens. The model will continue this message rather than starting a new one. This allows you to "prefill" part of the model's response for it. Cannot be used at the same time as `add_generation_prompt`. | `False`
`padding` | `Literal['longest', 'max_length', True]` | Strategy to pad the returned sequences: `True` or `'longest'` pads to the longest sequence in the batch; `'max_length'` pads to `max_length`. | `True`
`truncation` | `bool` | Whether to truncate sequences at the maximum length. If not specified, the tokenizer's default truncation behavior is used. | `False`
`max_length` | `int \| None` | Maximum length (in tokens) to use for padding or truncation. | `None`
`ignore_prompt_loss` | `bool` | Whether to ignore loss computation for the prompt, i.e. all messages except the last one. | `False`
`transform_all_tokens` | `bool` | Whether to also apply Stained Glass Transform to special tokens. | `False`
`return_attention_mask` | `bool` | Whether to return the attention mask. | `False`
`return_loss_mask` | `bool` | Whether to return the loss mask. | `True`
`tokenizer_kwargs` | `TokenizerKwargs \| None` | Additional kwargs to pass to the tokenizer. | `None`
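To make the `ignore_prompt_loss` semantics concrete, here is a minimal pure-Python sketch (not the library's implementation) of how a loss mask could be derived from per-message token spans, assuming the prompt is every message except the last:

```python
from typing import List


def build_loss_mask(
    token_spans: List[List[int]], ignore_prompt_loss: bool
) -> List[int]:
    """Illustrative sketch: build a per-token loss mask for a conversation
    whose messages were tokenized into `token_spans` (one list of token IDs
    per message).

    With ignore_prompt_loss=True, only the final message contributes to the
    loss; all earlier (prompt) tokens are masked out with 0.
    """
    mask: List[int] = []
    for i, span in enumerate(token_spans):
        is_final = i == len(token_spans) - 1
        keep = 1 if (is_final or not ignore_prompt_loss) else 0
        mask.extend([keep] * len(span))
    return mask
```

For example, with two messages of 2 and 3 tokens, `ignore_prompt_loss=True` zeroes the first two mask positions and keeps the last three.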
Returns:

Type | Description
---|---
`NoiseEncoding` | A dictionary containing the tokenized input IDs, noise mask, attention mask, and loss mask.
TokenizerKwargs

Bases: `TypedDict`

Keyword arguments passed to `apply_chat_template` as `tokenizer_kwargs`.