
llama

Mapper classes (designed to be compatible with datasets.Dataset.map) for building Llama 2 and Llama 3 prompts for Stained Glass Transform training and testing.

Classes:

Name Description
Llama2InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Llama3ChatTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Attributes:

Name Type Description
LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE

The Hugging Face chat template for Llama 2.

LLAMA_2_SPECIAL_STRINGS Final[InstructionSpecialStrings]

Special string components of the Llama 2 prompt.

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE

The Hugging Face chat template for Llama 3.

LLAMA_3_SPECIAL_STRINGS Final[ChatSpecialStrings]

Special string components of the Llama 3 prompt.

LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE module-attribute

LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE = "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'system' %}{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}"

The Hugging Face chat template for Llama 2. This template was removed from transformers in https://github.com/huggingface/transformers/pull/31733.

LLAMA_2_SPECIAL_STRINGS module-attribute

LLAMA_2_SPECIAL_STRINGS: Final[
    InstructionSpecialStrings
] = InstructionSpecialStrings(
    INSTRUCTION_START="[INST]",
    SYSTEM_PROMPT_START="<<SYS>>",
    SYSTEM_PROMPT_END="<</SYS>>\n",
    CONTEXT_START="### INPUT:\n",
    INSTRUCTION_END="[/INST]",
)

Special string components of the Llama 2 prompt.

Based on: https://huggingface.co/blog/llama2#how-to-prompt-llama-2.

The prompt is structured as follows:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
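The layout above can be reproduced with a small helper. This is an illustrative sketch only (build_llama2_prompt is a hypothetical function, not part of this module), composing the literal special strings documented in LLAMA_2_SPECIAL_STRINGS:

```python
# Illustrative sketch: assembles the Llama 2 prompt layout shown above.
# build_llama2_prompt is a hypothetical helper, not part of this module.


def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    # <s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]
    return (
        "<s>[INST] <<SYS>>\n"
        + system_prompt
        + "\n<</SYS>>\n\n"
        + user_message
        + " [/INST]"
    )


prompt = build_llama2_prompt("You are helpful.", "Hi")
```

Note that the response, when present, is appended after the closing [/INST] tag, followed by the eos token.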

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE module-attribute

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

The Hugging Face chat template for Llama 3.

The tokenizer_config.json distributed by Meta for Llama 3 is configured with "tokenizer_class": "PreTrainedTokenizerFast", which falls back to the default chat template. Since there is no Llama 3-specific tokenizer class, supply this string as the chat_template argument to transformers.PreTrainedTokenizerBase.apply_chat_template to apply the correct chat template.
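Because the template is plain Jinja, its output can be previewed without downloading a tokenizer. The sketch below renders it with jinja2 directly; this is an approximation, since Hugging Face's own rendering uses a sandboxed environment with extra helpers, and in practice you would pass the string as chat_template to apply_chat_template instead:

```python
from jinja2 import Environment

# The Llama 3 chat template quoted above, reproduced verbatim.
LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE = (
    "{% set loop_messages = messages %}{% for message in loop_messages %}"
    "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
    "+ message['content'] | trim + '<|eot_id|>' %}"
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
    "{{ content }}{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
)

# Render with a plain jinja2 Environment instead of apply_chat_template,
# supplying the variables the template expects.
rendered = Environment().from_string(LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE).render(
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ],
    bos_token="<|begin_of_text|>",
    add_generation_prompt=True,
)
```

The rendered string matches the prompt structure documented below: bos token, then one role-header/content/eot block per message, then the assistant generation header.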

LLAMA_3_SPECIAL_STRINGS module-attribute

LLAMA_3_SPECIAL_STRINGS: Final[ChatSpecialStrings] = (
    ChatSpecialStrings(
        ROLES=ChatRoleStrings(
            SYSTEM_ROLE="system",
            USER_ROLE="user",
            ASSISTANT_ROLE="assistant",
        ),
        ROLE_HEADER_START="<|start_header_id|>",
        ROLE_HEADER_END="<|end_header_id|>\n\n",
        MESSAGE_END="<|eot_id|>",
    )
)

Special string components of the Llama 3 prompt.

Based on: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/.

The prompt is structured as follows:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
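The same layout can be composed directly from the components documented in LLAMA_3_SPECIAL_STRINGS. This is an illustrative sketch (build_llama3_prompt and render_message are hypothetical helpers, not part of this module):

```python
# Illustrative sketch using the documented special strings; these helpers
# are hypothetical, not part of this module.
ROLE_HEADER_START = "<|start_header_id|>"
ROLE_HEADER_END = "<|end_header_id|>\n\n"
MESSAGE_END = "<|eot_id|>"


def render_message(role: str, content: str) -> str:
    # One message block: role header, content, end-of-message marker.
    return ROLE_HEADER_START + role + ROLE_HEADER_END + content + MESSAGE_END


def build_llama3_prompt(system_prompt: str, user_message: str) -> str:
    # <|begin_of_text|> is the Llama 3 bos token; the trailing assistant
    # header is the generation prompt.
    return (
        "<|begin_of_text|>"
        + render_message("system", system_prompt)
        + render_message("user", user_message)
        + ROLE_HEADER_START + "assistant" + ROLE_HEADER_END
    )


prompt = build_llama3_prompt("You are helpful.", "Hi")
```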

Llama2InstructionTokenizerMapper dataclass

Bases: InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Classes:

Name Description
PromptTokens

Collection of all tokenized components of the prompt.

SchemaTokens

Tokenized intermediate prompt schema.

SpecialTokens

Tokenized special components of the prompt.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
always_include_context bool

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens SpecialTokens

The tokenized special prompt strings.

tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name Type Description
schema_tokens SchemaTokens

The tokenized schema components of the prompt.

special_tokens SpecialTokens

The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name Type Description
context Tensor

An optional context to append to the instruction.

instruction Tensor

The input to the model.

response Tensor

The expected model response to the instruction.

system_prompt Tensor

An optional system prompt for the model.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name Type Description
bos Tensor

The beginning-of-string token.

context_start Tensor

The delimiter between the instruction and the context.

eos Tensor

The end-of-string token.

instruction_end Tensor

The end of the instruction tag.

instruction_start Tensor

The start of the instruction tag.

system_prompt_end Tensor

The end of the system prompt.

system_prompt_start Tensor

The start of the system prompt.

bos instance-attribute

bos: Tensor

The beginning-of-string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end-of-string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.
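As a rough sketch of what this contract implies, the snippet below encodes text to an int64 tensor of token ids. It is not the actual implementation; _ToyTokenizer is a hypothetical stand-in for a real transformers tokenizer, used here so the example runs offline:

```python
import torch


class _ToyTokenizer:
    """Hypothetical stand-in for transformers.PreTrainedTokenizerBase."""

    vocab = {"Hello": 0, "world": 1}

    def __call__(self, text: str, add_special_tokens: bool = False) -> dict:
        # Whitespace-split toy encoding; a real tokenizer uses subwords.
        return {"input_ids": [self.vocab[word] for word in text.split()]}


def tokenize(tokenizer, text: str) -> torch.Tensor:
    # Encode without special tokens: bos/eos and other special strings are
    # tracked separately by the mapper, not embedded in each component.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return torch.tensor(ids, dtype=torch.int64)


tokens = tokenize(_ToyTokenizer(), "Hello world")
```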

Llama3ChatTokenizerMapper dataclass

Bases: ChatTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Added in version 0.77.0.

Classes:

Name Description
PromptTokens

Collection of all tokenized components of the prompt.

SchemaTokens

Tokenized intermediate prompt schema.

SpecialTokens

Tokenized special components of the prompt.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
special_tokens SpecialTokens

The tokenized special prompt strings.

tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name Type Description
schema_tokens list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens SpecialTokens

The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name Type Description
content Tensor

The content of the message.

role Tensor

The role of the message.

content instance-attribute

content: Tensor

The content of the message.

role instance-attribute

role: Tensor

The role of the message.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name Type Description
assistant_role Tensor

The assistant role.

bos Tensor

The beginning-of-string token.

message_end Tensor

The end of a message.

role_header_end Tensor

The end of the role header.

role_header_start Tensor

The start of the role header.

system_role Tensor

The system role.

user_role Tensor

The user role.

assistant_role instance-attribute

assistant_role: Tensor

The assistant role.

bos instance-attribute

bos: Tensor

The beginning-of-string token.

message_end instance-attribute

message_end: Tensor

The end of a message.

role_header_end instance-attribute

role_header_end: Tensor

The end of the role header.

role_header_start instance-attribute

role_header_start: Tensor

The start of the role header.

system_role instance-attribute

system_role: Tensor

The system role.

user_role instance-attribute

user_role: Tensor

The user role.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.