
llama

Mapper classes (designed to be compatible with datasets.Dataset.map) for building Llama 2 and Llama 3 prompts for Stained Glass Transform training and testing.

Classes:

Name Description
Llama2InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Llama3ChatTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Attributes:

Name Type Description
LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE

The Hugging Face chat template for Llama 2.

LLAMA_2_SPECIAL_STRINGS Final[InstructionSpecialStrings]

Special string components of the Llama 2 prompt.

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE

The Hugging Face chat template for Llama 3.

LLAMA_3_SPECIAL_STRINGS Final[ChatSpecialStrings]

Special string components of the Llama 3 prompt.

LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE module-attribute

LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE = "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'system' %}{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}"

The Hugging Face chat template for Llama 2. This template was removed from transformers in https://github.com/huggingface/transformers/pull/31733.

LLAMA_2_SPECIAL_STRINGS module-attribute

LLAMA_2_SPECIAL_STRINGS: Final[
    InstructionSpecialStrings
] = InstructionSpecialStrings(
    INSTRUCTION_START="[INST]",
    SYSTEM_PROMPT_START="<<SYS>>",
    SYSTEM_PROMPT_END="<</SYS>>\n",
    CONTEXT_START="### INPUT:\n",
    INSTRUCTION_END="[/INST]",
)

Special string components of the Llama 2 prompt.

Based on: https://huggingface.co/blog/llama2#how-to-prompt-llama-2.

The prompt is structured as follows:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
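The layout above can be reproduced with a small helper. This is an illustrative sketch only (build_llama2_prompt is a hypothetical function, not part of this module), composing the literal special strings documented in LLAMA_2_SPECIAL_STRINGS:

```python
# Illustrative sketch: assembles the Llama 2 prompt layout shown above.
# build_llama2_prompt is a hypothetical helper, not part of this module.


def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    # <s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]
    return (
        "<s>[INST] <<SYS>>\n"
        + system_prompt
        + "\n<</SYS>>\n\n"
        + user_message
        + " [/INST]"
    )


prompt = build_llama2_prompt("You are helpful.", "Hi")
```

Note that the response, when present, is appended after the closing [/INST] tag, followed by the eos token.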

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE module-attribute

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

The Hugging Face chat template for Llama 3.

The tokenizer_config.json distributed by Meta for Llama 3 is configured with "tokenizer_class": "PreTrainedTokenizerFast", which falls back to the default chat template. Since there is no Llama 3-specific tokenizer class, supply this string as the chat_template argument to transformers.PreTrainedTokenizerBase.apply_chat_template to apply the correct chat template.
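Because the template is plain Jinja, its output can be previewed without downloading a tokenizer. The sketch below renders it with jinja2 directly; this is an approximation, since Hugging Face's own rendering uses a sandboxed environment with extra helpers, and in practice you would pass the string as chat_template to apply_chat_template instead:

```python
from jinja2 import Environment

# The Llama 3 chat template quoted above, reproduced verbatim.
LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE = (
    "{% set loop_messages = messages %}{% for message in loop_messages %}"
    "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'"
    "+ message['content'] | trim + '<|eot_id|>' %}"
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
    "{{ content }}{% endfor %}"
    "{% if add_generation_prompt %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
)

# Render with a plain jinja2 Environment instead of apply_chat_template,
# supplying the variables the template expects.
rendered = Environment().from_string(LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE).render(
    messages=[
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ],
    bos_token="<|begin_of_text|>",
    add_generation_prompt=True,
)
```

The rendered string matches the prompt structure documented below: bos token, then one role-header/content/eot block per message, then the assistant generation header.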

LLAMA_3_SPECIAL_STRINGS module-attribute

LLAMA_3_SPECIAL_STRINGS: Final[ChatSpecialStrings] = (
    ChatSpecialStrings(
        ROLES=ChatRoleStrings(
            SYSTEM_ROLE="system",
            USER_ROLE="user",
            ASSISTANT_ROLE="assistant",
        ),
        ROLE_HEADER_START="<|start_header_id|>",
        ROLE_HEADER_END="<|end_header_id|>\n\n",
        MESSAGE_END="<|eot_id|>",
    )
)

Special string components of the Llama 3 prompt.

Based on: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/.

The prompt is structured as follows:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
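The same layout can be composed directly from the components documented in LLAMA_3_SPECIAL_STRINGS. This is an illustrative sketch (build_llama3_prompt and render_message are hypothetical helpers, not part of this module):

```python
# Illustrative sketch using the documented special strings; these helpers
# are hypothetical, not part of this module.
ROLE_HEADER_START = "<|start_header_id|>"
ROLE_HEADER_END = "<|end_header_id|>\n\n"
MESSAGE_END = "<|eot_id|>"


def render_message(role: str, content: str) -> str:
    # One message block: role header, content, end-of-message marker.
    return ROLE_HEADER_START + role + ROLE_HEADER_END + content + MESSAGE_END


def build_llama3_prompt(system_prompt: str, user_message: str) -> str:
    # <|begin_of_text|> is the Llama 3 bos token; the trailing assistant
    # header is the generation prompt.
    return (
        "<|begin_of_text|>"
        + render_message("system", system_prompt)
        + render_message("user", user_message)
        + ROLE_HEADER_START + "assistant" + ROLE_HEADER_END
    )


prompt = build_llama3_prompt("You are helpful.", "Hi")
```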

Llama2InstructionTokenizerMapper dataclass

Bases: InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Classes:

Name Description
PromptTokens

Collection of all tokenized components of the prompt.

SchemaTokens

Tokenized intermediate prompt schema.

SpecialTokens

Tokenized special components of the prompt.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
always_include_context bool

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens SpecialTokens

The tokenized special prompt strings.

tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name Type Description
schema_tokens SchemaTokens

The tokenized schema components of the prompt.

special_tokens SpecialTokens

The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name Type Description
context Tensor

An optional context to append to the instruction.

instruction Tensor

The input to the model.

response Tensor

The expected model response to the instruction.

system_prompt Tensor

An optional system prompt for the model.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name Type Description
bos Tensor

The beginning-of-string token.

context_start Tensor

The delimiter between the instruction and the context.

eos Tensor

The end-of-string token.

instruction_end Tensor

The end of the instruction tag.

instruction_start Tensor

The start of the instruction tag.

system_prompt_end Tensor

The end of the system prompt.

system_prompt_start Tensor

The start of the system prompt.

bos instance-attribute

bos: Tensor

The beginning-of-string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end-of-string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.
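As a rough sketch of what this contract implies, the snippet below encodes text to an int64 tensor of token ids. It is not the actual implementation; _ToyTokenizer is a hypothetical stand-in for a real transformers tokenizer, used here so the example runs offline:

```python
import torch


class _ToyTokenizer:
    """Hypothetical stand-in for transformers.PreTrainedTokenizerBase."""

    vocab = {"Hello": 0, "world": 1}

    def __call__(self, text: str, add_special_tokens: bool = False) -> dict:
        # Whitespace-split toy encoding; a real tokenizer uses subwords.
        return {"input_ids": [self.vocab[word] for word in text.split()]}


def tokenize(tokenizer, text: str) -> torch.Tensor:
    # Encode without special tokens: bos/eos and other special strings are
    # tracked separately by the mapper, not embedded in each component.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return torch.tensor(ids, dtype=torch.int64)


tokens = tokenize(_ToyTokenizer(), "Hello world")
```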

Llama3ChatTokenizerMapper dataclass

Bases: ChatTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Added in version 0.77.0.

Classes:

Name Description
PromptTokens

Collection of all tokenized components of the prompt.

SchemaTokens

Tokenized intermediate prompt schema.

SpecialTokens

Tokenized special components of the prompt.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
special_tokens SpecialTokens

The tokenized special prompt strings.

tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name Type Description
schema_tokens list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens SpecialTokens

The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name Type Description
content Tensor

The content of the message.

role Tensor

The role of the message.

content instance-attribute

content: Tensor

The content of the message.

role instance-attribute

role: Tensor

The role of the message.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name Type Description
assistant_role Tensor

The assistant role.

bos Tensor

The beginning-of-string token.

message_end Tensor

The end of a message.

role_header_end Tensor

The end of the role header.

role_header_start Tensor

The start of the role header.

system_role Tensor

The system role.

user_role Tensor

The user role.

assistant_role instance-attribute

assistant_role: Tensor

The assistant role.

bos instance-attribute

bos: Tensor

The beginning-of-string token.

message_end instance-attribute

message_end: Tensor

The end of a message.

role_header_end instance-attribute

role_header_end: Tensor

The end of the role header.

role_header_start instance-attribute

role_header_start: Tensor

The start of the role header.

system_role instance-attribute

system_role: Tensor

The system role.

user_role instance-attribute

user_role: Tensor

The user role.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.