llama

Mapper classes, designed to be compatible with datasets.Dataset.map, for building Llama 2 and Llama 3 prompts for Stained Glass Transform training and testing.

LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE module-attribute

LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE = "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'system' %}{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}"

The Hugging Face chat template for Llama 2. This template was removed from the transformers library in https://github.com/huggingface/transformers/pull/31733, so it may need to be supplied explicitly when formatting Llama 2 prompts.

LLAMA_2_SPECIAL_STRINGS module-attribute

LLAMA_2_SPECIAL_STRINGS: Final[InstructionSpecialStrings] = InstructionSpecialStrings(INSTRUCTION_START='[INST]', SYSTEM_PROMPT_START='<<SYS>>', SYSTEM_PROMPT_END='<</SYS>>\n', CONTEXT_START='### INPUT:\n', INSTRUCTION_END='[/INST]')

Special string components of the Llama 2 prompt.

Based on: https://huggingface.co/blog/llama2#how-to-prompt-llama-2.

The prompt is structured as follows:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_message }} [/INST]
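The structure above can be sketched in plain Python. The helper below is hypothetical (not part of this module); it mirrors how the Llama 2 chat template folds a leading system message into the first user turn:

```python
def build_llama2_prompt(messages, bos_token="<s>", eos_token="</s>"):
    """Render a list of {'role', 'content'} dicts in the Llama 2 prompt format."""
    # A leading system message is folded into the first user turn,
    # wrapped in <<SYS>> ... <</SYS>> markers, as the chat template does.
    system_message = None
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]

    parts = []
    for i, message in enumerate(messages):
        content = message["content"]
        if i == 0 and system_message is not None:
            content = f"<<SYS>>\n{system_message}\n<</SYS>>\n\n{content}"
        if message["role"] == "user":
            # Each user turn is wrapped in BOS + [INST] ... [/INST].
            parts.append(f"{bos_token}[INST] {content.strip()} [/INST]")
        elif message["role"] == "assistant":
            # Assistant turns are followed by EOS.
            parts.append(f" {content.strip()} {eos_token}")
    return "".join(parts)

prompt = build_llama2_prompt(
    [
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": "What is a llama?"},
    ]
)
```

Here `prompt` renders as `<s>[INST] <<SYS>>\nBe concise.\n<</SYS>>\n\nWhat is a llama? [/INST]`, matching the layout shown above.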

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE module-attribute

LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

The Hugging Face chat template for Llama 3.

The tokenizer_config.json distributed by Meta for Llama 3 is configured with "tokenizer_class": "PreTrainedTokenizerFast", which falls back to the default chat template. Since there is no Llama 3-specific tokenizer class, supply this constant as the chat_template argument to transformers.PreTrainedTokenizer.apply_chat_template to apply the correct chat template.

LLAMA_3_SPECIAL_STRINGS module-attribute

LLAMA_3_SPECIAL_STRINGS: Final[ChatSpecialStrings] = ChatSpecialStrings(ROLES=ChatRoleStrings(SYSTEM_ROLE='system', USER_ROLE='user', ASSISTANT_ROLE='assistant'), ROLE_HEADER_START='<|start_header_id|>', ROLE_HEADER_END='<|end_header_id|>\n\n', MESSAGE_END='<|eot_id|>')

Special string components of the Llama 3 prompt.

Based on: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/.

The prompt is structured as follows:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_message }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
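As with Llama 2, the layout can be sketched in plain Python. This hypothetical helper (not part of this module) follows the Llama 3 chat template: every message gets a role header and <|eot_id|> terminator, the first message is prefixed with BOS, and an optional trailing assistant header cues generation:

```python
def build_llama3_prompt(messages, bos_token="<|begin_of_text|>", add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts in the Llama 3 prompt format."""
    parts = []
    for i, message in enumerate(messages):
        chunk = (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
        if i == 0:
            # BOS appears only once, before the first message.
            chunk = bos_token + chunk
        parts.append(chunk)
    if add_generation_prompt:
        # An empty assistant header cues the model to generate its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = build_llama3_prompt(
    [
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": "What is a llama?"},
    ]
)
```

Unlike the Llama 2 format, roles are explicit in the rendered string, so system, user, and assistant turns all share one uniform header/terminator shape.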

Llama2InstructionTokenizerMapper dataclass

Bases: InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

bos instance-attribute

bos: Tensor

The beginning-of-string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end-of-string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

    text (str): The text to tokenize. Required.

Returns:

    torch.Tensor: An int64 tensor of token ids.
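The contract here is that tokenize returns a 1-D int64 tensor of token ids. A minimal sketch of that contract, using a toy whitespace vocabulary in place of a real LLM tokenizer (the helper and vocabulary are illustrative, not this class's implementation):

```python
import torch

def tokenize(text: str, vocab: dict) -> torch.Tensor:
    """Map whitespace-separated words to ids (toy stand-in for an LLM tokenizer)."""
    ids = [vocab[word] for word in text.split()]
    # int64 matches the dtype torch embedding layers expect for token ids.
    return torch.tensor(ids, dtype=torch.int64)

vocab = {"hello": 0, "world": 1}
tokens = tokenize("hello world", vocab)
```

A real implementation would delegate to the mapper's tokenizer attribute, but the returned dtype and shape follow the same contract.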

Llama3ChatTokenizerMapper dataclass

Bases: ChatTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Added in version 0.77.0.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

schema_tokens instance-attribute

schema_tokens: list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

content instance-attribute

content: Tensor

The content of the message.

role instance-attribute

role: Tensor

The role of the message.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

assistant_role instance-attribute

assistant_role: Tensor

The assistant role.

bos instance-attribute

bos: Tensor

The beginning-of-string token.

message_end instance-attribute

message_end: Tensor

The end of a message.

role_header_end instance-attribute

role_header_end: Tensor

The end of the role header.

role_header_start instance-attribute

role_header_start: Tensor

The start of the role header.

system_role instance-attribute

system_role: Tensor

The system role.

user_role instance-attribute

user_role: Tensor

The user role.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

    text (str): The text to tokenize. Required.

Returns:

    torch.Tensor: An int64 tensor of token ids.