# llama

Mapper classes (designed to be compatible with `datasets.Dataset.map`) useful for building Llama 2 and Llama 3 prompts for Stained Glass Transform training and testing.
Classes:

| Name | Description |
|---|---|
| `Llama2InstructionTokenizerMapper` | Tokenizes and builds the intermediate tensor components of a prompt. |
| `Llama3ChatTokenizerMapper` | Tokenizes and builds the intermediate tensor components of a prompt. |
Attributes:

| Name | Type | Description |
|---|---|---|
| `LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE` | `str` | The Hugging Face chat template for Llama 2. |
| `LLAMA_2_SPECIAL_STRINGS` | `Final[InstructionSpecialStrings]` | Special string components of the Llama 2 prompt. |
| `LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE` | `str` | The Hugging Face chat template for Llama 3. |
| `LLAMA_3_SPECIAL_STRINGS` | `Final[ChatSpecialStrings]` | Special string components of the Llama 3 prompt. |
## `LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE` *(module attribute)*

```python
LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE = "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% elif USE_DEFAULT_PROMPT == true and not '<<SYS>>' in messages[0]['content'] %}{% set loop_messages = messages %}{% set system_message = 'DEFAULT_SYSTEM_MESSAGE' %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'system' %}{{ '<<SYS>>\\n' + content.strip() + '\\n<</SYS>>\\n\\n' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}"
```

The Hugging Face chat template for Llama 2. This template was removed from transformers in https://github.com/huggingface/transformers/pull/31733.
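The template's control flow is easy to lose in the Jinja source, so the following is a pure-Python sketch of the string it renders for the common case (an optional leading system message followed by alternating user/assistant turns). The function name and example messages are illustrative, not part of this module:

```python
def render_llama2_chat(messages, bos_token="<s>", eos_token="</s>"):
    """Mimic LLAMA_2_HUGGING_FACE_CHAT_TEMPLATE in plain Python (common case only)."""
    system_message = None
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]
    rendered = ""
    for index, message in enumerate(messages):
        content = message["content"]
        # The system prompt is folded into the first user turn.
        if index == 0 and system_message is not None:
            content = "<<SYS>>\n" + system_message + "\n<</SYS>>\n\n" + content
        if message["role"] == "user":
            rendered += bos_token + "[INST] " + content.strip() + " [/INST]"
        elif message["role"] == "assistant":
            rendered += " " + content.strip() + " " + eos_token
    return rendered

prompt = render_llama2_chat(
    [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
)
# prompt == "<s>[INST] <<SYS>>\nYou are helpful.\n<</SYS>>\n\nHi [/INST] Hello! </s>"
```

The real template additionally raises an exception when roles do not alternate; the sketch omits that check.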
## `LLAMA_2_SPECIAL_STRINGS` *(module attribute)*

```python
LLAMA_2_SPECIAL_STRINGS: Final[InstructionSpecialStrings] = InstructionSpecialStrings(
    INSTRUCTION_START="[INST]",
    SYSTEM_PROMPT_START="<<SYS>>",
    SYSTEM_PROMPT_END="<</SYS>>\n",
    CONTEXT_START="### INPUT:\n",
    INSTRUCTION_END="[/INST]",
)
```

Special string components of the Llama 2 prompt.
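As a rough illustration of how these markers delimit a Llama 2 instruction prompt, the snippet below assembles a prompt from plain strings mirroring the fields above. The helper function is hypothetical; the real mapper works on tokenized tensors, not strings:

```python
# Plain-string stand-ins for the InstructionSpecialStrings fields above.
INSTRUCTION_START = "[INST]"
SYSTEM_PROMPT_START = "<<SYS>>"
SYSTEM_PROMPT_END = "<</SYS>>\n"
CONTEXT_START = "### INPUT:\n"
INSTRUCTION_END = "[/INST]"

def build_instruction_prompt(instruction, system_prompt=None, context=None):
    """Hypothetical helper: interleave the special strings with prompt text."""
    parts = [INSTRUCTION_START, " "]
    if system_prompt is not None:
        parts += [SYSTEM_PROMPT_START, "\n", system_prompt, "\n", SYSTEM_PROMPT_END, "\n"]
    parts.append(instruction)
    if context is not None:
        parts += ["\n", CONTEXT_START, context]
    parts += [" ", INSTRUCTION_END]
    return "".join(parts)

prompt = build_instruction_prompt("Summarize:", context="Some document text.")
# prompt == "[INST] Summarize:\n### INPUT:\nSome document text. [/INST]"
```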
## `LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE` *(module attribute)*

```python
LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
```

The Hugging Face chat template for Llama 3. The tokenizer_config.json that Meta distributes for Llama 3 sets `"tokenizer_class": "PreTrainedTokenizerFast"`, which falls back to the default chat template. Since there is no Llama 3-specific tokenizer class, supply this string as the `chat_template` argument to `transformers.PreTrainedTokenizer.apply_chat_template` to apply the correct template.
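In other words, the intended call pattern is roughly `tokenizer.apply_chat_template(messages, chat_template=LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE)`. The sketch below reproduces the template's output in plain Python so the resulting layout is visible without loading a tokenizer (the function name and example messages are illustrative):

```python
def render_llama3_chat(messages, bos_token="<|begin_of_text|>", add_generation_prompt=False):
    """Mimic LLAMA_3_HUGGING_FACE_CHAT_TEMPLATE in plain Python."""
    rendered = ""
    for index, message in enumerate(messages):
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
        if index == 0:
            content = bos_token + content  # BOS is prepended to the first message only
        rendered += content
    if add_generation_prompt:
        # Leave the prompt open for the assistant's reply.
        rendered += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return rendered

prompt = render_llama3_chat(
    [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hi"},
    ],
    add_generation_prompt=True,
)
```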
## `LLAMA_3_SPECIAL_STRINGS` *(module attribute)*

```python
LLAMA_3_SPECIAL_STRINGS: Final[ChatSpecialStrings] = ChatSpecialStrings(
    ROLES=ChatRoleStrings(
        SYSTEM_ROLE="system",
        USER_ROLE="user",
        ASSISTANT_ROLE="assistant",
    ),
    ROLE_HEADER_START="<|start_header_id|>",
    ROLE_HEADER_END="<|end_header_id|>\n\n",
    MESSAGE_END="<|eot_id|>",
)
```

Special string components of the Llama 3 prompt. Based on: https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/.
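These markers also make it straightforward to go the other way and recover (role, content) pairs from an already rendered Llama 3 prompt. The helper below is a hypothetical illustration using plain-string stand-ins for the fields above:

```python
# Plain-string stand-ins for the ChatSpecialStrings fields above.
ROLE_HEADER_START = "<|start_header_id|>"
ROLE_HEADER_END = "<|end_header_id|>\n\n"
MESSAGE_END = "<|eot_id|>"

def split_llama3_messages(prompt):
    """Hypothetical helper: recover (role, content) pairs from a rendered prompt."""
    messages = []
    for chunk in prompt.split(MESSAGE_END):
        if ROLE_HEADER_START not in chunk:
            continue  # skip the BOS fragment and any trailing text
        header_and_body = chunk.split(ROLE_HEADER_START, 1)[1]
        role, body = header_and_body.split(ROLE_HEADER_END, 1)
        messages.append((role, body))
    return messages

pairs = split_llama3_messages(
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello!<|eot_id|>"
)
# pairs == [("user", "Hi"), ("assistant", "Hello!")]
```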
## `Llama2InstructionTokenizerMapper` *(dataclass)*

Bases: `InstructionTokenizerMapper`

Tokenizes and builds the intermediate tensor components of a prompt.
Classes:

| Name | Description |
|---|---|
| `PromptTokens` | Collection of all tokenized components of the prompt. |
| `SchemaTokens` | Tokenized intermediate prompt schema. |
| `SpecialTokens` | Tokenized special components of the prompt. |

Methods:

| Name | Description |
|---|---|
| `tokenize` | Tokenize the text. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `always_include_context` | `bool` | Whether to always include the start of context tokens in the prompt, even if no context is provided. |
| `special_tokens` | `SpecialTokens` | The tokenized special prompt strings. |
| `tokenizer` | `PreTrainedTokenizerBase` | The LLM tokenizer to use. |
### `always_include_context` *(class attribute, instance attribute)*

```python
always_include_context: bool = False
```

Whether to always include the start of context tokens in the prompt, even if no context is provided.
### `special_tokens` *(class attribute, instance attribute)*

```python
special_tokens: SpecialTokens = field(init=False)
```

The tokenized special prompt strings.
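The mapper's exact call signature is not reproduced on this page, but an object designed for `datasets.Dataset.map` typically implements `__call__(example) -> dict`. The toy below imitates that shape, producing `PromptTokens`-style nested dicts with plain Python lists instead of tensors and a stub tokenizer in place of `PreTrainedTokenizerBase`. Everything here besides the special strings and field names is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ToyInstructionMapper:
    """Toy stand-in for Llama2InstructionTokenizerMapper (lists instead of tensors)."""
    vocab: dict = field(default_factory=dict)

    def tokenize(self, text):
        # Stub tokenizer: assign each new whitespace-separated piece a fresh id.
        return [self.vocab.setdefault(piece, len(self.vocab)) for piece in text.split()]

    def __call__(self, example):
        # Shape mirrors PromptTokens: schema components plus special components.
        return {
            "schema_tokens": {
                "system_prompt": self.tokenize(example.get("system_prompt", "")),
                "instruction": self.tokenize(example["instruction"]),
                "context": self.tokenize(example.get("context", "")),
                "response": self.tokenize(example["response"]),
            },
            "special_tokens": {
                "instruction_start": self.tokenize("[INST]"),
                "instruction_end": self.tokenize("[/INST]"),
            },
        }

mapper = ToyInstructionMapper()
row = mapper({"instruction": "say hi", "response": "hi"})
# row["schema_tokens"]["instruction"] == [0, 1]
```

With the `datasets` library installed, such a mapper would be applied as `dataset.map(mapper)`.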
### `PromptTokens`

Bases: `TypedDict`

Collection of all tokenized components of the prompt.
Attributes:

| Name | Type | Description |
|---|---|---|
| `schema_tokens` | `SchemaTokens` | The tokenized schema components of the prompt. |
| `special_tokens` | `SpecialTokens` | The tokenized special components of the prompt. |
#### `schema_tokens` *(instance attribute)*

```python
schema_tokens: SchemaTokens
```

The tokenized schema components of the prompt.
#### `special_tokens` *(instance attribute)*

```python
special_tokens: SpecialTokens
```

The tokenized special components of the prompt.
### `SchemaTokens`

Bases: `TypedDict`

Tokenized intermediate prompt schema.
Attributes:

| Name | Type | Description |
|---|---|---|
| `context` | `Tensor` | An optional context to append to the instruction. |
| `instruction` | `Tensor` | The input to the model. |
| `response` | `Tensor` | The expected model response to the instruction. |
| `system_prompt` | `Tensor` | An optional system prompt for the model. |
### `SpecialTokens`

Bases: `TypedDict`

Tokenized special components of the prompt.
Attributes:

| Name | Type | Description |
|---|---|---|
| `bos` | `Tensor` | The beginning-of-string token. |
| `context_start` | `Tensor` | The delimiter between the instruction and the context. |
| `eos` | `Tensor` | The end-of-string token. |
| `instruction_end` | `Tensor` | The end of the instruction tag. |
| `instruction_start` | `Tensor` | The start of the instruction tag. |
| `system_prompt_end` | `Tensor` | The end of the system prompt. |
| `system_prompt_start` | `Tensor` | The start of the system prompt. |
## `Llama3ChatTokenizerMapper` *(dataclass)*

Bases: `ChatTokenizerMapper`

Tokenizes and builds the intermediate tensor components of a prompt.

*Added in version 0.77.0.*
Classes:

| Name | Description |
|---|---|
| `PromptTokens` | Collection of all tokenized components of the prompt. |
| `SchemaTokens` | Tokenized intermediate prompt schema. |
| `SpecialTokens` | Tokenized special components of the prompt. |

Methods:

| Name | Description |
|---|---|
| `tokenize` | Tokenize the text. |

Attributes:

| Name | Type | Description |
|---|---|---|
| `special_tokens` | `SpecialTokens` | The tokenized special prompt strings. |
| `tokenizer` | `PreTrainedTokenizerBase` | The LLM tokenizer to use. |
### `special_tokens` *(class attribute, instance attribute)*

```python
special_tokens: SpecialTokens = field(init=False)
```

The tokenized special prompt strings.
### `PromptTokens`

Bases: `TypedDict`

Collection of all tokenized components of the prompt.
Attributes:

| Name | Type | Description |
|---|---|---|
| `schema_tokens` | `list[SchemaTokens]` | The tokenized schema components of the prompt. |
| `special_tokens` | `SpecialTokens` | The tokenized special components of the prompt. |
#### `schema_tokens` *(instance attribute)*

```python
schema_tokens: list[SchemaTokens]
```

The tokenized schema components of the prompt.
#### `special_tokens` *(instance attribute)*

```python
special_tokens: SpecialTokens
```

The tokenized special components of the prompt.
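The chief structural difference from the Llama 2 mapper is that `schema_tokens` is a list with one entry per chat turn, rather than a single dict. A toy sketch of that shape, using plain lists instead of tensors and a stub tokenizer; the per-turn keys shown are hypothetical, since the Llama 3 `SchemaTokens` fields are not documented on this page:

```python
vocab = {}

def toy_tokenize(text):
    """Stub tokenizer: one id per distinct whitespace-separated piece."""
    return [vocab.setdefault(piece, len(vocab)) for piece in text.split()]

# One SchemaTokens-like dict per chat turn, collected into a list
# (the real mapper stores tensors, and its exact keys may differ).
schema_tokens = [
    {"role": toy_tokenize(message["role"]), "content": toy_tokenize(message["content"])}
    for message in [
        {"role": "user", "content": "hi there"},
        {"role": "assistant", "content": "hi"},
    ]
]
# len(schema_tokens) == 2
```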
### `SchemaTokens`
### `SpecialTokens`

Bases: `TypedDict`

Tokenized special components of the prompt.
Name | Type | Description |
---|---|---|
assistant_role |
Tensor
|
The assistant role. |
bos |
Tensor
|
The beginning of string token. |
message_end |
Tensor
|
The end of a message. |
role_header_end |
Tensor
|
The end of the role header. |
role_header_start |
Tensor
|
The start of the role header. |
system_role |
Tensor
|
The system role. |
user_role |
Tensor
|
The user role. |