
hermes

Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building Hermes prompts for Stained Glass Transform training and testing.

Classes:

  HermesChatTokenizerMapper: Tokenizes and builds the intermediate tensor components of a prompt.

Attributes:

  HERMES_SPECIAL_STRINGS (Final[ChatSpecialStrings]): Special string components of the Hermes prompt.

HERMES_SPECIAL_STRINGS module-attribute

HERMES_SPECIAL_STRINGS: Final[ChatSpecialStrings] = (
    ChatSpecialStrings(
        ROLES=ChatRoleStrings(
            SYSTEM_ROLE="system",
            USER_ROLE="user",
            ASSISTANT_ROLE="assistant",
        ),
        ROLE_HEADER_START="<|im_start|>",
        ROLE_HEADER_END="\n",
        MESSAGE_END="<|im_end|>\n",
    )
)

Special string components of the Hermes prompt.

Based on the Hugging Face Hub chat template for 'NousResearch/Hermes-3-Llama-3.1-8B'.

The prompt is structured according to the following Jinja chat template (the line breaks inside the quoted strings are literal newline characters in the template):
{{bos_token}}{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system
You are a helpful assistant.<|im_end|>
' }}{% endif %}{{'<|im_start|>' + message['role'] + '
' + message['content'] + '<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
' }}{% endif %}
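
As a purely illustrative example of what this template produces, the special strings above can be composed into the prompt body shown in the comments below. This sketch assumes the ChatSpecialStrings and ChatRoleStrings fields are readable as attributes and uses a hypothetical import path; note that the real template also prepends the tokenizer's bos_token before the first message.

from hermes import HERMES_SPECIAL_STRINGS  # hypothetical import path

s = HERMES_SPECIAL_STRINGS

# "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
# "<|im_start|>user\nHi!<|im_end|>\n"
# "<|im_start|>assistant\n"
prompt_body = (
    s.ROLE_HEADER_START + s.ROLES.SYSTEM_ROLE + s.ROLE_HEADER_END
    + "You are a helpful assistant." + s.MESSAGE_END
    + s.ROLE_HEADER_START + s.ROLES.USER_ROLE + s.ROLE_HEADER_END
    + "Hi!" + s.MESSAGE_END
    + s.ROLE_HEADER_START + s.ROLES.ASSISTANT_ROLE + s.ROLE_HEADER_END
)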

HermesChatTokenizerMapper dataclass

Bases: ChatTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Added in version 0.104.0.
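
Since the mapper is designed to be compatible with datasets.Dataset.map, a minimal usage sketch could look like the following. The import path and the input column layout (a "messages" column of role/content dicts) are assumptions made only for illustration; this page does not document the exact input schema the mapper expects.

from datasets import Dataset
from transformers import AutoTokenizer

from hermes import HermesChatTokenizerMapper  # hypothetical import path

mapper = HermesChatTokenizerMapper(
    tokenizer=AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.1-8B")
)  # special_tokens is populated internally (field(init=False))

# Assumed input layout, purely for illustration.
dataset = Dataset.from_dict(
    {"messages": [[{"role": "user", "content": "Hello!"}]]}
)
tokenized = dataset.map(mapper)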

Classes:

  PromptTokens: Collection of all tokenized components of the prompt.
  SchemaTokens: Tokenized intermediate prompt schema.
  SpecialTokens: Tokenized special components of the prompt.

Methods:

  tokenize: Tokenize the text.

Attributes:

  special_tokens (SpecialTokens): The tokenized special prompt strings.
  tokenizer (PreTrainedTokenizerBase): The LLM tokenizer to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

  schema_tokens (list[SchemaTokens]): The tokenized schema components of the prompt.
  special_tokens (SpecialTokens): The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.
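
As a rough illustration of this structure (not necessarily the mapper's actual output), a PromptTokens dictionary for a single-message chat could be assembled by hand as below, with one SchemaTokens entry per message and the mapper's tokenized special strings attached. The import path is hypothetical and mirrors the sketch under the class description above.

from transformers import AutoTokenizer

from hermes import HermesChatTokenizerMapper  # hypothetical import path

mapper = HermesChatTokenizerMapper(
    tokenizer=AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.1-8B")
)

# One SchemaTokens entry per chat message, plus the shared SpecialTokens.
prompt_tokens: HermesChatTokenizerMapper.PromptTokens = {
    "schema_tokens": [
        {"role": mapper.tokenize("user"), "content": mapper.tokenize("Hello!")},
    ],
    "special_tokens": mapper.special_tokens,
}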

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

  content (Tensor): The content of the message.
  role (Tensor): The role of the message.

content instance-attribute

content: Tensor

The content of the message.

role instance-attribute

role: Tensor

The role of the message.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

  assistant_role (Tensor): The assistant role.
  bos (Tensor): The beginning of string token.
  message_end (Tensor): The end of a message.
  role_header_end (Tensor): The end of the role header.
  role_header_start (Tensor): The start of the role header.
  system_role (Tensor): The system role.
  user_role (Tensor): The user role.

assistant_role instance-attribute

assistant_role: Tensor

The assistant role.

bos instance-attribute

bos: Tensor

The beginning of string token.

message_end instance-attribute

message_end: Tensor

The end of a message.

role_header_end instance-attribute

role_header_end: Tensor

The end of the role header.

role_header_start instance-attribute

role_header_start: Tensor

The start of the role header.

system_role instance-attribute

system_role: Tensor

The system role.

user_role instance-attribute

user_role: Tensor

The user role.
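
To make the role of these fields concrete, here is a hedged sketch of how one user turn of the Hermes prompt (<|im_start|>user, newline, message content, <|im_end|>, newline) could be stitched together from these tensors. Whether the mapper concatenates them in exactly this way is an assumption; the field-to-string mapping follows HERMES_SPECIAL_STRINGS above.

import torch

def user_turn(special: dict[str, torch.Tensor], content: torch.Tensor) -> torch.Tensor:
    # "<|im_start|>" + "user" + "\n" + message content + "<|im_end|>\n"
    return torch.cat(
        [
            special["role_header_start"],
            special["user_role"],
            special["role_header_end"],
            content,
            special["message_end"],
        ]
    )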

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

  text (str): The text to tokenize. Required.

Returns:

  torch.Tensor: An int64 tensor of token ids.
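
A minimal call sketch, using the tokenizer referenced elsewhere on this page and a hypothetical import path; the resulting token id values depend entirely on the tokenizer passed to the mapper.

import torch
from transformers import AutoTokenizer

from hermes import HermesChatTokenizerMapper  # hypothetical import path

mapper = HermesChatTokenizerMapper(
    tokenizer=AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.1-8B")
)
token_ids = mapper.tokenize("<|im_start|>user\n")
assert isinstance(token_ids, torch.Tensor) and token_ids.dtype == torch.int64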