
mistral

Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building Mistral prompts for Stained Glass Transform training and testing.

Classes:

- MistralInstructionTokenizerMapper: Tokenizes and builds the intermediate tensor components of a prompt.

Attributes:

- MISTRAL_SPECIAL_STRINGS (Final[InstructionSpecialStrings]): Special string components of the Mistral prompt.

MISTRAL_SPECIAL_STRINGS module-attribute

MISTRAL_SPECIAL_STRINGS: Final[
    InstructionSpecialStrings
] = InstructionSpecialStrings(
    INSTRUCTION_START="[INST]",
    SYSTEM_PROMPT_START="",
    SYSTEM_PROMPT_END="",
    CONTEXT_START="###",
    INSTRUCTION_END="[/INST]",
)

Special string components of the Mistral prompt.

Based on: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt is structured as follows:

<s>[INST] {{ user_message }} [/INST]

Mistral makes no distinction between system prompts, instructions, or bodies/context in its instruction-tuning format. Delimiters such as ### and <<>> are used to mark boundaries between sections of text. See https://docs.mistral.ai/guides/prompting-capabilities/ and https://www.promptingguide.ai/models/mistral-7b#mistral-7b-instruct.
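As an illustrative sketch (not the library's implementation), the single-turn template above can be assembled from the special strings in MISTRAL_SPECIAL_STRINGS; the `build_prompt` helper below is hypothetical:

```python
# Hypothetical helper (not part of the library) showing how the special
# strings above compose a single-turn Mistral prompt. The <s> (BOS) token
# is added by the tokenizer, not as literal text.
INSTRUCTION_START = "[INST]"
INSTRUCTION_END = "[/INST]"
CONTEXT_START = "###"

def build_prompt(instruction: str, context: str = "") -> str:
    # Append the optional ###-delimited context after the instruction.
    body = instruction if not context else f"{instruction} {CONTEXT_START} {context}"
    return f"{INSTRUCTION_START} {body} {INSTRUCTION_END}"

print(build_prompt("Summarize the report."))
# [INST] Summarize the report. [/INST]
```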

MistralInstructionTokenizerMapper dataclass

Bases: InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Classes:

- PromptTokens: Collection of all tokenized components of the prompt.
- SchemaTokens: Tokenized intermediate prompt schema.
- SpecialTokens: Tokenized special components of the prompt.

Methods:

- tokenize: Tokenize the text.

Attributes:

- always_include_context (bool): Whether to always include the start of context tokens in the prompt, even if no context is provided.
- special_tokens (SpecialTokens): The tokenized special prompt strings.
- tokenizer (PreTrainedTokenizerBase): The LLM tokenizer to use.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

tokenizer: PreTrainedTokenizerBase

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

- schema_tokens (SchemaTokens): The tokenized schema components of the prompt.
- special_tokens (SpecialTokens): The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

- context (Tensor): An optional context to append to the instruction.
- instruction (Tensor): The input to the model.
- response (Tensor): The expected model response to the instruction.
- system_prompt (Tensor): An optional system prompt for the model.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.
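To make the schema concrete, here is a sketch (assumed shapes, not the library's code) of how the four SchemaTokens fields hold separately tokenized pieces of one training sample; plain lists of ids stand in for int64 torch.Tensors, and `fake_tokenize` is a stand-in tokenizer:

```python
# Sketch (assumed, not the library's code): plain lists of ids stand in
# for the int64 torch.Tensors the real SchemaTokens holds.
from typing import TypedDict

class SchemaTokens(TypedDict):
    system_prompt: list[int]
    instruction: list[int]
    context: list[int]
    response: list[int]

def fake_tokenize(text: str) -> list[int]:
    # Stand-in tokenizer: one id per whitespace-separated piece.
    return [hash(piece) % 1000 for piece in text.split()]

sample = SchemaTokens(
    system_prompt=fake_tokenize(""),   # optional; empty here
    instruction=fake_tokenize("Translate to French: hello"),
    context=fake_tokenize(""),         # optional; empty here
    response=fake_tokenize("bonjour"),
)
print(len(sample["instruction"]))  # 4
```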

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

- bos (Tensor): The beginning of string token.
- context_start (Tensor): The delimiter between the instruction and the context.
- eos (Tensor): The end of string token.
- instruction_end (Tensor): The end of the instruction tag.
- instruction_start (Tensor): The start of the instruction tag.
- system_prompt_end (Tensor): The end of the system prompt.
- system_prompt_start (Tensor): The start of the system prompt.

bos instance-attribute

bos: Tensor

The beginning of string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end of string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

- text (str): The text to tokenize. Required.

Returns:

- torch.Tensor: An int64 tensor of token ids.
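A minimal sketch of what tokenize might do, assuming a Hugging Face-style tokenizer; `TinyTokenizer` is a stand-in so the example is self-contained, and the real method returns an int64 torch.Tensor rather than a list:

```python
# Hedged sketch of a tokenize method built on a Hugging Face-style
# tokenizer interface. TinyTokenizer is a stand-in; the real class uses
# a PreTrainedTokenizerBase and returns an int64 torch.Tensor.
class TinyTokenizer:
    vocab = {"[INST]": 3, "[/INST]": 4, "hello": 5, "world": 6}

    def encode(self, text: str, add_special_tokens: bool = False) -> list[int]:
        return [self.vocab[piece] for piece in text.split()]

def tokenize(tokenizer: TinyTokenizer, text: str) -> list[int]:
    # No BOS/EOS here: the special tokens are tokenized separately and
    # stitched in when the full prompt tensor is assembled.
    return tokenizer.encode(text, add_special_tokens=False)

print(tokenize(TinyTokenizer(), "[INST] hello world [/INST]"))  # [3, 5, 6, 4]
```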

MistralMultiturnTransformLayerMapper

Based on https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt for multiturn chats is structured as follows:

<s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s> ...

Added in version 0.87.0.

Changed in version 0.97.0: Messages can optionally start with a system prompt.
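The multiturn template above can be sketched with a hypothetical helper (not the library's mapper) that renders role-tagged messages; how a leading system message is folded in is an assumption not shown here:

```python
# Hypothetical helper (not the library's code) rendering the multiturn
# template: user turns are wrapped in [INST]...[/INST], and each
# assistant turn is closed by </s>.
def build_multiturn(messages: list[dict[str, str]]) -> str:
    prompt = "<s>"
    for message in messages:
        if message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        else:  # assistant
            prompt += f" {message['content']}</s>"
    return prompt

chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
print(build_multiturn(chat))
# <s>[INST] Hi [/INST] Hello!</s>[INST] Bye [/INST]
```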

Methods:

- __call__: Tokenizes and builds the intermediate tensor components of a multiturn prompt.
- __init__: Tokenizes and builds tensors for TransformLayer from multiturn messages.

__call__

__call__(
    sample: Sequence[Schema],
) -> universal.UndifferentiatedTransformLayerInput

Tokenizes and builds the intermediate tensor components of a multiturn prompt.

Parameters:

- sample (Sequence[Schema]): A sequence of messages in the conversation. This argument name is used for consistency with other Mapper classes. Required.

Returns:

- universal.UndifferentiatedTransformLayerInput: The intermediate tensor components of the multiturn prompt.
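Since these mappers are designed to be compatible with datasets.Dataset.map, the calling convention can be sketched with a stand-in mapper and an in-memory "dataset" (the real __call__ returns the tensor components described above; `stand_in_mapper` and the column contents are assumptions):

```python
# Assumed usage pattern: a mapper instance is a callable applied to each
# sample, mirroring datasets.Dataset.map. A stand-in mapper keeps the
# sketch runnable without the datasets library.
def stand_in_mapper(sample):
    # The real mapper returns tensor components; here we just count turns.
    return {"num_turns": len(sample)}

dataset = [
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
]
mapped = [stand_in_mapper(conversation) for conversation in dataset]
print(mapped)  # [{'num_turns': 2}]
```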

__init__

Tokenizes and builds tensors for TransformLayer from multiturn messages.

Parameters:

- tokenizer (PreTrainedTokenizerBase): The tokenizer to use for tokenization. Required.