mistral

Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building Mistral prompts for Stained Glass Transform training and testing.

MISTRAL_SPECIAL_STRINGS module-attribute

MISTRAL_SPECIAL_STRINGS: Final[InstructionSpecialStrings] = InstructionSpecialStrings(INSTRUCTION_START='[INST]', SYSTEM_PROMPT_START='', SYSTEM_PROMPT_END='', CONTEXT_START='###', INSTRUCTION_END='[/INST]')

Special string components of the Mistral prompt.

Based on: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt is structured as follows:
<s>[INST] {{ user_message }} [/INST]

Mistral makes no distinction between system prompts, instructions, or bodies/context in its instruction-tuning format. Characters like ### and <<>> are used to mark boundaries between sections of text.

See: https://docs.mistral.ai/guides/prompting-capabilities/ and https://www.promptingguide.ai/models/mistral-7b#mistral-7b-instruct
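As a minimal sketch of the template above, a single-turn prompt can be assembled by joining the special strings around the user message. The function name and the literal `<s>` are illustrative only; in practice the BOS token is added by the tokenizer rather than as text.

```python
# Illustrative assembly of a single-turn Mistral prompt from the special
# strings above. "<s>" is written out literally here; the real pipeline
# adds BOS as a token id, not as text.
INSTRUCTION_START = "[INST]"
INSTRUCTION_END = "[/INST]"
CONTEXT_START = "###"


def build_prompt(instruction: str, context: str = "") -> str:
    """Join the instruction (and optional context) with the special strings."""
    body = instruction if not context else f"{instruction} {CONTEXT_START} {context}"
    return f"<s>{INSTRUCTION_START} {body} {INSTRUCTION_END}"


print(build_prompt("Summarize this document."))
# -> <s>[INST] Summarize this document. [/INST]
```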

MistralInstructionTokenizerMapper dataclass

Bases: InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

bos instance-attribute

bos: Tensor

The beginning of string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end of string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to tokenize. | required |

Returns:

| Type | Description |
| --- | --- |
| torch.Tensor | An int64 tensor of token ids. |
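Conceptually, tokenize maps a string to a 1-D sequence of token ids. The toy sketch below uses a hypothetical whitespace vocabulary and plain lists in place of torch tensors to stay dependency-free; it is not the library's actual implementation.

```python
# Toy illustration of tokenization: map whitespace-separated pieces to ids.
# The real method returns a torch int64 tensor and uses a trained
# subword tokenizer; this hypothetical vocabulary is for illustration only.
VOCAB = {"[INST]": 3, "[/INST]": 4, "hello": 5, "world": 6}


def tokenize(text: str) -> list[int]:
    """Return token ids for each whitespace-separated piece of `text`."""
    return [VOCAB[piece] for piece in text.split()]


print(tokenize("[INST] hello world [/INST]"))
# -> [3, 5, 6, 4]
```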

MistralMultiturnTransformLayerMapper

Based on https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt for multiturn chats is structured as follows:
<s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s> ...

Added in version 0.87.0.

Changed in version 0.97.0: Messages can optionally start with a system prompt.
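The multiturn template above can be sketched as follows: each (user, assistant) turn is wrapped as "[INST] user [/INST] assistant</s>", with a single leading "<s>". The function and variable names here are illustrative, and BOS/EOS are written out as text rather than as token ids.

```python
# Minimal sketch of the multiturn Mistral template: one "<s>" at the
# start, then each turn rendered as "[INST] user [/INST] assistant</s>".
# Names are illustrative; the real mapper works on token tensors.
def build_multiturn_prompt(turns: list[tuple[str, str]]) -> str:
    """Render (user_message, assistant_message) pairs in the Mistral format."""
    parts = [f"[INST] {user} [/INST] {assistant}</s>" for user, assistant in turns]
    return "<s>" + "".join(parts)


print(build_multiturn_prompt([("Hi", "Hello!"), ("Bye", "Goodbye!")]))
# -> <s>[INST] Hi [/INST] Hello!</s>[INST] Bye [/INST] Goodbye!</s>
```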

__call__

__call__(sample: Sequence[Schema]) -> universal.UndifferentiatedTransformLayerInput

Tokenizes and builds the intermediate tensor components of a multiturn prompt.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample | Sequence[Schema] | A sequence of messages in the conversation. This argument name is used for consistency with other Mapper classes. | required |

Returns:

| Type | Description |
| --- | --- |
| universal.UndifferentiatedTransformLayerInput | The intermediate tensor components of the multiturn prompt. |

__init__

__init__(tokenizer: PreTrainedTokenizerBase) -> None

Tokenizes and builds tensors for TransformLayer from multiturn messages.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase | The tokenizer to use for tokenization. | required |