mistral

Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building Mistral prompts for Stained Glass Transform training and testing.

MISTRAL_SPECIAL_STRINGS module-attribute

MISTRAL_SPECIAL_STRINGS: Final[InstructionSpecialStrings] = InstructionSpecialStrings(INSTRUCTION_START='[INST]', SYSTEM_PROMPT_START='', SYSTEM_PROMPT_END='', CONTEXT_START='###', INSTRUCTION_END='[/INST]')

Special string components of the Mistral prompt.

Based on: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt is structured as follows:
<s>[INST] {{ user_message }} [/INST]

Mistral makes no distinction between system prompts, instructions, or bodies/context in its instruction-tuning format. Characters like ### and <<>> are used to mark boundaries between sections of text.

See: https://docs.mistral.ai/guides/prompting-capabilities/ and https://www.promptingguide.ai/models/mistral-7b#mistral-7b-instruct
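As a minimal sketch of the template above, a single-turn prompt can be assembled by joining the special strings around the user message. The function name and the literal `<s>` are illustrative only; in practice the BOS token is added by the tokenizer rather than as text.

```python
# Illustrative assembly of a single-turn Mistral prompt from the special
# strings above. "<s>" is written out literally here; the real pipeline
# adds BOS as a token id, not as text.
INSTRUCTION_START = "[INST]"
INSTRUCTION_END = "[/INST]"
CONTEXT_START = "###"


def build_prompt(instruction: str, context: str = "") -> str:
    """Join the instruction (and optional context) with the special strings."""
    body = instruction if not context else f"{instruction} {CONTEXT_START} {context}"
    return f"<s>{INSTRUCTION_START} {body} {INSTRUCTION_END}"


print(build_prompt("Summarize this document."))
# -> <s>[INST] Summarize this document. [/INST]
```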

MistralInstructionTokenizerMapper dataclass

Bases: InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

bos instance-attribute

bos: Tensor

The beginning of string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end of string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | The text to tokenize. | required |

Returns:

| Type | Description |
| --- | --- |
| torch.Tensor | An int64 tensor of token ids. |
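Conceptually, tokenize maps a string to a 1-D sequence of token ids. The toy sketch below uses a hypothetical whitespace vocabulary and plain lists in place of torch tensors to stay dependency-free; it is not the library's actual implementation.

```python
# Toy illustration of tokenization: map whitespace-separated pieces to ids.
# The real method returns a torch int64 tensor and uses a trained
# subword tokenizer; this hypothetical vocabulary is for illustration only.
VOCAB = {"[INST]": 3, "[/INST]": 4, "hello": 5, "world": 6}


def tokenize(text: str) -> list[int]:
    """Return token ids for each whitespace-separated piece of `text`."""
    return [VOCAB[piece] for piece in text.split()]


print(tokenize("[INST] hello world [/INST]"))
# -> [3, 5, 6, 4]
```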

MistralMultiturnTransformLayerMapper

Based on https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt for multiturn chats is structured as follows:
<s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s> ...

Added in version 0.87.0.

Changed in version 0.97.0: Messages can optionally start with a system prompt.
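The multiturn template above can be sketched as follows: each (user, assistant) turn is wrapped as "[INST] user [/INST] assistant</s>", with a single leading "<s>". The function and variable names here are illustrative, and BOS/EOS are written out as text rather than as token ids.

```python
# Minimal sketch of the multiturn Mistral template: one "<s>" at the
# start, then each turn rendered as "[INST] user [/INST] assistant</s>".
# Names are illustrative; the real mapper works on token tensors.
def build_multiturn_prompt(turns: list[tuple[str, str]]) -> str:
    """Render (user_message, assistant_message) pairs in the Mistral format."""
    parts = [f"[INST] {user} [/INST] {assistant}</s>" for user, assistant in turns]
    return "<s>" + "".join(parts)


print(build_multiturn_prompt([("Hi", "Hello!"), ("Bye", "Goodbye!")]))
# -> <s>[INST] Hi [/INST] Hello!</s>[INST] Bye [/INST] Goodbye!</s>
```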

__call__

__call__(sample: Sequence[Schema]) -> universal.UndifferentiatedTransformLayerInput

Tokenizes and builds the intermediate tensor components of a multiturn prompt.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample | Sequence[Schema] | A sequence of messages in the conversation. This argument name is used for consistency with other Mapper classes. | required |

Returns:

| Type | Description |
| --- | --- |
| universal.UndifferentiatedTransformLayerInput | The intermediate tensor components of the multiturn prompt. |

__init__

__init__(tokenizer: PreTrainedTokenizerBase) -> None

Tokenizes and builds tensors for TransformLayer from multiturn messages.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| tokenizer | PreTrainedTokenizerBase | The tokenizer to use for tokenization. | required |