
mistral

Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building Mistral prompts for Stained Glass Transform training and testing.

Classes:

- MistralInstructionTokenizerMapper: Tokenizes and builds the intermediate tensor components of a prompt.

Attributes:

- MISTRAL_SPECIAL_STRINGS (Final[InstructionSpecialStrings]): Special string components of the Mistral prompt.

MISTRAL_SPECIAL_STRINGS module-attribute

MISTRAL_SPECIAL_STRINGS: Final[
    InstructionSpecialStrings
] = InstructionSpecialStrings(
    INSTRUCTION_START="[INST]",
    SYSTEM_PROMPT_START="",
    SYSTEM_PROMPT_END="",
    CONTEXT_START="###",
    INSTRUCTION_END="[/INST]",
)

Special string components of the Mistral prompt.

Based on: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt is structured as follows:

<s>[INST] {{ user_message }} [/INST]

Mistral makes no distinction between system prompts, instructions, or bodies/context in its instruction-tuning format. Delimiters such as ### and <<>> are used to mark boundaries between sections of text. See https://docs.mistral.ai/guides/prompting-capabilities/ and https://www.promptingguide.ai/models/mistral-7b#mistral-7b-instruct.
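As an illustrative sketch (not the library's implementation), the single-turn template above can be assembled from the special strings in MISTRAL_SPECIAL_STRINGS; the `build_prompt` helper below is hypothetical:

```python
# Hypothetical helper (not part of the library) showing how the special
# strings above compose a single-turn Mistral prompt. The <s> (BOS) token
# is added by the tokenizer, not as literal text.
INSTRUCTION_START = "[INST]"
INSTRUCTION_END = "[/INST]"
CONTEXT_START = "###"

def build_prompt(instruction: str, context: str = "") -> str:
    # Append the optional ###-delimited context after the instruction.
    body = instruction if not context else f"{instruction} {CONTEXT_START} {context}"
    return f"{INSTRUCTION_START} {body} {INSTRUCTION_END}"

print(build_prompt("Summarize the report."))
# [INST] Summarize the report. [/INST]
```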

MistralInstructionTokenizerMapper dataclass

Bases: InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

Classes:

- PromptTokens: Collection of all tokenized components of the prompt.
- SchemaTokens: Tokenized intermediate prompt schema.
- SpecialTokens: Tokenized special components of the prompt.

Methods:

- tokenize: Tokenize the text.

Attributes:

- always_include_context (bool): Whether to always include the start of context tokens in the prompt, even if no context is provided.
- special_tokens (SpecialTokens): The tokenized special prompt strings.
- tokenizer (PreTrainedTokenizerBase): The LLM tokenizer to use.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

tokenizer: PreTrainedTokenizerBase

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

- schema_tokens (SchemaTokens): The tokenized schema components of the prompt.
- special_tokens (SpecialTokens): The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

- context (Tensor): An optional context to append to the instruction.
- instruction (Tensor): The input to the model.
- response (Tensor): The expected model response to the instruction.
- system_prompt (Tensor): An optional system prompt for the model.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.
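To make the schema concrete, here is a sketch (assumed shapes, not the library's code) of how the four SchemaTokens fields hold separately tokenized pieces of one training sample; plain lists of ids stand in for int64 torch.Tensors, and `fake_tokenize` is a stand-in tokenizer:

```python
# Sketch (assumed, not the library's code): plain lists of ids stand in
# for the int64 torch.Tensors the real SchemaTokens holds.
from typing import TypedDict

class SchemaTokens(TypedDict):
    system_prompt: list[int]
    instruction: list[int]
    context: list[int]
    response: list[int]

def fake_tokenize(text: str) -> list[int]:
    # Stand-in tokenizer: one id per whitespace-separated piece.
    return [hash(piece) % 1000 for piece in text.split()]

sample = SchemaTokens(
    system_prompt=fake_tokenize(""),   # optional; empty here
    instruction=fake_tokenize("Translate to French: hello"),
    context=fake_tokenize(""),         # optional; empty here
    response=fake_tokenize("bonjour"),
)
print(len(sample["instruction"]))  # 4
```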

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

- bos (Tensor): The beginning of string token.
- context_start (Tensor): The delimiter between the instruction and the context.
- eos (Tensor): The end of string token.
- instruction_end (Tensor): The end of the instruction tag.
- instruction_start (Tensor): The start of the instruction tag.
- system_prompt_end (Tensor): The end of the system prompt.
- system_prompt_start (Tensor): The start of the system prompt.

bos instance-attribute

bos: Tensor

The beginning of string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end of string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

- text (str): The text to tokenize. Required.

Returns:

- torch.Tensor: An int64 tensor of token ids.
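A minimal sketch of what tokenize might do, assuming a Hugging Face-style tokenizer; `TinyTokenizer` is a stand-in so the example is self-contained, and the real method returns an int64 torch.Tensor rather than a list:

```python
# Hedged sketch of a tokenize method built on a Hugging Face-style
# tokenizer interface. TinyTokenizer is a stand-in; the real class uses
# a PreTrainedTokenizerBase and returns an int64 torch.Tensor.
class TinyTokenizer:
    vocab = {"[INST]": 3, "[/INST]": 4, "hello": 5, "world": 6}

    def encode(self, text: str, add_special_tokens: bool = False) -> list[int]:
        return [self.vocab[piece] for piece in text.split()]

def tokenize(tokenizer: TinyTokenizer, text: str) -> list[int]:
    # No BOS/EOS here: the special tokens are tokenized separately and
    # stitched in when the full prompt tensor is assembled.
    return tokenizer.encode(text, add_special_tokens=False)

print(tokenize(TinyTokenizer(), "[INST] hello world [/INST]"))  # [3, 5, 6, 4]
```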

MistralMultiturnTransformLayerMapper

Based on https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

The prompt for multiturn chats is structured as follows:

<s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s>[INST] {{ user_message }} [/INST] {{ assistant_message }}</s> ...

Added in version 0.87.0.

Changed in version 0.97.0: Messages can optionally start with a system prompt.
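The multiturn template above can be sketched with a hypothetical helper (not the library's mapper) that renders role-tagged messages; how a leading system message is folded in is an assumption not shown here:

```python
# Hypothetical helper (not the library's code) rendering the multiturn
# template: user turns are wrapped in [INST]...[/INST], and each
# assistant turn is closed by </s>.
def build_multiturn(messages: list[dict[str, str]]) -> str:
    prompt = "<s>"
    for message in messages:
        if message["role"] == "user":
            prompt += f"[INST] {message['content']} [/INST]"
        else:  # assistant
            prompt += f" {message['content']}</s>"
    return prompt

chat = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
print(build_multiturn(chat))
# <s>[INST] Hi [/INST] Hello!</s>[INST] Bye [/INST]
```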

Methods:

- __call__: Tokenizes and builds the intermediate tensor components of a multiturn prompt.
- __init__: Tokenizes and builds tensors for TransformLayer from multiturn messages.

__call__

__call__(
    sample: Sequence[Schema],
) -> universal.UndifferentiatedTransformLayerInput

Tokenizes and builds the intermediate tensor components of a multiturn prompt.

Parameters:

- sample (Sequence[Schema]): A sequence of messages in the conversation. This argument name is used for consistency with other Mapper classes. Required.

Returns:

- universal.UndifferentiatedTransformLayerInput: The intermediate tensor components of the multiturn prompt.
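Since these mappers are designed to be compatible with datasets.Dataset.map, the calling convention can be sketched with a stand-in mapper and an in-memory "dataset" (the real __call__ returns the tensor components described above; `stand_in_mapper` and the column contents are assumptions):

```python
# Assumed usage pattern: a mapper instance is a callable applied to each
# sample, mirroring datasets.Dataset.map. A stand-in mapper keeps the
# sketch runnable without the datasets library.
def stand_in_mapper(sample):
    # The real mapper returns tensor components; here we just count turns.
    return {"num_turns": len(sample)}

dataset = [
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
]
mapped = [stand_in_mapper(conversation) for conversation in dataset]
print(mapped)  # [{'num_turns': 2}]
```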

__init__

Tokenizes and builds tensors for TransformLayer from multiturn messages.

Parameters:

- tokenizer (PreTrainedTokenizerBase): The tokenizer to use for tokenization. Required.