mistral
Mapper classes (designed to be compatible with `datasets.Dataset.map`) useful for building Mistral prompts for Stained Glass Transform training and testing.
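These mappers are meant to be handed to `datasets.Dataset.map`, which calls the mapper on each example dict and merges the returned dict back into the dataset. The toy mapper below only illustrates that calling convention; it is a stand-in, not one of this module's classes:

```python
def toy_mapper(example: dict) -> dict:
    """Stand-in for a prompt mapper: derives a new field from the example."""
    example["n_words"] = len(example["text"].split())
    return example


# datasets.Dataset.map applies the mapper row by row; a plain loop shows the shape.
rows = [{"text": "hello world"}, {"text": "one two three"}]
mapped = [toy_mapper(dict(r)) for r in rows]
# mapped[0]["n_words"] == 2 and mapped[1]["n_words"] == 3
```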
Classes:

| Name | Description |
| --- | --- |
| `MistralInstructionTokenizerMapper` | Tokenizes and builds the intermediate tensor components of a prompt. |
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `MISTRAL_SPECIAL_STRINGS` | `Final[InstructionSpecialStrings]` | Special string components of the Mistral prompt. |
`MISTRAL_SPECIAL_STRINGS` *module-attribute*
```python
MISTRAL_SPECIAL_STRINGS: Final[InstructionSpecialStrings] = InstructionSpecialStrings(
    INSTRUCTION_START="[INST]",
    SYSTEM_PROMPT_START="",
    SYSTEM_PROMPT_END="",
    CONTEXT_START="###",
    INSTRUCTION_END="[/INST]",
)
```
Special string components of the Mistral prompt.
Based on: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.
The prompt wraps the user's text in `[INST] ... [/INST]` tags. Mistral makes no distinction between system prompts, instructions, or bodies/context in its instruction-tuning format; delimiters such as `###` and `<<>>` are used to mark boundaries between sections of text.
See also:

- https://docs.mistral.ai/guides/prompting-capabilities/
- https://www.promptingguide.ai/models/mistral-7b#mistral-7b-instruct
`MistralInstructionTokenizerMapper` *dataclass*

Bases: `InstructionTokenizerMapper`
Tokenizes and builds the intermediate tensor components of a prompt.
Classes:

| Name | Description |
| --- | --- |
| `PromptTokens` | Collection of all tokenized components of the prompt. |
| `SchemaTokens` | Tokenized intermediate prompt schema. |
| `SpecialTokens` | Tokenized special components of the prompt. |

Methods:

| Name | Description |
| --- | --- |
| `tokenize` | Tokenize the text. |
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `always_include_context` | `bool` | Whether to always include the start of context tokens in the prompt, even if no context is provided. |
| `special_tokens` | `SpecialTokens` | The tokenized special prompt strings. |
| `tokenizer` | `PreTrainedTokenizerBase` | The LLM tokenizer to use. |
`always_include_context` *class-attribute, instance-attribute*

```python
always_include_context: bool = False
```

Whether to always include the start of context tokens in the prompt, even if no context is provided.
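The effect of this flag can be sketched with plain lists of token ids standing in for torch Tensors; the `append_context` helper below is hypothetical, not the class's implementation:

```python
def append_context(
    instruction_ids: list[int],
    context_ids: list[int],
    context_start_ids: list[int],
    always_include_context: bool = False,
) -> list[int]:
    """Append the start-of-context tokens (and context) only when needed."""
    if context_ids or always_include_context:
        return instruction_ids + context_start_ids + context_ids
    return instruction_ids


# With no context, the "###" delimiter tokens appear only when the flag is set.
assert append_context([1, 2], [], [9]) == [1, 2]
assert append_context([1, 2], [], [9], always_include_context=True) == [1, 2, 9]
```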
`special_tokens` *class-attribute, instance-attribute*

```python
special_tokens: SpecialTokens = field(init=False)
```

The tokenized special prompt strings.
`PromptTokens`

Bases: `TypedDict`
Collection of all tokenized components of the prompt.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `schema_tokens` | `SchemaTokens` | The tokenized schema components of the prompt. |
| `special_tokens` | `SpecialTokens` | The tokenized special components of the prompt. |
`schema_tokens` *instance-attribute*

```python
schema_tokens: SchemaTokens
```

The tokenized schema components of the prompt.
`special_tokens` *instance-attribute*

```python
special_tokens: SpecialTokens
```

The tokenized special components of the prompt.
`SchemaTokens`

Bases: `TypedDict`
Tokenized intermediate prompt schema.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `context` | `Tensor` | An optional context to append to the instruction. |
| `instruction` | `Tensor` | The input to the model. |
| `response` | `Tensor` | The expected model response to the instruction. |
| `system_prompt` | `Tensor` | An optional system prompt for the model. |
`SpecialTokens`

Bases: `TypedDict`
Tokenized special components of the prompt.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `bos` | `Tensor` | The beginning of string token. |
| `context_start` | `Tensor` | The delimiter between the instruction and the context. |
| `eos` | `Tensor` | The end of string token. |
| `instruction_end` | `Tensor` | The end of the instruction tag. |
| `instruction_start` | `Tensor` | The start of the instruction tag. |
| `system_prompt_end` | `Tensor` | The end of the system prompt. |
| `system_prompt_start` | `Tensor` | The start of the system prompt. |
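Putting the two TypedDicts together, one plausible assembly order for a full single-turn sequence is sketched below, with lists of ids standing in for Tensors. The `assemble` function and its exact ordering are assumptions for illustration, not the library's implementation:

```python
def assemble(special: dict[str, list[int]], schema: dict[str, list[int]]) -> list[int]:
    """Concatenate special and schema token ids into one input sequence."""
    ids = special["bos"] + special["instruction_start"]
    ids += special["system_prompt_start"] + schema["system_prompt"] + special["system_prompt_end"]
    ids += schema["instruction"]
    if schema["context"]:  # optional context, delimited by the "###" tokens
        ids += special["context_start"] + schema["context"]
    ids += special["instruction_end"] + schema["response"] + special["eos"]
    return ids


special = {
    "bos": [1], "eos": [5], "instruction_start": [2], "instruction_end": [4],
    "system_prompt_start": [], "system_prompt_end": [], "context_start": [3],
}
schema = {"system_prompt": [], "instruction": [10], "context": [11], "response": [12]}
assemble(special, schema)
# [1, 2, 10, 3, 11, 4, 12, 5]
```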
`MistralMultiturnTransformLayerMapper`

Based on https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2#instruction-format.

For multiturn chats, each user message is wrapped in `[INST] ... [/INST]` tags and each assistant response is terminated by the end of string token.
Added in version 0.87.0.
Changed in version 0.97.0: Messages can optionally start with a system prompt.
Methods:

| Name | Description |
| --- | --- |
| `__call__` | Tokenizes and builds the intermediate tensor components of a multiturn prompt. |
| `__init__` | Tokenizes and builds tensors for `TransformLayer` from multiturn messages. |
`__call__`
Tokenizes and builds the intermediate tensor components of a multiturn prompt.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| | `Sequence[Schema]` | A sequence of messages in the conversation. This argument name is used for consistency with other mappers. | *required* |
Returns:

| Type | Description |
| --- | --- |
| `universal.UndifferentiatedTransformLayerInput` | The intermediate tensor components of the multiturn prompt. |
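Per the linked instruction-format docs, multiturn chats interleave `[INST] ... [/INST]` user turns with assistant replies terminated by `</s>`. The string-level sketch below shows that layout; the `multiturn_prompt` helper is hypothetical, and the real mapper produces tensors rather than strings:

```python
BOS, EOS = "<s>", "</s>"


def multiturn_prompt(turns: list[tuple[str, str]]) -> str:
    """turns alternates ("user", text) and ("assistant", text) messages."""
    out = BOS
    for role, text in turns:
        if role == "user":
            out += f"[INST] {text} [/INST]"
        else:
            out += f" {text}{EOS}"
    return out


print(multiturn_prompt([("user", "Hi"), ("assistant", "Hello!"), ("user", "Bye?")]))
# <s>[INST] Hi [/INST] Hello!</s>[INST] Bye? [/INST]
```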
`__init__`

```python
__init__(tokenizer: PreTrainedTokenizerBase) -> None
```

Tokenizes and builds tensors for `TransformLayer` from multiturn messages.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenizer` | `PreTrainedTokenizerBase` | The tokenizer to use for tokenization. | *required* |