universal
Model-agnostic Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building LLM prompts for Stained Glass Transform training and testing.
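Because these mappers are plain callables over sample dictionaries, they slot into datasets.Dataset.map or an ordinary loop interchangeably. A minimal sketch of the pattern; `toy_schema_mapper` is a hypothetical stand-in, not a class from this module:

```python
def toy_schema_mapper(sample):
    # Rename dataset-specific columns to a universal schema,
    # mirroring what a SchemaMapper-style callable does.
    return {
        "instruction": sample["question"],
        "response": sample["answer"],
    }

samples = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Name a primary color.", "answer": "Red"},
]

# Equivalent in spirit to datasets.Dataset.from_list(samples).map(toy_schema_mapper):
mapped = [toy_schema_mapper(s) for s in samples]
```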
ChatFormatMapper
dataclass
¶
Builds the tensor components of the transformers.PreTrainedModel chat prompt.
Added in version 0.77.0.
ChatRoleStrings
dataclass
¶
ChatSchemaMapper
dataclass
¶
Bases: SchemaMapper
Maps samples from an arbitrary dataset to a universal schema for building an LLM chat prompt.
Either define a subclass for easier re-use, or use this class directly.
Examples:
>>> sample = {
... "question": "What is the capital of France?",
... "response": "Paris",
... "system_prompt": "Answer the following question:",
... }
>>> mapper = ChatSchemaMapper(
... instruction_key="question",
... response_key="response",
... system_prompt_key="system_prompt",
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
[{'role': 'system', 'content': 'Answer the following question:'}, {'role': 'user', 'content': 'What is the capital of France?'}, {'role': 'assistant', 'content': 'Paris'}]
Added in version 0.77.0.
instruction_key
instance-attribute
¶
instruction_key: str
The dataset key/column corresponding to the input.
response_key
instance-attribute
¶
response_key: str | None
An optional dataset key/column corresponding to the expected model response to the instruction.
ChatSpecialStrings
dataclass
¶
Special string components of a chat prompt.
An instance of this class is expected to be defined for each model to dictate the structure of its prompt.
Added in version 0.77.0.
ChatTokenizerMapper
dataclass
¶
Bases: TokenizerMapper, ABC
Tokenizes and builds the intermediate tensor components of a chat prompt.
Added in version 0.77.0.
special_strings
class-attribute
instance-attribute
¶
special_strings: ChatSpecialStrings = field(init=False)
The special prompt strings to use.
special_tokens
class-attribute
instance-attribute
¶
special_tokens: SpecialTokens = field(init=False)
The tokenized special prompt strings.
PromptTokens
¶
Bases: TypedDict
Collection of all tokenized components of the prompt.
schema_tokens
instance-attribute
¶
schema_tokens: list[SchemaTokens]
The tokenized schema components of the prompt.
special_tokens
instance-attribute
¶
special_tokens: SpecialTokens
The tokenized special components of the prompt.
SchemaTokens
¶
SpecialTokens
¶
Bases: TypedDict
Tokenized special components of the prompt.
InstructionFormatMapper
dataclass
¶
Builds the tensor components of the transformers.PreTrainedModel instruction prompt.
PromptIndices
¶
Bases: TypedDict
Indices of the prompt components in the input_ids tensor.
Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.
Examples:
Using the PromptIndices to extract the instruction from the input_ids tensor:
>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
... "special_tokens": {
... "bos": torch.tensor([[1]]),
... "instruction_start": torch.tensor([[2]]),
... "system_prompt_start": torch.tensor([[3]]),
... "system_prompt_end": torch.tensor([[4]]),
... "context_start": torch.tensor([[5]]),
... "instruction_end": torch.tensor([[6]]),
... "eos": torch.tensor([[7]]),
... },
... "schema_tokens": {
... "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
... "response": torch.tensor([[13, 14, 15]]),
... "system_prompt": torch.tensor([[16, 17, 18, 19]]),
... "context": torch.tensor([[20, 21, 22]]),
... },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
... sample["schema_tokens"]["instruction"],
... formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )
InstructionSchemaMapper
dataclass
¶
Bases: SchemaMapper
Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.
Either define a subclass for easier re-use, or use this class directly.
Examples:
>>> sample = {
... "question": "What is the capital of France?",
... "response": "Paris",
... "system_prompt": "Answer the following question:",
... }
>>> mapper = InstructionSchemaMapper(
... instruction_key="question",
... response_key="response",
... system_prompt_key="system_prompt",
... context_key=None,
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
{'instruction': 'What is the capital of France?', 'response': 'Paris', 'context': '', 'system_prompt': 'Answer the following question:'}
context_key
instance-attribute
¶
context_key: str | None
An optional dataset key/column corresponding to context to append to the instruction.
instruction_key
instance-attribute
¶
instruction_key: str
The dataset key/column corresponding to the input.
response_key
instance-attribute
¶
response_key: str | None
An optional dataset key/column corresponding to the expected model response to the instruction.
system_prompt_key
instance-attribute
¶
system_prompt_key: str | None
An optional dataset key/column corresponding to the system prompt for the model.
InstructionSpecialStrings
dataclass
¶
Special string components of an instruction-tuning prompt.
An instance of this class is expected to be defined for each model to dictate the structure of its prompt.
Added in version 0.77.0. Renamed `SpecialStrings` to `InstructionSpecialStrings`.
CONTEXT_START
instance-attribute
¶
The delimiter between the instruction and the context.
INSTRUCTION_END
instance-attribute
¶
The end of the instruction tag. The model is highly sensitive to this tag.
INSTRUCTION_START
instance-attribute
¶
The start of the instruction. The model is highly sensitive to this tag.
InstructionTokenizerMapper
dataclass
¶
Bases: TokenizerMapper, ABC
Tokenizes and builds the intermediate tensor components of an instruction prompt.
always_include_context
class-attribute
instance-attribute
¶
always_include_context: bool = False
Whether to always include the start of context tokens in the prompt, even if no context is provided.
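The effect of this flag can be sketched in plain Python. This is a hypothetical assembly function with a made-up delimiter string, not the library's actual prompt-building logic:

```python
def assemble(instruction, context, always_include_context=False, context_start="[CTX]"):
    # Hypothetical sketch: the context-start delimiter is emitted when context
    # is present, or unconditionally when always_include_context is set.
    parts = [instruction]
    if context or always_include_context:
        parts.append(context_start)
    if context:
        parts.append(context)
    return " ".join(parts)

without_flag = assemble("Q?", "")
with_flag = assemble("Q?", "", always_include_context=True)
with_context = assemble("Q?", "some context")
```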
special_strings
class-attribute
instance-attribute
¶
special_strings: InstructionSpecialStrings = field(init=False)
The special prompt strings to use.
special_tokens
class-attribute
instance-attribute
¶
special_tokens: SpecialTokens = field(init=False)
The tokenized special prompt strings.
PromptTokens
¶
Bases: TypedDict
Collection of all tokenized components of the prompt.
schema_tokens
instance-attribute
¶
schema_tokens: SchemaTokens
The tokenized schema components of the prompt.
special_tokens
instance-attribute
¶
special_tokens: SpecialTokens
The tokenized special components of the prompt.
SchemaTokens
¶
Bases: TypedDict
Tokenized intermediate prompt schema.
SchemaMapper
dataclass
¶
Bases: ABC
Maps samples from an arbitrary dataset to a universal schema for building an LLM prompt.
Added in version 0.77.0. Base class for `InstructionSchemaMapper` and `ChatSchemaMapper`.
instruction_key
instance-attribute
¶
instruction_key: str
The dataset key/column corresponding to the input.
response_key
instance-attribute
¶
response_key: str | None
An optional dataset key/column corresponding to the expected model response to the instruction.
system_prompt_key
instance-attribute
¶
system_prompt_key: str | None
An optional dataset key/column corresponding to the system prompt for the model.
Schema
¶
Bases: TypedDict
Base schema for building an LLM prompt.
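The "define a subclass for easier re-use" pattern can be sketched with a minimal stand-in. The class names and fields below are hypothetical, not the library's actual base class:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToySchemaMapper:
    # Minimal stand-in for a SchemaMapper-style dataclass.
    instruction_key: str
    response_key: str | None = None

    def __call__(self, sample):
        return {
            "instruction": sample[self.instruction_key],
            "response": sample[self.response_key] if self.response_key else "",
        }


@dataclass
class SquadLikeMapper(ToySchemaMapper):
    # A subclass pins the dataset-specific keys once, for easy re-use
    # across every split of the same dataset.
    instruction_key: str = "question"
    response_key: str | None = "answer"


mapper = SquadLikeMapper()
result = mapper({"question": "What is the capital of France?", "answer": "Paris"})
```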
TensorToListMapper
dataclass
¶
Maps a dictionary of int64 tensors to a dictionary of lists of int.
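The conversion is essentially a per-key Tensor.tolist(). A minimal sketch, assuming torch is installed; `to_lists` is a hypothetical helper, not the mapper class itself:

```python
import torch


def to_lists(sample):
    # Hypothetical stand-in: convert each int64 tensor in the sample
    # to nested lists of plain Python ints.
    return {key: value.tolist() for key, value in sample.items()}


sample = {"input_ids": torch.tensor([[1, 2, 3]], dtype=torch.int64)}
converted = to_lists(sample)
```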
TestMapper
dataclass
¶
Formats the undifferentiated LlamaForCausalLM input for testing.
Added in version 0.77.0. Renamed `InstructionTestMapper` to `TestMapper`.
TestInput
¶
Bases: TypedDict, Generic[ContainerT]
Input for LlamaForCausalLM testing.
TokenizerMapper
dataclass
¶
Bases: ABC
Tokenizes and builds the intermediate tensor components of a prompt.
Added in version 0.77.0. Base class for `InstructionTokenizerMapper` and `ChatTokenizerMapper`.
TrainMapper
dataclass
¶
Formats the undifferentiated transformers.PreTrainedModel input for training.
Added in version 0.77.0. Renamed `InstructionTrainMapper` to `TrainMapper`.
TrainInput
¶
TransformLayerChatFormatMapper
dataclass
¶
Bases: TransformLayerFormatMapper, ChatFormatMapper
Builds the noise token mask for a chat prompt, which is required for training a TransformLayer.
Added in version 0.77.0.
Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper.
TransformLayerFormatMapper
dataclass
¶
Base class for building noise token mask.
Parameters:
Name | Type | Description | Default
---|---|---|---
transform_all_tokens | bool | Whether to transform all the tokens, or only the instruction, context, and possibly the system prompt. | False
Added in version 0.77.0. Base class for `TransformLayerInstructionFormatMapper` and `TransformLayerChatFormatMapper`.
Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper.
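The transform_all_tokens switch can be illustrated with a toy noise-mask builder. This is a hypothetical sketch over token positions, not the library's implementation:

```python
def build_noise_mask(seq_len, instruction_slice, transform_all_tokens=False):
    # Hypothetical sketch: mark which token positions the TransformLayer
    # should obfuscate. True means "transform this token".
    if transform_all_tokens:
        return [True] * seq_len
    mask = [False] * seq_len
    for i in range(*instruction_slice):
        mask[i] = True
    return mask


# Transform only positions 2..4 of a 6-token prompt:
partial_mask = build_noise_mask(6, (2, 5))
full_mask = build_noise_mask(6, (2, 5), transform_all_tokens=True)
```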
TransformLayerInstructionFormatMapper
dataclass
¶
Bases: TransformLayerFormatMapper, InstructionFormatMapper
Builds the noise token mask for an instruction prompt, which is required for training a TransformLayer.
PromptIndices
¶
Bases: TypedDict
Indices of the prompt components in the input_ids tensor.
Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.
Examples:
Using the PromptIndices to extract the instruction from the input_ids tensor:
>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
... "special_tokens": {
... "bos": torch.tensor([[1]]),
... "instruction_start": torch.tensor([[2]]),
... "system_prompt_start": torch.tensor([[3]]),
... "system_prompt_end": torch.tensor([[4]]),
... "context_start": torch.tensor([[5]]),
... "instruction_end": torch.tensor([[6]]),
... "eos": torch.tensor([[7]]),
... },
... "schema_tokens": {
... "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
... "response": torch.tensor([[13, 14, 15]]),
... "system_prompt": torch.tensor([[16, 17, 18, 19]]),
... "context": torch.tensor([[20, 21, 22]]),
... },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
... sample["schema_tokens"]["instruction"],
... formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )
__call__
¶
__call__(sample: PromptTokens) -> UndifferentiatedTransformLayerInput
Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.
Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper.
TransformLayerTestMapper
dataclass
¶
Bases: TestMapper
Formats the undifferentiated InstructionTransformLayer input for testing.
Added in version 0.77.0. Renamed `TransformLayerInstructionTestMapper` to `TransformLayerTestMapper`.
TestInput
¶
Bases: TypedDict, Generic[ContainerT]
Input for LlamaForCausalLM testing.
TransformLayerTrainMapper
dataclass
¶
Bases: TrainMapper
Formats the undifferentiated InstructionTransformLayer input for training.
Added in version 0.77.0. Renamed `TransformLayerInstructionTrainMapper` to `TransformLayerTrainMapper`.
ignore_prompt_loss
class-attribute
instance-attribute
¶
ignore_prompt_loss: bool = True
Whether to ignore the loss on the prompt tokens.
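Ignoring loss on prompt tokens is conventionally done by setting the corresponding label positions to -100, the default ignore_index of torch.nn.CrossEntropyLoss. A hedged sketch with a hypothetical helper, not the mapper's actual code:

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(input_ids, prompt_len):
    # Hypothetical sketch: copy input_ids into labels, then mask the prompt
    # positions so only the response tokens contribute to the loss.
    labels = list(input_ids)
    for i in range(prompt_len):
        labels[i] = IGNORE_INDEX
    return labels


labels = mask_prompt_labels([1, 8, 9, 13, 14, 15], prompt_len=3)
```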
TrainInput
¶
TransformLayerTrainInput
¶
Bases: TrainInput[ContainerT]
Input for TransformLayer training.
__call__
¶
__call__(sample: UndifferentiatedTransformLayerInput) -> TransformLayerTrainInput[torch.Tensor]
Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.
UndifferentiatedInput
¶
Bases: TypedDict
Formatted input for the transformers.PreTrainedModel that must be further formatted into either training or testing input, depending on whether the model is being trained or evaluated.
Added in version 0.77.0. Renamed `InstructionFormatMapper.UndifferentiatedInstructionInput` to `UndifferentiatedInput`.
UndifferentiatedTransformLayerInput
¶
Bases: UndifferentiatedInput
Formatted input for the TransformLayer that must be further formatted into either training or testing input, depending on whether the model is being trained or evaluated.
Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.
Added in version 0.77.0. Renamed `TransformLayerInstructionFormatMapper.UndifferentiatedTransformLayerInstructionInput` to `UndifferentiatedTransformLayerInput`.