universal

Model-agnostic Mapper classes (designed to be compatible with datasets.Dataset.map) for building LLM prompts for Stained Glass Transform training and testing.
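For example, a schema mapper can be applied over a datasets.Dataset. The following is an illustrative sketch (the column names and data are hypothetical) using the InstructionSchemaMapper documented below:

>>> import datasets
>>> dataset = datasets.Dataset.from_dict(
...     {
...         "question": ["What is the capital of France?"],
...         "response": ["Paris"],
...         "system_prompt": ["Answer the following question:"],
...     }
... )
>>> mapper = InstructionSchemaMapper(
...     instruction_key="question",
...     response_key="response",
...     system_prompt_key="system_prompt",
...     context_key=None,
... )
>>> mapped_dataset = dataset.map(mapper)  # adds the universal schema columns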

ChatFormatMapper dataclass

Builds the tensor components of the transformers.PreTrainedModel chat prompt.

Added in version 0.77.0.

ChatRoleStrings dataclass

Role strings of a chat prompt.

Added in version 0.77.0.

ASSISTANT_ROLE class-attribute instance-attribute

ASSISTANT_ROLE: Final[str] = 'assistant'

The assistant role.

SYSTEM_ROLE class-attribute instance-attribute

SYSTEM_ROLE: Final[str] = 'system'

The system role.

USER_ROLE class-attribute instance-attribute

USER_ROLE: Final[str] = 'user'

The user role.
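For example, the documented role constants can be used to assemble a chat message list in the universal schema (the message contents here are illustrative):

>>> messages = [
...     {"role": ChatRoleStrings.SYSTEM_ROLE, "content": "Answer the following question:"},
...     {"role": ChatRoleStrings.USER_ROLE, "content": "What is the capital of France?"},
...     {"role": ChatRoleStrings.ASSISTANT_ROLE, "content": "Paris"},
... ]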

ChatSchemaMapper dataclass

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM chat prompt.

Either define a subclass for easier reuse (see the sketch after the example), or use this class directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
...     "response": "Paris",
...     "system_prompt": "Answer the following question:",
... }
>>> mapper = ChatSchemaMapper(
...     instruction_key="question",
...     response_key="response",
...     system_prompt_key="system_prompt",
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
[{'role': 'system', 'content': 'Answer the following question:'}, {'role': 'user', 'content': 'What is the capital of France?'}, {'role': 'assistant', 'content': 'Paris'}]

Added in version 0.77.0.
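A minimal subclass sketch for reuse with a fixed dataset layout; the subclass name and default keys are hypothetical, assuming ordinary dataclass subclassing:

>>> import dataclasses
>>> @dataclasses.dataclass
... class TriviaChatSchemaMapper(ChatSchemaMapper):
...     instruction_key: str = "question"
...     response_key: str | None = "response"
...     system_prompt_key: str | None = "system_prompt"
>>> mapper = TriviaChatSchemaMapper()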

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

response_key instance-attribute

response_key: str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key instance-attribute

system_prompt_key: str | None

An optional dataset key/column corresponding to the system prompt for the model.

Schema

Bases: Schema

Universal schema for building an LLM chat prompt.

Added in version 0.77.0.

content instance-attribute

content: str

The content of the message.

role instance-attribute

role: str

The role of the message.

ChatSpecialStrings dataclass

Special string components of a chat prompt.

An instance of this class is expected to be defined for each model to dictate the structure of its prompt.

Added in version 0.77.0.

MESSAGE_END instance-attribute

MESSAGE_END: Final[str]

The end of a message.

ROLES instance-attribute

The role strings of a chat prompt.

ROLE_HEADER_END instance-attribute

ROLE_HEADER_END: Final[str]

The end of a role header.

ROLE_HEADER_START instance-attribute

ROLE_HEADER_START: Final[str]

The start of a role header.
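An illustrative instance using hypothetical Llama-3-style strings (the exact strings are not taken from this library, and keyword construction is an assumption):

>>> llama3_like_strings = ChatSpecialStrings(
...     ROLE_HEADER_START="<|start_header_id|>",
...     ROLE_HEADER_END="<|end_header_id|>",
...     MESSAGE_END="<|eot_id|>",
...     ROLES=ChatRoleStrings(),
... )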

ChatTokenizerMapper dataclass

Bases: TokenizerMapper, ABC

Tokenizes and builds the intermediate tensor components of a chat prompt.

Added in version 0.77.0.

special_strings class-attribute instance-attribute

special_strings: ChatSpecialStrings = field(init=False)

The special prompt strings to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

schema_tokens instance-attribute

schema_tokens: list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

content instance-attribute

content: Tensor

The content of the message.

role instance-attribute

role: Tensor

The role of the message.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

assistant_role instance-attribute

assistant_role: Tensor

The assistant role.

bos instance-attribute

bos: Tensor

The beginning of string token.

message_end instance-attribute

message_end: Tensor

The end of a message.

role_header_end instance-attribute

role_header_end: Tensor

The end of the role header.

role_header_start instance-attribute

role_header_start: Tensor

The start of the role header.

system_role instance-attribute

system_role: Tensor

The system role.

user_role instance-attribute

user_role: Tensor

The user role.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

    text (str): The text to tokenize. Required.

Returns:

    torch.Tensor: An int64 tensor of token ids.

InstructionFormatMapper dataclass

Builds the tensor components of the transformers.PreTrainedModel instruction prompt.

PromptIndices

Bases: TypedDict

Indices of the prompt components in the input_ids tensor.

Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.

Examples:

Using the PromptIndices to extract the instruction from the input_ids tensor:

>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
...     "special_tokens": {
...         "bos": torch.tensor([[1]]),
...         "instruction_start": torch.tensor([[2]]),
...         "system_prompt_start": torch.tensor([[3]]),
...         "system_prompt_end": torch.tensor([[4]]),
...         "context_start": torch.tensor([[5]]),
...         "instruction_end": torch.tensor([[6]]),
...         "eos": torch.tensor([[7]]),
...     },
...     "schema_tokens": {
...         "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
...         "response": torch.tensor([[13, 14, 15]]),
...         "system_prompt": torch.tensor([[16, 17, 18, 19]]),
...         "context": torch.tensor([[20, 21, 22]]),
...     },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
...     sample["schema_tokens"]["instruction"],
...     formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )

context instance-attribute

context: slice

The slice of input_ids containing the context.

instruction instance-attribute

instruction: slice

The slice of input_ids containing the instruction.

system_prompt instance-attribute

system_prompt: slice

The slice of input_ids containing the system prompt.

InstructionSchemaMapper dataclass

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.

Either define a subclass for easier reuse, or use this class directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
...     "response": "Paris",
...     "system_prompt": "Answer the following question:",
... }
>>> mapper = InstructionSchemaMapper(
...     instruction_key="question",
...     response_key="response",
...     system_prompt_key="system_prompt",
...     context_key=None,
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
{'instruction': 'What is the capital of France?', 'response': 'Paris', 'context': '', 'system_prompt': 'Answer the following question:'}

context_key instance-attribute

context_key: str | None

An optional dataset key/column corresponding to context to append to the instruction.

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

response_key instance-attribute

response_key: str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key instance-attribute

system_prompt_key: str | None

An optional dataset key/column corresponding to the system prompt for the model.

Schema

Bases: Schema

Universal schema for building an LLM instruction prompt.

Added in version 0.77.0. Renamed `InstructionSchema` to `InstructionSchemaMapper.Schema`.

context instance-attribute

context: str

An optional context to append to the instruction.

instruction instance-attribute

instruction: str

The input to the model.

response instance-attribute

response: str

The optional expected model response to the instruction.

system_prompt instance-attribute

system_prompt: str

An optional system prompt for the model.
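Because the schema is a TypedDict, an instance is a plain dict; this example mirrors the mapper example above:

>>> schema: InstructionSchemaMapper.Schema = {
...     "instruction": "What is the capital of France?",
...     "response": "Paris",
...     "context": "",
...     "system_prompt": "Answer the following question:",
... }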

InstructionSpecialStrings dataclass

Special string components of an instruction-tuning prompt.

An instance of this class is expected to be defined for each model to dictate the structure of its prompt.

Added in version 0.77.0. Renamed `SpecialStrings` to `InstructionSpecialStrings`.

CONTEXT_START instance-attribute

CONTEXT_START: Final[str]

The delimiter between the instruction and the context.

INSTRUCTION_END instance-attribute

INSTRUCTION_END: Final[str]

The end of the instruction tag. The model is highly sensitive to this tag.

INSTRUCTION_START instance-attribute

INSTRUCTION_START: Final[str]

The start of the instruction. The model is highly sensitive to this tag.

SYSTEM_PROMPT_END instance-attribute

SYSTEM_PROMPT_END: Final[str]

The end of the system prompt.

SYSTEM_PROMPT_START instance-attribute

SYSTEM_PROMPT_START: Final[str]

The start of the system prompt.
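An illustrative instance using hypothetical Llama-2-style tags (the exact strings are not taken from this library, and keyword construction is an assumption):

>>> llama2_like_strings = InstructionSpecialStrings(
...     INSTRUCTION_START="[INST]",
...     INSTRUCTION_END="[/INST]",
...     SYSTEM_PROMPT_START="<<SYS>>\n",
...     SYSTEM_PROMPT_END="\n<</SYS>>\n\n",
...     CONTEXT_START="\nContext:\n",
... )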

InstructionTokenizerMapper dataclass

Bases: TokenizerMapper, ABC

Tokenizes and builds the intermediate tensor components of an instruction prompt.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start-of-context tokens in the prompt, even if no context is provided.

special_strings class-attribute instance-attribute

special_strings: InstructionSpecialStrings = field(
    init=False
)

The special prompt strings to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

bos instance-attribute

bos: Tensor

The beginning of string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end of string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

    text (str): The text to tokenize. Required.

Returns:

    torch.Tensor: An int64 tensor of token ids.

PreTrainFormatMapper dataclass

Builds the tensor components of the transformers.PreTrainedModel pretraining prompt.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

PreTrainSchemaMapper dataclass

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM pretraining input.

Either define a subclass for easier reuse, or use this class directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
... }
>>> mapper = PreTrainSchemaMapper(
...     instruction_key="question",
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
{'text': 'What is the capital of France?'}

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

Schema

Bases: Schema

Universal schema for building an LLM pretraining input.

text instance-attribute

text: str

The input to the model.

PreTrainTokenizerMapper dataclass

Bases: TokenizerMapper

Tokenizes and builds the intermediate tensor components of a pretraining input which does not have a prompt.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

text instance-attribute

text: Tensor

The input to the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

bos instance-attribute

bos: Tensor

The beginning of string token.

eos instance-attribute

eos: Tensor

The end of string token.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

    text (str): The text to tokenize. Required.

Returns:

    torch.Tensor: An int64 tensor of token ids.

SchemaMapper dataclass

Bases: ABC

Maps samples from an arbitrary dataset to a universal schema for building an LLM prompt.

Added in version 0.77.0. Base class for `InstructionSchemaMapper` and `ChatSchemaMapper`.

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

Schema

Bases: TypedDict

Base schema for building an LLM prompt.

TensorToListMapper dataclass

Maps a dictionary of int64 tensors to a dictionary of lists of int.
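A sketch of the expected behavior, assuming the mapper is called directly on a sample dictionary:

>>> mapper = TensorToListMapper()
>>> mapper({"input_ids": torch.tensor([1, 2, 3])})
{'input_ids': [1, 2, 3]}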

TestMapper dataclass

Formats the undifferentiated LlamaForCausalLM input for testing.

Added in version 0.77.0. Renamed `InstructionTestMapper` to `TestMapper`.

TestInput

Bases: TypedDict, Generic[ContainerT]

Input for LlamaForCausalLM testing.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

labels instance-attribute

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

TokenizerMapper dataclass

Bases: ABC

Tokenizes and builds the intermediate tensor components of a prompt.

Added in version 0.77.0. Base class for `InstructionTokenizerMapper` and `ChatTokenizerMapper`.

tokenizer instance-attribute

The LLM tokenizer to use.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

    text (str): The text to tokenize. Required.

Returns:

    torch.Tensor: An int64 tensor of token ids.
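A sketch of typical use on a concrete subclass instance (the exact token ids and length depend on the tokenizer):

>>> token_ids = mapper.tokenize("What is the capital of France?")
>>> token_ids.dtype
torch.int64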

TrainMapper dataclass

Formats the undifferentiated transformers.PreTrainedModel input for training.

Added in version 0.77.0. Renamed `InstructionTrainMapper` to `TrainMapper`.

TrainInput

Bases: TypedDict, Generic[ContainerT]

Input for transformers.PreTrainedModel training.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

TransformLayerChatFormatMapper dataclass

Bases: TransformLayerFormatMapper, ChatFormatMapper

Builds the noise token mask for a chat prompt, which is required for training a TransformLayer.

Added in version 0.77.0.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper.

TransformLayerFormatMapper dataclass

Base class for building noise token mask.

Parameters:

    transform_all_tokens (bool): Whether to transform all of the tokens, or only the instruction, context, and possibly the system prompt. Defaults to False.

Added in version 0.77.0. Base class for `TransformLayerInstructionFormatMapper` and `TransformLayerChatFormatMapper`.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper.

TransformLayerInstructionFormatMapper dataclass

Bases: TransformLayerFormatMapper, InstructionFormatMapper

Builds the noise token mask for an instruction prompt, which is required for training a TransformLayer.

PromptIndices

Bases: TypedDict

Indices of the prompt components in the input_ids tensor.

Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.

Examples:

Using the PromptIndices to extract the instruction from the input_ids tensor:

>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
...     "special_tokens": {
...         "bos": torch.tensor([[1]]),
...         "instruction_start": torch.tensor([[2]]),
...         "system_prompt_start": torch.tensor([[3]]),
...         "system_prompt_end": torch.tensor([[4]]),
...         "context_start": torch.tensor([[5]]),
...         "instruction_end": torch.tensor([[6]]),
...         "eos": torch.tensor([[7]]),
...     },
...     "schema_tokens": {
...         "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
...         "response": torch.tensor([[13, 14, 15]]),
...         "system_prompt": torch.tensor([[16, 17, 18, 19]]),
...         "context": torch.tensor([[20, 21, 22]]),
...     },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
...     sample["schema_tokens"]["instruction"],
...     formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )

context instance-attribute

context: slice

The slice of input_ids containing the context.

instruction instance-attribute

instruction: slice

The slice of input_ids containing the instruction.

system_prompt instance-attribute

system_prompt: slice

The slice of input_ids containing the system prompt.

__call__

__call__(
    sample: PromptTokens,
) -> UndifferentiatedTransformLayerInput

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper.

TransformLayerPreTrainFormatMapper dataclass

Bases: TransformLayerFormatMapper, PreTrainFormatMapper

Builds the noise token mask, which is required for training a TransformLayer, for a pretraining scenario that does not use a templated prompt.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

TransformLayerTestMapper dataclass

Bases: TestMapper

Formats the undifferentiated InstructionTransformLayer input for testing.

Added in version 0.77.0. Renamed `TransformLayerInstructionTestMapper` to `TransformLayerTestMapper`.

TestInput

Bases: TypedDict, Generic[ContainerT]

Input for LlamaForCausalLM testing.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

labels instance-attribute

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

TransformLayerTestInput

Bases: TestInput[ContainerT]

Input for InstructionTransformLayer testing.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

labels instance-attribute

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

noise_mask instance-attribute

noise_mask: ContainerT

The mask that dictates which tokens in input_ids to obfuscate.
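A sketch relating the mask to the ids, assuming 1-D tensors of equal length and hypothetical values:

>>> test_input: TransformLayerTestInput[torch.Tensor] = {
...     "input_ids": torch.tensor([1, 8, 9, 10, 6]),
...     "labels": torch.tensor([13, 14, 15]),
...     "noise_mask": torch.tensor([False, True, True, True, False]),
... }
>>> test_input["input_ids"][test_input["noise_mask"]]  # tokens to obfuscate
tensor([ 8,  9, 10])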

TransformLayerTrainMapper dataclass

Bases: TrainMapper

Formats the undifferentiated InstructionTransformLayer input for training.

Added in version 0.77.0. Renamed `TransformLayerInstructionTrainMapper` to `TransformLayerTrainMapper`.

ignore_prompt_loss class-attribute instance-attribute

ignore_prompt_loss: bool = True

Whether to ignore the loss on the prompt tokens.

TrainInput

Bases: TypedDict, Generic[ContainerT]

Input for transformers.PreTrainedModel training.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

TransformLayerTrainInput

Bases: TrainInput[ContainerT]

Input for TransformLayer training.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

loss_mask instance-attribute

loss_mask: ContainerT

The mask that dictates which tokens in input_ids to use to calculate the loss.

noise_mask instance-attribute

noise_mask: ContainerT

The mask that dictates which tokens in input_ids to obfuscate.
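A sketch of one way loss_mask can be applied when building labels, assuming the conventional cross-entropy ignore index of -100 and hypothetical values; the library's actual trainer integration may differ:

>>> train_input: TransformLayerTrainInput[torch.Tensor] = {
...     "input_ids": torch.tensor([1, 8, 9, 10, 13, 14, 15]),
...     "noise_mask": torch.tensor([False, True, True, True, False, False, False]),
...     "loss_mask": torch.tensor([False, False, False, False, True, True, True]),
... }
>>> labels = train_input["input_ids"].clone()
>>> labels[~train_input["loss_mask"]] = -100  # ignored by cross-entropy losses
>>> labels
tensor([-100, -100, -100, -100,   13,   14,   15])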

__call__

__call__(
    sample: UndifferentiatedTransformLayerInput,
) -> TransformLayerTrainInput[torch.Tensor]

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

UndifferentiatedInput

Bases: TypedDict

Formatted input for the transformers.PreTrainedModel that must be further formatted into either training or testing input, depending on whether the model is being trained or evaluated.

Added in version 0.77.0. Renamed `InstructionFormatMapper.UndifferentiatedInstructionInput` to `UndifferentiatedInput`.

input_ids instance-attribute

input_ids: Tensor

The input token ids.

response instance-attribute

response: NotRequired[Tensor]

The expected model response to the input_ids.

UndifferentiatedTransformLayerInput

Bases: UndifferentiatedInput

Formatted input for the TransformLayer that must be further formatted into either training or testing input, depending on whether the model is being trained or evaluated.

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Added in version 0.77.0. Renamed `TransformLayerInstructionFormatMapper.UndifferentiatedTransformLayerInstructionInput` to `UndifferentiatedTransformLayerInput`.

input_ids instance-attribute

input_ids: Tensor

The input token ids.

noise_mask instance-attribute

noise_mask: Tensor

The mask that dictates which tokens in input_ids to obfuscate.

response instance-attribute

response: NotRequired[Tensor]

The expected model response to the input_ids.