Skip to content

universal

Model-agnostic Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building LLM prompts for Stained Glass Transform training and testing.

Classes:

Name Description
ChatFormatMapper

Builds the tensor components of the transformers.PreTrainedModel chat prompt.

ChatRoleStrings

Role strings of a chat prompt.

ChatSchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM chat prompt.

ChatSpecialStrings

Special string components of a chat prompt.

ChatTokenizerMapper

Tokenizes and builds the intermediate tensor components of a chat prompt.

InstructionFormatMapper

Builds the tensor components of the transformers.PreTrainedModel instruction prompt.

InstructionSchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.

InstructionSpecialStrings

Special string components of an instruction-tuning prompt.

InstructionTokenizerMapper

Tokenizes and builds the intermediate tensor components of an instruction prompt.

PreTrainFormatMapper

Builds the tensor components of the transformers.PreTrainedModel pretraining prompt.

PreTrainSchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.

PreTrainTokenizerMapper

Tokenizes and builds the intermediate tensor components of a pretraining input which does not have a prompt.

SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM prompt.

TensorToListMapper

Maps a dictionary of int64 tensors to a dictionary of lists of int.

TestMapper

Formats the undifferentiated LlamaForCausalLM input for testing.

TokenizerMapper

Tokenizes and builds the intermediate tensor components of a prompt.

TrainMapper

Formats the undifferentiated transformers.PreTrainedModel input for training.

TransformLayerChatFormatMapper

Builds the noise token mask for a chat prompt, which is required for training a TransformLayer.

TransformLayerFormatMapper

Base class for building noise token mask.

TransformLayerInstructionFormatMapper

Builds the noise token mask for a instruction prompt, which is required for training a TransformLayer.

TransformLayerPreTrainFormatMapper

Builds the noise token mask for a pretraining scenario which does not use a templated prompt, which is required for training a

TransformLayerTestMapper

Formats the undifferentiated InstructionTransformLayer input for testing.

TransformLayerTrainMapper

Formats the undifferentiated InstructionTransformLayer input for training.

UndifferentiatedInput

Formatted input for the transformers.PreTrainedModel that must be further formatted into either training or testing input.

UndifferentiatedTransformLayerInput

Formatted input for the TransformLayer that must be further formatted into

ChatFormatMapper dataclass

Builds the tensor components of the transformers.PreTrainedModel chat prompt.

Added in version 0.77.0.

ChatRoleStrings dataclass

Role strings of a chat prompt.

Added in version 0.77.0.

Attributes:

Name Type Description
ASSISTANT_ROLE Final[str]

The assistant role.

SYSTEM_ROLE Final[str]

The system role.

USER_ROLE Final[str]

The user role.

ASSISTANT_ROLE class-attribute instance-attribute

ASSISTANT_ROLE: Final[str] = 'assistant'

The assistant role.

SYSTEM_ROLE class-attribute instance-attribute

SYSTEM_ROLE: Final[str] = 'system'

The system role.

USER_ROLE class-attribute instance-attribute

USER_ROLE: Final[str] = 'user'

The user role.

ChatSchemaMapper dataclass

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM chat prompt.

Either define a subclass for easier reuse, or use this directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
...     "response": "Paris",
...     "system_prompt": "Answer the following question:",
... }
>>> mapper = ChatSchemaMapper(
...     instruction_key="question",
...     response_key="response",
...     system_prompt_key="system_prompt",
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
[{'role': 'system', 'content': 'Answer the following question:'}, {'role': 'user', 'content': 'What is the capital of France?'}, {'role': 'assistant', 'content': 'Paris'}]

Added in version 0.77.0.

Classes:

Name Description
Schema

Universal schema for building an LLM chat prompt.

Attributes:

Name Type Description
instruction_key str

The dataset key/column corresponding to the input.

response_key str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key str | None

An optional dataset key/column corresponding to the system prompt for the model.

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

response_key instance-attribute

response_key: str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key instance-attribute

system_prompt_key: str | None

An optional dataset key/column corresponding to the system prompt for the model.

Schema

Bases: Schema

Universal schema for building an LLM chat prompt.

Added in version 0.77.0.

Attributes:

Name Type Description
content str

The content of the message.

role str

The role of the message.

content instance-attribute

content: str

The content of the message.

role instance-attribute

role: str

The role of the message.

ChatSpecialStrings dataclass

Special string components of a chat prompt.

An instance of this class is expected to be defined for each model to dictate the structure of its prompt.

Added in version 0.77.0.

Attributes:

Name Type Description
MESSAGE_END Final[str]

The end of a message.

ROLES Final[ChatRoleStrings]

The role strings of a chat prompt.

ROLE_HEADER_END Final[str]

The end of a role header.

ROLE_HEADER_START Final[str]

The start of a role header.

MESSAGE_END instance-attribute

MESSAGE_END: Final[str]

The end of a message.

ROLES instance-attribute

The role strings of a chat prompt.

ROLE_HEADER_END instance-attribute

ROLE_HEADER_END: Final[str]

The end of a role header.

ROLE_HEADER_START instance-attribute

ROLE_HEADER_START: Final[str]

The start of a role header.

ChatTokenizerMapper dataclass

Bases: TokenizerMapper, ABC

Tokenizes and builds the intermediate tensor components of a chat prompt.

Added in version 0.77.0.

Classes:

Name Description
PromptTokens

Collection of all tokenized components of the prompt.

SchemaTokens

Tokenized intermediate prompt schema.

SpecialTokens

Tokenized special components of the prompt.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
special_strings ChatSpecialStrings

The special prompt strings to use.

special_tokens SpecialTokens

The tokenized special prompt strings.

tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

special_strings class-attribute instance-attribute

special_strings: ChatSpecialStrings = field(init=False)

The special prompt strings to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name Type Description
schema_tokens list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens SpecialTokens

The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name Type Description
content Tensor

The content of the message.

role Tensor

The role of the message.

content instance-attribute

content: Tensor

The content of the message.

role instance-attribute

role: Tensor

The role of the message.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name Type Description
assistant_role Tensor

The assistant role.

bos Tensor

The beginning of string token.

message_end Tensor

The end of a message.

role_header_end Tensor

The end of the role header.

role_header_start Tensor

The start of the role header.

system_role Tensor

The system role.

user_role Tensor

The user role.

assistant_role instance-attribute

assistant_role: Tensor

The assistant role.

bos instance-attribute

bos: Tensor

The beginning of string token.

message_end instance-attribute

message_end: Tensor

The end of a message.

role_header_end instance-attribute

role_header_end: Tensor

The end of the role header.

role_header_start instance-attribute

role_header_start: Tensor

The start of the role header.

system_role instance-attribute

system_role: Tensor

The system role.

user_role instance-attribute

user_role: Tensor

The user role.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.

InstructionFormatMapper dataclass

Builds the tensor components of the transformers.PreTrainedModel instruction prompt.

Classes:

Name Description
PromptIndices

Indices of the prompt components in the input_ids tensor.

PromptIndices

Bases: TypedDict

Indices of the prompt components in the input_ids tensor.

Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.

Examples:

Using the PromptIndices to extract the instruction from the input_ids tensor:

>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
...     "special_tokens": {
...         "bos": torch.tensor([[1]]),
...         "instruction_start": torch.tensor([[2]]),
...         "system_prompt_start": torch.tensor([[3]]),
...         "system_prompt_end": torch.tensor([[4]]),
...         "context_start": torch.tensor([[5]]),
...         "instruction_end": torch.tensor([[6]]),
...         "eos": torch.tensor([[7]]),
...     },
...     "schema_tokens": {
...         "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
...         "response": torch.tensor([[13, 14, 15]]),
...         "system_prompt": torch.tensor([[16, 17, 18, 19]]),
...         "context": torch.tensor([[20, 21, 22]]),
...     },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
...     sample["schema_tokens"]["instruction"],
...     formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )

Attributes:

Name Type Description
context slice

The slice of input_ids containing the context.

instruction slice

The slice of input_ids containing the instruction.

system_prompt slice

The slice of input_ids containing the system prompt.

context instance-attribute

context: slice

The slice of input_ids containing the context.

instruction instance-attribute

instruction: slice

The slice of input_ids containing the instruction.

system_prompt instance-attribute

system_prompt: slice

The slice of input_ids containing the system prompt.

InstructionSchemaMapper dataclass

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.

Either define a subclass for easier reuse, or use this class directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
...     "response": "Paris",
...     "system_prompt": "Answer the following question:",
... }
>>> mapper = InstructionSchemaMapper(
...     instruction_key="question",
...     response_key="response",
...     system_prompt_key="system_prompt",
...     context_key=None,
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
{'instruction': 'What is the capital of France?', 'response': 'Paris', 'context': '', 'system_prompt': 'Answer the following question:'}

Classes:

Name Description
Schema

Universal schema for building an LLM instruction prompt.

Attributes:

Name Type Description
context_key str | None

An optional dataset key/column corresponding to context to append to the instruction.

instruction_key str

The dataset key/column corresponding to the input.

response_key str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key str | None

An optional dataset key/column corresponding to the system prompt for the model.

context_key instance-attribute

context_key: str | None

An optional dataset key/column corresponding to context to append to the instruction.

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

response_key instance-attribute

response_key: str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key instance-attribute

system_prompt_key: str | None

An optional dataset key/column corresponding to the system prompt for the model.

Schema

Bases: Schema

Universal schema for building an LLM instruction prompt.

Added in version 0.77.0. Renamed `InstructionSchema` to `InstructionSchemaMapper.Schema`.

Attributes:

Name Type Description
context str

An optional context to append to the instruction.

instruction str

The input to the model.

response str

The optional expected model response to the instruction.

system_prompt str

An optional system prompt for the model.

context instance-attribute

context: str

An optional context to append to the instruction.

instruction instance-attribute

instruction: str

The input to the model.

response instance-attribute

response: str

The optional expected model response to the instruction.

system_prompt instance-attribute

system_prompt: str

An optional system prompt for the model.

InstructionSpecialStrings dataclass

Special string components of an instruction-tuning prompt.

An instance of this class is expected to be defined for each model to dictate the structure of its prompt.

Added in version 0.77.0. Renamed `SpecialStrings` to `InstructionSpecialStrings`.

Attributes:

Name Type Description
CONTEXT_START Final[str]

The delimiter between the instruction and the context.

INSTRUCTION_END Final[str]

The end of the instruction tag. The model is highly sensitive to this tag.

INSTRUCTION_START Final[str]

The start of the instruction. The model is highly sensitive to this tag.

SYSTEM_PROMPT_END Final[str]

The end of the system prompt.

SYSTEM_PROMPT_START Final[str]

The start of the system prompt.

CONTEXT_START instance-attribute

CONTEXT_START: Final[str]

The delimiter between the instruction and the context.

INSTRUCTION_END instance-attribute

INSTRUCTION_END: Final[str]

The end of the instruction tag. The model is highly sensitive to this tag.

INSTRUCTION_START instance-attribute

INSTRUCTION_START: Final[str]

The start of the instruction. The model is highly sensitive to this tag.

SYSTEM_PROMPT_END instance-attribute

SYSTEM_PROMPT_END: Final[str]

The end of the system prompt.

SYSTEM_PROMPT_START instance-attribute

SYSTEM_PROMPT_START: Final[str]

The start of the system prompt.

InstructionTokenizerMapper dataclass

Bases: TokenizerMapper, ABC

Tokenizes and builds the intermediate tensor components of an instruction prompt.

Classes:

Name Description
PromptTokens

Collection of all tokenized components of the prompt.

SchemaTokens

Tokenized intermediate prompt schema.

SpecialTokens

Tokenized special components of the prompt.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
always_include_context bool

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_strings InstructionSpecialStrings

The special prompt strings to use.

special_tokens SpecialTokens

The tokenized special prompt strings.

tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

always_include_context class-attribute instance-attribute

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_strings class-attribute instance-attribute

special_strings: InstructionSpecialStrings = field(
    init=False
)

The special prompt strings to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name Type Description
schema_tokens SchemaTokens

The tokenized schema components of the prompt.

special_tokens SpecialTokens

The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name Type Description
context Tensor

An optional context to append to the instruction.

instruction Tensor

The input to the model.

response Tensor

The expected model response to the instruction.

system_prompt Tensor

An optional system prompt for the model.

context instance-attribute

context: Tensor

An optional context to append to the instruction.

instruction instance-attribute

instruction: Tensor

The input to the model.

response instance-attribute

response: Tensor

The expected model response to the instruction.

system_prompt instance-attribute

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name Type Description
bos Tensor

The beginning of string token.

context_start Tensor

The delimiter between the instruction and the context.

eos Tensor

The end of string token.

instruction_end Tensor

The end of the instruction tag.

instruction_start Tensor

The start of the instruction tag.

system_prompt_end Tensor

The end of the system prompt.

system_prompt_start Tensor

The start of the system prompt.

bos instance-attribute

bos: Tensor

The beginning of string token.

context_start instance-attribute

context_start: Tensor

The delimiter between the instruction and the context.

eos instance-attribute

eos: Tensor

The end of string token.

instruction_end instance-attribute

instruction_end: Tensor

The end of the instruction tag.

instruction_start instance-attribute

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end instance-attribute

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start instance-attribute

system_prompt_start: Tensor

The start of the system prompt.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.

PreTrainFormatMapper dataclass

Builds the tensor components of the transformers.PreTrainedModel pretraining prompt.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

PreTrainSchemaMapper dataclass

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.

Either define a subclass for easier reuse, or use this class directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
... }
>>> mapper = PreTrainSchemaMapper(
...     instruction_key="question",
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
{'text': 'What is the capital of France?'}

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

Classes:

Name Description
Schema

Universal schema for building an LLM instruction prompt.

Attributes:

Name Type Description
instruction_key str

The dataset key/column corresponding to the input.

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

Schema

Bases: Schema

Universal schema for building an LLM instruction prompt.

Attributes:

Name Type Description
text str

The input to the model.

text instance-attribute

text: str

The input to the model.

PreTrainTokenizerMapper dataclass

Bases: TokenizerMapper

Tokenizes and builds the intermediate tensor components of a pretraining input which does not have a prompt.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

Classes:

Name Description
PromptTokens

Collection of all tokenized components of the prompt.

SchemaTokens

Tokenized intermediate prompt schema.

SpecialTokens

Tokenized special components of the prompt.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
special_tokens SpecialTokens

The tokenized special prompt strings.

tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

special_tokens class-attribute instance-attribute

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer instance-attribute

The LLM tokenizer to use.

PromptTokens

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name Type Description
schema_tokens SchemaTokens

The tokenized schema components of the prompt.

special_tokens SpecialTokens

The tokenized special components of the prompt.

schema_tokens instance-attribute

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens instance-attribute

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name Type Description
text Tensor

The input to the model.

text instance-attribute

text: Tensor

The input to the model.

SpecialTokens

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name Type Description
bos Tensor

The beginning of string token.

eos Tensor

The end of string token.

bos instance-attribute

bos: Tensor

The beginning of string token.

eos instance-attribute

eos: Tensor

The end of string token.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.

SchemaMapper dataclass

Bases: ABC

Maps samples from an arbitrary dataset to a universal schema for building an LLM prompt.

Added in version 0.77.0. Base class for `InstructionSchemaMapper` and `ChatSchemaMapper`.

Classes:

Name Description
Schema

Base schema for building an LLM prompt.

Attributes:

Name Type Description
instruction_key str

The dataset key/column corresponding to the input.

instruction_key instance-attribute

instruction_key: str

The dataset key/column corresponding to the input.

Schema

Bases: TypedDict

Base schema for building an LLM prompt.

TensorToListMapper dataclass

Maps a dictionary of int64 tensors to a dictionary of lists of int.

TestMapper dataclass

Formats the undifferentiated LlamaForCausalLM input for testing.

Added in version 0.77.0. Renamed `InstructionTestMapper` to `TestMapper`.

Classes:

Name Description
TestInput

Input for LlamaForCausalLM testing.

TestInput

Bases: TypedDict, Generic[ContainerT]

Input for LlamaForCausalLM testing.

Attributes:

Name Type Description
input_ids ContainerT

The input token ids.

labels ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

labels instance-attribute

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

TokenizerMapper dataclass

Bases: ABC

Tokenizes and builds the intermediate tensor components of a prompt.

Added in version 0.77.0. Base class for `InstructionTokenizerMapper` and `ChatTokenizerMapper`.

Methods:

Name Description
tokenize

Tokenize the text.

Attributes:

Name Type Description
tokenizer PreTrainedTokenizerBase

The LLM tokenizer to use.

tokenizer instance-attribute

The LLM tokenizer to use.

tokenize

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name Type Description Default

text

str

The text to tokenize.

required

Returns:

Type Description
torch.Tensor

An int64 tensor of token ids.

TrainMapper dataclass

Formats the undifferentiated transformers.PreTrainedModel input for training.

Added in version 0.77.0. Renamed `InstructionTrainMapper` to `TrainMapper`.

Classes:

Name Description
TrainInput

Input for transformers.PreTrainedModel training.

TrainInput

Bases: TypedDict, Generic[ContainerT]

Input for transformers.PreTrainedModel training.

Attributes:

Name Type Description
input_ids ContainerT

The input token ids.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

TransformLayerChatFormatMapper dataclass

Bases: TransformLayerFormatMapper, ChatFormatMapper

Builds the noise token mask for a chat prompt, which is required for training a TransformLayer.

Added in version 0.77.0.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper

TransformLayerFormatMapper dataclass

Base class for building noise token mask.

Parameters:

Name Type Description Default

transform_all_tokens

bool

Whether to to transform all the tokens, or only the instruction, context, and possibly the system prompt.

False

Added in version 0.77.0. Base class for `TransformLayerInstructionFormatMapper` and `TransformLayerChatFormatMapper`.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper

TransformLayerInstructionFormatMapper dataclass

Bases: TransformLayerFormatMapper, InstructionFormatMapper

Builds the noise token mask for a instruction prompt, which is required for training a TransformLayer.

Classes:

Name Description
PromptIndices

Indices of the prompt components in the input_ids tensor.

Methods:

Name Description
__call__

PromptIndices

Bases: TypedDict

Indices of the prompt components in the input_ids tensor.

Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.

Examples:

Using the PromptIndices to extract the instruction from the input_ids tensor:

>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
...     "special_tokens": {
...         "bos": torch.tensor([[1]]),
...         "instruction_start": torch.tensor([[2]]),
...         "system_prompt_start": torch.tensor([[3]]),
...         "system_prompt_end": torch.tensor([[4]]),
...         "context_start": torch.tensor([[5]]),
...         "instruction_end": torch.tensor([[6]]),
...         "eos": torch.tensor([[7]]),
...     },
...     "schema_tokens": {
...         "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
...         "response": torch.tensor([[13, 14, 15]]),
...         "system_prompt": torch.tensor([[16, 17, 18, 19]]),
...         "context": torch.tensor([[20, 21, 22]]),
...     },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
...     sample["schema_tokens"]["instruction"],
...     formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )

Attributes:

Name Type Description
context slice

The slice of input_ids containing the context.

instruction slice

The slice of input_ids containing the instruction.

system_prompt slice

The slice of input_ids containing the system prompt.

context instance-attribute

context: slice

The slice of input_ids containing the context.

instruction instance-attribute

instruction: slice

The slice of input_ids containing the instruction.

system_prompt instance-attribute

system_prompt: slice

The slice of input_ids containing the system prompt.

__call__

__call__(
    sample: PromptTokens,
) -> UndifferentiatedTransformLayerInput

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper

TransformLayerPreTrainFormatMapper dataclass

Bases: TransformLayerFormatMapper, PreTrainFormatMapper

Builds the noise token mask for a pretraining scenario which does not use a templated prompt, which is required for training a TransformLayer.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

TransformLayerTestMapper dataclass

Bases: TestMapper

Formats the undifferentiated InstructionTransformLayer input for testing.

Added in version 0.77.0. Renamed `TransformLayerInstructionTestMapper` to `TransformLayerTestMapper`.

Classes:

Name Description
TestInput

Input for LlamaForCausalLM testing.

TransformLayerTestInput

Input for InstructionTransformLayer testing.

TestInput

Bases: TypedDict, Generic[ContainerT]

Input for LlamaForCausalLM testing.

Attributes:

Name Type Description
input_ids ContainerT

The input token ids.

labels ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

labels instance-attribute

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

TransformLayerTestInput

Bases: TestInput[ContainerT]

Input for InstructionTransformLayer testing.

Attributes:

Name Type Description
input_ids ContainerT

The input token ids.

labels ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

noise_mask ContainerT

The mask that dictates which tokens in input_ids to obfuscate.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

labels instance-attribute

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

noise_mask instance-attribute

noise_mask: ContainerT

The mask that dictates which tokens in input_ids to obfuscate.

TransformLayerTrainMapper dataclass

Bases: TrainMapper

Formats the undifferentiated InstructionTransformLayer input for training.

Added in version 0.77.0. Renamed `TransformLayerInstructionTrainMapper` to `TransformLayerTrainMapper`.

Classes:

Name Description
TrainInput

Input for transformers.PreTrainedModel training.

TransformLayerTrainInput

Input for TransformLayer training.

Methods:

Name Description
__call__

Attributes:

Name Type Description
ignore_prompt_loss bool

Whether to ignore the loss on the prompt tokens.

ignore_prompt_loss class-attribute instance-attribute

ignore_prompt_loss: bool = True

Whether to ignore the loss on the prompt tokens.

TrainInput

Bases: TypedDict, Generic[ContainerT]

Input for transformers.PreTrainedModel training.

Attributes:

Name Type Description
input_ids ContainerT

The input token ids.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

TransformLayerTrainInput

Bases: TrainInput[ContainerT]

Input for TransformLayer training.

Attributes:

Name Type Description
input_ids ContainerT

The input token ids.

loss_mask ContainerT

The mask that dictates which tokens in input_ids to use to calculate the loss.

noise_mask ContainerT

The mask that dictates which tokens in input_ids to obfuscate.

input_ids instance-attribute

input_ids: ContainerT

The input token ids.

loss_mask instance-attribute

loss_mask: ContainerT

The mask that dictates which tokens in input_ids to use to calculate the loss.

noise_mask instance-attribute

noise_mask: ContainerT

The mask that dictates which tokens in input_ids to obfuscate.

__call__

__call__(
    sample: UndifferentiatedTransformLayerInput,
) -> TransformLayerTrainInput[torch.Tensor]

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

UndifferentiatedInput

Bases: TypedDict

Formatted input for the transformers.PreTrainedModel that must be further formatted into either training or testing input.

Must be further formatted based on if the model is being trained or evaluated.

Added in version 0.77.0. Renamed `InstructionFormatMapper.UndifferentiatedInstructionInput` to `UndifferentiatedInput`.

Attributes:

Name Type Description
input_ids Tensor

The input token ids.

response NotRequired[Tensor]

The expected model response to the input_ids.

input_ids instance-attribute

input_ids: Tensor

The input token ids.

response instance-attribute

response: NotRequired[Tensor]

The expected model response to the input_ids.

UndifferentiatedTransformLayerInput

Bases: UndifferentiatedInput

Formatted input for the TransformLayer that must be further formatted into either training or testing input.

Must be further formatted based on if the model is being trained or evaluated.

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Added in version 0.77.0. Renamed `TransformLayerInstructionFormatMapper.UndifferentiatedTransformLayerInstructionInput` to `UndifferentiatedTransformLayerInput`.

Attributes:

Name Type Description
input_ids Tensor

The input token ids.

noise_mask Tensor

The mask that dictates which tokens in input_ids to obfuscate.

response NotRequired[Tensor]

The expected model response to the input_ids.

input_ids instance-attribute

input_ids: Tensor

The input token ids.

noise_mask instance-attribute

noise_mask: Tensor

The mask that dictates which tokens in input_ids to obfuscate.

response instance-attribute

response: NotRequired[Tensor]

The expected model response to the input_ids.