universal

Model-agnostic Mapper classes (designed to be compatible with datasets.Dataset.map) useful for building LLM prompts for Stained Glass Transform training and testing.

Classes:

Name	Description
`ChatFormatMapper`	Builds the tensor components of the transformers.PreTrainedModel chat prompt.
`ChatRoleStrings`	Role strings of a chat prompt.
`ChatSchemaMapper`	Maps samples from an arbitrary dataset to a universal schema for building an LLM chat prompt.
`ChatSpecialStrings`	Special string components of a chat prompt.
`ChatTokenizerMapper`	Tokenizes and builds the intermediate tensor components of a chat prompt.
`InstructionFormatMapper`	Builds the tensor components of the transformers.PreTrainedModel instruction prompt.
`InstructionSchemaMapper`	Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.
`InstructionSpecialStrings`	Special string components of an instruction-tuning prompt.
`InstructionTokenizerMapper`	Tokenizes and builds the intermediate tensor components of an instruction prompt.
`PreTrainFormatMapper`	Builds the tensor components of the transformers.PreTrainedModel pretraining prompt.
`PreTrainSchemaMapper`	Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.
`PreTrainTokenizerMapper`	Tokenizes and builds the intermediate tensor components of a pretraining input which does not have a prompt.
`SchemaMapper`	Maps samples from an arbitrary dataset to a universal schema for building an LLM prompt.
`TensorToListMapper`	Maps a dictionary of int64 tensors to a dictionary of lists of `int`.
`TestMapper`	Formats the undifferentiated `LlamaForCausalLM` input for testing.
`TokenizerMapper`	Tokenizes and builds the intermediate tensor components of a prompt.
`TrainMapper`	Formats the undifferentiated transformers.PreTrainedModel input for training.
`TransformLayerChatFormatMapper`	Builds the noise token mask for a chat prompt, which is required for training a `TransformLayer`.
`TransformLayerFormatMapper`	Base class for building noise token mask.
`TransformLayerInstructionFormatMapper`	Builds the noise token mask for a instruction prompt, which is required for training a `TransformLayer`.
`TransformLayerPreTrainFormatMapper`	Builds the noise token mask for a pretraining scenario which does not use a templated prompt, which is required for training a
`TransformLayerTestMapper`	Formats the undifferentiated `InstructionTransformLayer` input for testing.
`TransformLayerTrainMapper`	Formats the undifferentiated `InstructionTransformLayer` input for training.
`UndifferentiatedInput`	Formatted input for the transformers.PreTrainedModel that must be further formatted into either training or testing input.
`UndifferentiatedTransformLayerInput`	Formatted input for the `TransformLayer` that must be further formatted into

ChatFormatMapper `dataclass` ¶

Builds the tensor components of the transformers.PreTrainedModel chat prompt.

Added in version 0.77.0.

ChatRoleStrings `dataclass` ¶

Role strings of a chat prompt.

Added in version 0.77.0.

Attributes:

Name	Type	Description
`ASSISTANT_ROLE`	`Final[str]`	The assistant role.
`SYSTEM_ROLE`	`Final[str]`	The system role.
`USER_ROLE`	`Final[str]`	The user role.

ASSISTANT_ROLE `class-attribute` `instance-attribute` ¶

ASSISTANT_ROLE: Final[str] = 'assistant'

The assistant role.

SYSTEM_ROLE `class-attribute` `instance-attribute` ¶

SYSTEM_ROLE: Final[str] = 'system'

The system role.

USER_ROLE `class-attribute` `instance-attribute` ¶

USER_ROLE: Final[str] = 'user'

The user role.

ChatSchemaMapper `dataclass` ¶

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM chat prompt.

Either define a subclass for easier reuse, or use this directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
...     "response": "Paris",
...     "system_prompt": "Answer the following question:",
... }
>>> mapper = ChatSchemaMapper(
...     instruction_key="question",
...     response_key="response",
...     system_prompt_key="system_prompt",
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
[{'role': 'system', 'content': 'Answer the following question:'}, {'role': 'user', 'content': 'What is the capital of France?'}, {'role': 'assistant', 'content': 'Paris'}]

Added in version 0.77.0.

Classes:

Name	Description
`Schema`	Universal schema for building an LLM chat prompt.

Attributes:

Name	Type	Description
`instruction_key`	`str`	The dataset key/column corresponding to the input.
`response_key`	`str \| None`	An optional dataset key/column corresponding to the expected model response to the instruction.
`system_prompt_key`	`str \| None`	An optional dataset key/column corresponding to the system prompt for the model.

instruction_key `instance-attribute` ¶

instruction_key: str

The dataset key/column corresponding to the input.

response_key `instance-attribute` ¶

response_key: str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key `instance-attribute` ¶

system_prompt_key: str | None

An optional dataset key/column corresponding to the system prompt for the model.

Schema ¶

Bases: Schema

Universal schema for building an LLM chat prompt.

Added in version 0.77.0.

Attributes:

Name	Type	Description
`content`	`str`	The content of the message.
`role`	`str`	The role of the message.

content `instance-attribute` ¶

content: str

The content of the message.

role `instance-attribute` ¶

role: str

The role of the message.

ChatSpecialStrings `dataclass` ¶

Special string components of a chat prompt.

An instance of this class is expected to be defined for each model to dictate the structure of its prompt.

Added in version 0.77.0.

Attributes:

Name	Type	Description
`MESSAGE_END`	`Final[str]`	The end of a message.
`ROLES`	`Final[ChatRoleStrings]`	The role strings of a chat prompt.
`ROLE_HEADER_END`	`Final[str]`	The end of a role header.
`ROLE_HEADER_START`	`Final[str]`	The start of a role header.

MESSAGE_END `instance-attribute` ¶

MESSAGE_END: Final[str]

The end of a message.

ROLES `instance-attribute` ¶

ROLES: Final[ChatRoleStrings]

The role strings of a chat prompt.

ROLE_HEADER_END `instance-attribute` ¶

ROLE_HEADER_END: Final[str]

The end of a role header.

ROLE_HEADER_START `instance-attribute` ¶

ROLE_HEADER_START: Final[str]

The start of a role header.

ChatTokenizerMapper `dataclass` ¶

Bases: TokenizerMapper, ABC

Tokenizes and builds the intermediate tensor components of a chat prompt.

Added in version 0.77.0.

Classes:

Name	Description
`PromptTokens`	Collection of all tokenized components of the prompt.
`SchemaTokens`	Tokenized intermediate prompt schema.
`SpecialTokens`	Tokenized special components of the prompt.

Methods:

Name	Description
`tokenize`	Tokenize the text.

Attributes:

Name	Type	Description
`special_strings`	`ChatSpecialStrings`	The special prompt strings to use.
`special_tokens`	`SpecialTokens`	The tokenized special prompt strings.
`tokenizer`	`PreTrainedTokenizerBase`	The LLM tokenizer to use.

special_strings `class-attribute` `instance-attribute` ¶

special_strings: ChatSpecialStrings = field(init=False)

The special prompt strings to use.

special_tokens `class-attribute` `instance-attribute` ¶

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer `instance-attribute` ¶

tokenizer: PreTrainedTokenizerBase

The LLM tokenizer to use.

PromptTokens ¶

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name	Type	Description
`schema_tokens`	`list[SchemaTokens]`	The tokenized schema components of the prompt.
`special_tokens`	`SpecialTokens`	The tokenized special components of the prompt.

schema_tokens `instance-attribute` ¶

schema_tokens: list[SchemaTokens]

The tokenized schema components of the prompt.

special_tokens `instance-attribute` ¶

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens ¶

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name	Type	Description
`content`	`Tensor`	The content of the message.
`role`	`Tensor`	The role of the message.

content `instance-attribute` ¶

content: Tensor

The content of the message.

role `instance-attribute` ¶

role: Tensor

The role of the message.

SpecialTokens ¶

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name	Type	Description
`assistant_role`	`Tensor`	The assistant role.
`bos`	`Tensor`	The beginning of string token.
`message_end`	`Tensor`	The end of a message.
`role_header_end`	`Tensor`	The end of the role header.
`role_header_start`	`Tensor`	The start of the role header.
`system_role`	`Tensor`	The system role.
`user_role`	`Tensor`	The user role.

assistant_role `instance-attribute` ¶

assistant_role: Tensor

The assistant role.

bos `instance-attribute` ¶

bos: Tensor

The beginning of string token.

message_end `instance-attribute` ¶

message_end: Tensor

The end of a message.

role_header_end `instance-attribute` ¶

role_header_end: Tensor

The end of the role header.

role_header_start `instance-attribute` ¶

role_header_start: Tensor

The start of the role header.

system_role `instance-attribute` ¶

system_role: Tensor

The system role.

user_role `instance-attribute` ¶

user_role: Tensor

The user role.

tokenize ¶

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name	Type	Description	Default
`text` ¶	`str`	The text to tokenize.	required

Returns:

Type	Description
`torch.Tensor`	An int64 tensor of token ids.

InstructionFormatMapper `dataclass` ¶

Builds the tensor components of the transformers.PreTrainedModel instruction prompt.

Classes:

Name	Description
`PromptIndices`	Indices of the prompt components in the `input_ids` tensor.

PromptIndices ¶

Bases: TypedDict

Indices of the prompt components in the input_ids tensor.

Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.

Examples:

Using the PromptIndices to extract the instruction from the input_ids tensor:

>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
...     "special_tokens": {
...         "bos": torch.tensor([[1]]),
...         "instruction_start": torch.tensor([[2]]),
...         "system_prompt_start": torch.tensor([[3]]),
...         "system_prompt_end": torch.tensor([[4]]),
...         "context_start": torch.tensor([[5]]),
...         "instruction_end": torch.tensor([[6]]),
...         "eos": torch.tensor([[7]]),
...     },
...     "schema_tokens": {
...         "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
...         "response": torch.tensor([[13, 14, 15]]),
...         "system_prompt": torch.tensor([[16, 17, 18, 19]]),
...         "context": torch.tensor([[20, 21, 22]]),
...     },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
...     sample["schema_tokens"]["instruction"],
...     formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )

Attributes:

Name	Type	Description
`context`	`slice`	The slice of `input_ids` containing the context.
`instruction`	`slice`	The slice of `input_ids` containing the instruction.
`system_prompt`	`slice`	The slice of `input_ids` containing the system prompt.

context `instance-attribute` ¶

context: slice

The slice of input_ids containing the context.

instruction `instance-attribute` ¶

instruction: slice

The slice of input_ids containing the instruction.

system_prompt `instance-attribute` ¶

system_prompt: slice

The slice of input_ids containing the system prompt.

InstructionSchemaMapper `dataclass` ¶

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.

Either define a subclass for easier reuse, or use this class directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
...     "response": "Paris",
...     "system_prompt": "Answer the following question:",
... }
>>> mapper = InstructionSchemaMapper(
...     instruction_key="question",
...     response_key="response",
...     system_prompt_key="system_prompt",
...     context_key=None,
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
{'instruction': 'What is the capital of France?', 'response': 'Paris', 'context': '', 'system_prompt': 'Answer the following question:'}

Classes:

Name	Description
`Schema`	Universal schema for building an LLM instruction prompt.

Attributes:

Name	Type	Description
`context_key`	`str \| None`	An optional dataset key/column corresponding to context to append to the instruction.
`instruction_key`	`str`	The dataset key/column corresponding to the input.
`response_key`	`str \| None`	An optional dataset key/column corresponding to the expected model response to the instruction.
`system_prompt_key`	`str \| None`	An optional dataset key/column corresponding to the system prompt for the model.

context_key `instance-attribute` ¶

context_key: str | None

An optional dataset key/column corresponding to context to append to the instruction.

instruction_key `instance-attribute` ¶

instruction_key: str

The dataset key/column corresponding to the input.

response_key `instance-attribute` ¶

response_key: str | None

An optional dataset key/column corresponding to the expected model response to the instruction.

system_prompt_key `instance-attribute` ¶

system_prompt_key: str | None

An optional dataset key/column corresponding to the system prompt for the model.

Schema ¶

Bases: Schema

Universal schema for building an LLM instruction prompt.

Added in version 0.77.0. Renamed `InstructionSchema` to `InstructionSchemaMapper.Schema`.

Attributes:

Name	Type	Description
`context`	`str`	An optional context to append to the instruction.
`instruction`	`str`	The input to the model.
`response`	`str`	The optional expected model response to the instruction.
`system_prompt`	`str`	An optional system prompt for the model.

context `instance-attribute` ¶

context: str

An optional context to append to the instruction.

instruction `instance-attribute` ¶

instruction: str

The input to the model.

response `instance-attribute` ¶

response: str

The optional expected model response to the instruction.

system_prompt `instance-attribute` ¶

system_prompt: str

An optional system prompt for the model.

InstructionSpecialStrings `dataclass` ¶

Special string components of an instruction-tuning prompt.

An instance of this class is expected to be defined for each model to dictate the structure of its prompt.

Added in version 0.77.0. Renamed `SpecialStrings` to `InstructionSpecialStrings`.

Attributes:

Name	Type	Description
`CONTEXT_START`	`Final[str]`	The delimiter between the instruction and the context.
`INSTRUCTION_END`	`Final[str]`	The end of the instruction tag. The model is highly sensitive to this tag.
`INSTRUCTION_START`	`Final[str]`	The start of the instruction. The model is highly sensitive to this tag.
`SYSTEM_PROMPT_END`	`Final[str]`	The end of the system prompt.
`SYSTEM_PROMPT_START`	`Final[str]`	The start of the system prompt.

CONTEXT_START `instance-attribute` ¶

CONTEXT_START: Final[str]

The delimiter between the instruction and the context.

INSTRUCTION_END `instance-attribute` ¶

INSTRUCTION_END: Final[str]

The end of the instruction tag. The model is highly sensitive to this tag.

INSTRUCTION_START `instance-attribute` ¶

INSTRUCTION_START: Final[str]

The start of the instruction. The model is highly sensitive to this tag.

SYSTEM_PROMPT_END `instance-attribute` ¶

SYSTEM_PROMPT_END: Final[str]

The end of the system prompt.

SYSTEM_PROMPT_START `instance-attribute` ¶

SYSTEM_PROMPT_START: Final[str]

The start of the system prompt.

InstructionTokenizerMapper `dataclass` ¶

Bases: TokenizerMapper, ABC

Tokenizes and builds the intermediate tensor components of an instruction prompt.

Classes:

Name	Description
`PromptTokens`	Collection of all tokenized components of the prompt.
`SchemaTokens`	Tokenized intermediate prompt schema.
`SpecialTokens`	Tokenized special components of the prompt.

Methods:

Name	Description
`tokenize`	Tokenize the text.

Attributes:

Name	Type	Description
`always_include_context`	`bool`	Whether to always include the start of context tokens in the prompt, even if no context is provided.
`special_strings`	`InstructionSpecialStrings`	The special prompt strings to use.
`special_tokens`	`SpecialTokens`	The tokenized special prompt strings.
`tokenizer`	`PreTrainedTokenizerBase`	The LLM tokenizer to use.

always_include_context `class-attribute` `instance-attribute` ¶

always_include_context: bool = False

Whether to always include the start of context tokens in the prompt, even if no context is provided.

special_strings `class-attribute` `instance-attribute` ¶

special_strings: InstructionSpecialStrings = field(
    init=False
)

The special prompt strings to use.

special_tokens `class-attribute` `instance-attribute` ¶

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer `instance-attribute` ¶

tokenizer: PreTrainedTokenizerBase

The LLM tokenizer to use.

PromptTokens ¶

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name	Type	Description
`schema_tokens`	`SchemaTokens`	The tokenized schema components of the prompt.
`special_tokens`	`SpecialTokens`	The tokenized special components of the prompt.

schema_tokens `instance-attribute` ¶

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens `instance-attribute` ¶

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens ¶

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name	Type	Description
`context`	`Tensor`	An optional context to append to the instruction.
`instruction`	`Tensor`	The input to the model.
`response`	`Tensor`	The expected model response to the instruction.
`system_prompt`	`Tensor`	An optional system prompt for the model.

context `instance-attribute` ¶

context: Tensor

An optional context to append to the instruction.

instruction `instance-attribute` ¶

instruction: Tensor

The input to the model.

response `instance-attribute` ¶

response: Tensor

The expected model response to the instruction.

system_prompt `instance-attribute` ¶

system_prompt: Tensor

An optional system prompt for the model.

SpecialTokens ¶

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name	Type	Description
`bos`	`Tensor`	The beginning of string token.
`context_start`	`Tensor`	The delimiter between the instruction and the context.
`eos`	`Tensor`	The end of string token.
`instruction_end`	`Tensor`	The end of the instruction tag.
`instruction_start`	`Tensor`	The start of the instruction tag.
`system_prompt_end`	`Tensor`	The end of the system prompt.
`system_prompt_start`	`Tensor`	The start of the system prompt.

bos `instance-attribute` ¶

bos: Tensor

The beginning of string token.

context_start `instance-attribute` ¶

context_start: Tensor

The delimiter between the instruction and the context.

eos `instance-attribute` ¶

eos: Tensor

The end of string token.

instruction_end `instance-attribute` ¶

instruction_end: Tensor

The end of the instruction tag.

instruction_start `instance-attribute` ¶

instruction_start: Tensor

The start of the instruction tag.

system_prompt_end `instance-attribute` ¶

system_prompt_end: Tensor

The end of the system prompt.

system_prompt_start `instance-attribute` ¶

system_prompt_start: Tensor

The start of the system prompt.

tokenize ¶

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name	Type	Description	Default
`text` ¶	`str`	The text to tokenize.	required

Returns:

Type	Description
`torch.Tensor`	An int64 tensor of token ids.

PreTrainFormatMapper `dataclass` ¶

Builds the tensor components of the transformers.PreTrainedModel pretraining prompt.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

PreTrainSchemaMapper `dataclass` ¶

Bases: SchemaMapper

Maps samples from an arbitrary dataset to a universal schema for building an LLM instruction prompt.

Either define a subclass for easier reuse, or use this class directly.

Examples:

>>> sample = {
...     "question": "What is the capital of France?",
... }
>>> mapper = PreTrainSchemaMapper(
...     instruction_key="question",
... )
>>> mapped_sample = mapper(sample)
>>> mapped_sample
{'text': 'What is the capital of France?'}

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

Classes:

Name	Description
`Schema`	Universal schema for building an LLM instruction prompt.

Attributes:

Name	Type	Description
`instruction_key`	`str`	The dataset key/column corresponding to the input.

instruction_key `instance-attribute` ¶

instruction_key: str

The dataset key/column corresponding to the input.

Schema ¶

Bases: Schema

Universal schema for building an LLM instruction prompt.

Attributes:

Name	Type	Description
`text`	`str`	The input to the model.

text `instance-attribute` ¶

text: str

The input to the model.

PreTrainTokenizerMapper `dataclass` ¶

Bases: TokenizerMapper

Tokenizes and builds the intermediate tensor components of a pretraining input which does not have a prompt.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

Classes:

Name	Description
`PromptTokens`	Collection of all tokenized components of the prompt.
`SchemaTokens`	Tokenized intermediate prompt schema.
`SpecialTokens`	Tokenized special components of the prompt.

Methods:

Name	Description
`tokenize`	Tokenize the text.

Attributes:

Name	Type	Description
`special_tokens`	`SpecialTokens`	The tokenized special prompt strings.
`tokenizer`	`PreTrainedTokenizerBase`	The LLM tokenizer to use.

special_tokens `class-attribute` `instance-attribute` ¶

special_tokens: SpecialTokens = field(init=False)

The tokenized special prompt strings.

tokenizer `instance-attribute` ¶

tokenizer: PreTrainedTokenizerBase

The LLM tokenizer to use.

PromptTokens ¶

Bases: TypedDict

Collection of all tokenized components of the prompt.

Attributes:

Name	Type	Description
`schema_tokens`	`SchemaTokens`	The tokenized schema components of the prompt.
`special_tokens`	`SpecialTokens`	The tokenized special components of the prompt.

schema_tokens `instance-attribute` ¶

schema_tokens: SchemaTokens

The tokenized schema components of the prompt.

special_tokens `instance-attribute` ¶

special_tokens: SpecialTokens

The tokenized special components of the prompt.

SchemaTokens ¶

Bases: TypedDict

Tokenized intermediate prompt schema.

Attributes:

Name	Type	Description
`text`	`Tensor`	The input to the model.

text `instance-attribute` ¶

text: Tensor

The input to the model.

SpecialTokens ¶

Bases: TypedDict

Tokenized special components of the prompt.

Attributes:

Name	Type	Description
`bos`	`Tensor`	The beginning of string token.
`eos`	`Tensor`	The end of string token.

bos `instance-attribute` ¶

bos: Tensor

The beginning of string token.

eos `instance-attribute` ¶

eos: Tensor

The end of string token.

tokenize ¶

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name	Type	Description	Default
`text` ¶	`str`	The text to tokenize.	required

Returns:

Type	Description
`torch.Tensor`	An int64 tensor of token ids.

SchemaMapper `dataclass` ¶

Bases: ABC

Maps samples from an arbitrary dataset to a universal schema for building an LLM prompt.

Added in version 0.77.0. Base class for `InstructionSchemaMapper` and `ChatSchemaMapper`.

Classes:

Name	Description
`Schema`	Base schema for building an LLM prompt.

Attributes:

Name	Type	Description
`instruction_key`	`str`	The dataset key/column corresponding to the input.

instruction_key `instance-attribute` ¶

instruction_key: str

The dataset key/column corresponding to the input.

Schema ¶

Bases: TypedDict

Base schema for building an LLM prompt.

TensorToListMapper `dataclass` ¶

Maps a dictionary of int64 tensors to a dictionary of lists of int.

TestMapper `dataclass` ¶

Formats the undifferentiated LlamaForCausalLM input for testing.

Added in version 0.77.0. Renamed `InstructionTestMapper` to `TestMapper`.

Classes:

Name	Description
`TestInput`	Input for `LlamaForCausalLM` testing.

TestInput ¶

Bases: TypedDict, Generic[ContainerT]

Input for LlamaForCausalLM testing.

Attributes:

Name	Type	Description
`input_ids`	`ContainerT`	The input token ids.
`labels`	`ContainerT`	The expected model response to the `input_ids`. When pretraining, the `input_ids` are used as the labels.

input_ids `instance-attribute` ¶

input_ids: ContainerT

The input token ids.

labels `instance-attribute` ¶

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

TokenizerMapper `dataclass` ¶

Bases: ABC

Tokenizes and builds the intermediate tensor components of a prompt.

Added in version 0.77.0. Base class for `InstructionTokenizerMapper` and `ChatTokenizerMapper`.

Methods:

Name	Description
`tokenize`	Tokenize the text.

Attributes:

Name	Type	Description
`tokenizer`	`PreTrainedTokenizerBase`	The LLM tokenizer to use.

tokenizer `instance-attribute` ¶

tokenizer: PreTrainedTokenizerBase

The LLM tokenizer to use.

tokenize ¶

tokenize(text: str) -> torch.Tensor

Tokenize the text.

Parameters:

Name	Type	Description	Default
`text` ¶	`str`	The text to tokenize.	required

Returns:

Type	Description
`torch.Tensor`	An int64 tensor of token ids.

TrainMapper `dataclass` ¶

Formats the undifferentiated transformers.PreTrainedModel input for training.

Added in version 0.77.0. Renamed `InstructionTrainMapper` to `TrainMapper`.

Classes:

Name	Description
`TrainInput`	Input for transformers.PreTrainedModel training.

TrainInput ¶

Bases: TypedDict, Generic[ContainerT]

Input for transformers.PreTrainedModel training.

Attributes:

Name	Type	Description
`input_ids`	`ContainerT`	The input token ids.

input_ids `instance-attribute` ¶

input_ids: ContainerT

The input token ids.

TransformLayerChatFormatMapper `dataclass` ¶

Bases: TransformLayerFormatMapper, ChatFormatMapper

Builds the noise token mask for a chat prompt, which is required for training a TransformLayer.

Added in version 0.77.0.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper

TransformLayerFormatMapper `dataclass` ¶

Base class for building noise token mask.

Parameters:

Name	Type	Description	Default
`transform_all_tokens` ¶	`bool`	Whether to to transform all the tokens, or only the instruction, context, and possibly the system prompt.	`False`

Added in version 0.77.0. Base class for `TransformLayerInstructionFormatMapper` and `TransformLayerChatFormatMapper`.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper

TransformLayerInstructionFormatMapper `dataclass` ¶

Bases: TransformLayerFormatMapper, InstructionFormatMapper

Builds the noise token mask for a instruction prompt, which is required for training a TransformLayer.

Classes:

Name	Description
`PromptIndices`	Indices of the prompt components in the `input_ids` tensor.

Methods:

Name	Description
`__call__`

PromptIndices ¶

Bases: TypedDict

Indices of the prompt components in the input_ids tensor.

Can be used to extract the prompt components from the input_ids tensor by slicing along the sequence dimension.

Examples:

Using the PromptIndices to extract the instruction from the input_ids tensor:

>>> mapper = InstructionFormatMapper()
>>> sample: universal.InstructionTokenizerMapper.PromptTokens = {
...     "special_tokens": {
...         "bos": torch.tensor([[1]]),
...         "instruction_start": torch.tensor([[2]]),
...         "system_prompt_start": torch.tensor([[3]]),
...         "system_prompt_end": torch.tensor([[4]]),
...         "context_start": torch.tensor([[5]]),
...         "instruction_end": torch.tensor([[6]]),
...         "eos": torch.tensor([[7]]),
...     },
...     "schema_tokens": {
...         "instruction": torch.tensor([[8, 9, 10, 11, 12]]),
...         "response": torch.tensor([[13, 14, 15]]),
...         "system_prompt": torch.tensor([[16, 17, 18, 19]]),
...         "context": torch.tensor([[20, 21, 22]]),
...     },
... }
>>> formatted_sample = mapper(sample)
>>> torch.testing.assert_close(
...     sample["schema_tokens"]["instruction"],
...     formatted_sample["input_ids"][:, mapper.prompt_indices["instruction"]],
... )

Attributes:

Name	Type	Description
`context`	`slice`	The slice of `input_ids` containing the context.
`instruction`	`slice`	The slice of `input_ids` containing the instruction.
`system_prompt`	`slice`	The slice of `input_ids` containing the system prompt.

context `instance-attribute` ¶

context: slice

The slice of input_ids containing the context.

instruction `instance-attribute` ¶

instruction: slice

The slice of input_ids containing the instruction.

system_prompt `instance-attribute` ¶

system_prompt: slice

The slice of input_ids containing the system prompt.

call ¶

__call__(
    sample: PromptTokens,
) -> UndifferentiatedTransformLayerInput

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Changed in version 0.100.0: Removed the option of passing `obfuscate_system_prompt` to the TokenizerWrapper

TransformLayerPreTrainFormatMapper `dataclass` ¶

Bases: TransformLayerFormatMapper, PreTrainFormatMapper

Builds the noise token mask for a pretraining scenario which does not use a templated prompt, which is required for training a TransformLayer.

Added in version 0.77.0. Added support for pretraining which does not use a prompt template.

TransformLayerTestMapper `dataclass` ¶

Bases: TestMapper

Formats the undifferentiated InstructionTransformLayer input for testing.

Added in version 0.77.0. Renamed `TransformLayerInstructionTestMapper` to `TransformLayerTestMapper`.

Classes:

Name	Description
`TestInput`	Input for `LlamaForCausalLM` testing.
`TransformLayerTestInput`	Input for `InstructionTransformLayer` testing.

TestInput ¶

Bases: TypedDict, Generic[ContainerT]

Input for LlamaForCausalLM testing.

Attributes:

Name	Type	Description
`input_ids`	`ContainerT`	The input token ids.
`labels`	`ContainerT`	The expected model response to the `input_ids`. When pretraining, the `input_ids` are used as the labels.

input_ids `instance-attribute` ¶

input_ids: ContainerT

The input token ids.

labels `instance-attribute` ¶

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

TransformLayerTestInput ¶

Bases: TestInput[ContainerT]

Input for InstructionTransformLayer testing.

Attributes:

Name	Type	Description
`input_ids`	`ContainerT`	The input token ids.
`labels`	`ContainerT`	The expected model response to the `input_ids`. When pretraining, the `input_ids` are used as the labels.
`noise_mask`	`ContainerT`	The mask that dictates which tokens in `input_ids` to obfuscate.

input_ids `instance-attribute` ¶

input_ids: ContainerT

The input token ids.

labels `instance-attribute` ¶

labels: ContainerT

The expected model response to the input_ids. When pretraining, the input_ids are used as the labels.

noise_mask `instance-attribute` ¶

noise_mask: ContainerT

The mask that dictates which tokens in input_ids to obfuscate.

TransformLayerTrainMapper `dataclass` ¶

Bases: TrainMapper

Formats the undifferentiated InstructionTransformLayer input for training.

Added in version 0.77.0. Renamed `TransformLayerInstructionTrainMapper` to `TransformLayerTrainMapper`.

Classes:

Name	Description
`TrainInput`	Input for transformers.PreTrainedModel training.
`TransformLayerTrainInput`	Input for `TransformLayer` training.

Methods:

Name	Description
`__call__`

Attributes:

Name	Type	Description
`ignore_prompt_loss`	`bool`	Whether to ignore the loss on the prompt tokens.

ignore_prompt_loss `class-attribute` `instance-attribute` ¶

ignore_prompt_loss: bool = True

Whether to ignore the loss on the prompt tokens.

TrainInput ¶

Bases: TypedDict, Generic[ContainerT]

Input for transformers.PreTrainedModel training.

Attributes:

Name	Type	Description
`input_ids`	`ContainerT`	The input token ids.

input_ids `instance-attribute` ¶

input_ids: ContainerT

The input token ids.

TransformLayerTrainInput ¶

Bases: TrainInput[ContainerT]

Input for TransformLayer training.

Attributes:

Name	Type	Description
`input_ids`	`ContainerT`	The input token ids.
`loss_mask`	`ContainerT`	The mask that dictates which tokens in `input_ids` to use to calculate the loss.
`noise_mask`	`ContainerT`	The mask that dictates which tokens in `input_ids` to obfuscate.

input_ids `instance-attribute` ¶

input_ids: ContainerT

The input token ids.

loss_mask `instance-attribute` ¶

loss_mask: ContainerT

The mask that dictates which tokens in input_ids to use to calculate the loss.

noise_mask `instance-attribute` ¶

noise_mask: ContainerT

The mask that dictates which tokens in input_ids to obfuscate.

call ¶

__call__(
    sample: UndifferentiatedTransformLayerInput,
) -> TransformLayerTrainInput[torch.Tensor]

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

UndifferentiatedInput ¶

Bases: TypedDict

Formatted input for the transformers.PreTrainedModel that must be further formatted into either training or testing input.

Must be further formatted based on if the model is being trained or evaluated.

Added in version 0.77.0. Renamed `InstructionFormatMapper.UndifferentiatedInstructionInput` to `UndifferentiatedInput`.

Attributes:

Name	Type	Description
`input_ids`	`Tensor`	The input token ids.
`response`	`NotRequired[Tensor]`	The expected model response to the `input_ids`.

input_ids `instance-attribute` ¶

input_ids: Tensor

The input token ids.

response `instance-attribute` ¶

response: NotRequired[Tensor]

The expected model response to the input_ids.

UndifferentiatedTransformLayerInput ¶

Bases: UndifferentiatedInput

Formatted input for the TransformLayer that must be further formatted into either training or testing input.

Must be further formatted based on if the model is being trained or evaluated.

Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.

Added in version 0.77.0. Renamed `TransformLayerInstructionFormatMapper.UndifferentiatedTransformLayerInstructionInput` to `UndifferentiatedTransformLayerInput`.

Attributes:

Name	Type	Description
`input_ids`	`Tensor`	The input token ids.
`noise_mask`	`Tensor`	The mask that dictates which tokens in `input_ids` to obfuscate.
`response`	`NotRequired[Tensor]`	The expected model response to the `input_ids`.

input_ids `instance-attribute` ¶

input_ids: Tensor

The input token ids.

noise_mask `instance-attribute` ¶

noise_mask: Tensor

The mask that dictates which tokens in input_ids to obfuscate.

response `instance-attribute` ¶

response: NotRequired[Tensor]

The expected model response to the input_ids.

universal

ChatFormatMapper dataclass ¶

ChatRoleStrings dataclass ¶

ASSISTANT_ROLE class-attribute instance-attribute ¶

SYSTEM_ROLE class-attribute instance-attribute ¶

USER_ROLE class-attribute instance-attribute ¶

ChatSchemaMapper dataclass ¶

instruction_key instance-attribute ¶

response_key instance-attribute ¶

system_prompt_key instance-attribute ¶

Schema ¶

content instance-attribute ¶

role instance-attribute ¶

ChatSpecialStrings dataclass ¶

MESSAGE_END instance-attribute ¶

ROLES instance-attribute ¶

ROLE_HEADER_END instance-attribute ¶

ROLE_HEADER_START instance-attribute ¶

ChatTokenizerMapper dataclass ¶

special_strings class-attribute instance-attribute ¶

special_tokens class-attribute instance-attribute ¶

tokenizer instance-attribute ¶

PromptTokens ¶

schema_tokens instance-attribute ¶

special_tokens instance-attribute ¶

SchemaTokens ¶

content instance-attribute ¶

role instance-attribute ¶

SpecialTokens ¶

assistant_role instance-attribute ¶

bos instance-attribute ¶

message_end instance-attribute ¶

role_header_end instance-attribute ¶

role_header_start instance-attribute ¶

system_role instance-attribute ¶

user_role instance-attribute ¶

tokenize ¶

text ¶

InstructionFormatMapper dataclass ¶

PromptIndices ¶

context instance-attribute ¶

instruction instance-attribute ¶

system_prompt instance-attribute ¶

InstructionSchemaMapper dataclass ¶

context_key instance-attribute ¶

instruction_key instance-attribute ¶

response_key instance-attribute ¶

system_prompt_key instance-attribute ¶

Schema ¶

context instance-attribute ¶

instruction instance-attribute ¶

response instance-attribute ¶

system_prompt instance-attribute ¶

InstructionSpecialStrings dataclass ¶

CONTEXT_START instance-attribute ¶

INSTRUCTION_END instance-attribute ¶

INSTRUCTION_START instance-attribute ¶

SYSTEM_PROMPT_END instance-attribute ¶

SYSTEM_PROMPT_START instance-attribute ¶

InstructionTokenizerMapper dataclass ¶

always_include_context class-attribute instance-attribute ¶

special_strings class-attribute instance-attribute ¶

special_tokens class-attribute instance-attribute ¶

tokenizer instance-attribute ¶

PromptTokens ¶

schema_tokens instance-attribute ¶

special_tokens instance-attribute ¶

SchemaTokens ¶

context instance-attribute ¶

instruction instance-attribute ¶

response instance-attribute ¶

system_prompt instance-attribute ¶

SpecialTokens ¶

bos instance-attribute ¶

context_start instance-attribute ¶

eos instance-attribute ¶

instruction_end instance-attribute ¶

instruction_start instance-attribute ¶

system_prompt_end instance-attribute ¶

system_prompt_start instance-attribute ¶

ChatFormatMapper `dataclass` ¶

ChatRoleStrings `dataclass` ¶

ASSISTANT_ROLE `class-attribute` `instance-attribute` ¶

SYSTEM_ROLE `class-attribute` `instance-attribute` ¶

USER_ROLE `class-attribute` `instance-attribute` ¶

ChatSchemaMapper `dataclass` ¶

instruction_key `instance-attribute` ¶

response_key `instance-attribute` ¶

system_prompt_key `instance-attribute` ¶

content `instance-attribute` ¶

role `instance-attribute` ¶

ChatSpecialStrings `dataclass` ¶

MESSAGE_END `instance-attribute` ¶

ROLES `instance-attribute` ¶

ROLE_HEADER_END `instance-attribute` ¶

ROLE_HEADER_START `instance-attribute` ¶

ChatTokenizerMapper `dataclass` ¶

special_strings `class-attribute` `instance-attribute` ¶

special_tokens `class-attribute` `instance-attribute` ¶

tokenizer `instance-attribute` ¶

schema_tokens `instance-attribute` ¶

special_tokens `instance-attribute` ¶

content `instance-attribute` ¶

role `instance-attribute` ¶

assistant_role `instance-attribute` ¶

bos `instance-attribute` ¶

message_end `instance-attribute` ¶

role_header_end `instance-attribute` ¶

role_header_start `instance-attribute` ¶

system_role `instance-attribute` ¶

user_role `instance-attribute` ¶

`text` ¶

InstructionFormatMapper `dataclass` ¶

context `instance-attribute` ¶

instruction `instance-attribute` ¶

system_prompt `instance-attribute` ¶

InstructionSchemaMapper `dataclass` ¶

context_key `instance-attribute` ¶

instruction_key `instance-attribute` ¶

response_key `instance-attribute` ¶

system_prompt_key `instance-attribute` ¶

context `instance-attribute` ¶

instruction `instance-attribute` ¶

response `instance-attribute` ¶

system_prompt `instance-attribute` ¶

InstructionSpecialStrings `dataclass` ¶

CONTEXT_START `instance-attribute` ¶

INSTRUCTION_END `instance-attribute` ¶

INSTRUCTION_START `instance-attribute` ¶

SYSTEM_PROMPT_END `instance-attribute` ¶

SYSTEM_PROMPT_START `instance-attribute` ¶

InstructionTokenizerMapper `dataclass` ¶

always_include_context `class-attribute` `instance-attribute` ¶

special_strings `class-attribute` `instance-attribute` ¶

special_tokens `class-attribute` `instance-attribute` ¶

tokenizer `instance-attribute` ¶

schema_tokens `instance-attribute` ¶

special_tokens `instance-attribute` ¶

context `instance-attribute` ¶

instruction `instance-attribute` ¶

response `instance-attribute` ¶

system_prompt `instance-attribute` ¶

bos `instance-attribute` ¶

context_start `instance-attribute` ¶

eos `instance-attribute` ¶

instruction_end `instance-attribute` ¶

instruction_start `instance-attribute` ¶

system_prompt_end `instance-attribute` ¶

system_prompt_start `instance-attribute` ¶

`text` ¶

PreTrainFormatMapper `dataclass` ¶

PreTrainSchemaMapper `dataclass` ¶

instruction_key `instance-attribute` ¶

text `instance-attribute` ¶

PreTrainTokenizerMapper `dataclass` ¶

special_tokens `class-attribute` `instance-attribute` ¶

tokenizer `instance-attribute` ¶

schema_tokens `instance-attribute` ¶

special_tokens `instance-attribute` ¶

text `instance-attribute` ¶