
data_collator

Classes:

DataCollatorForStainedGlassSeq2Seq
    Collates batches of sequences for training a sequence-to-sequence model with StainedGlass.

DataCollatorForStainedGlassSeq2Seq dataclass

Bases: Generic[FeatureTypes_contra, InputTypes_co]

Collates batches of sequences for training a sequence-to-sequence model with StainedGlass.

Added in version 0.84.0.

Methods:

pad
    Pack a list or tuple of variable-length tensors into a single 2D tensor, padding with the given value.

Attributes:

max_length (int | None)
    The length to truncate the sequences to. If None, pads to the length of the longest sequence. If pad_to_multiple_of is set, must be a multiple of that value.

pad_to_multiple_of (int | None)
    If set, pads the sequences to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).

tokenizer (PreTrainedTokenizerBase)
    The tokenizer used to configure padding and truncation. Important attributes are pad_token_id, padding_side, and truncation_side.

max_length class-attribute instance-attribute

max_length: int | None = None

The length to truncate the sequences to. If None, the sequences are padded to the length of the longest sequence in the batch. If pad_to_multiple_of is set, max_length must be a multiple of that value.

pad_to_multiple_of class-attribute instance-attribute

pad_to_multiple_of: int | None = None

If set, will pad the sequences to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).

Workloads must use mixed precision to take advantage of Tensor Cores. Due to their design, Tensor Cores have shape constraints on their inputs. In practice, for mixed precision training, NVIDIA's recommendations are:

  1. Choose the mini-batch size to be a multiple of 8
  2. Choose linear layer dimensions to be a multiple of 8
  3. Choose convolution layer channel counts to be a multiple of 8
  4. For classification problems, pad the vocabulary size to be a multiple of 8
  5. For sequence problems, pad the sequence length to be a multiple of 8

See: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensor-core-shape.
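
The padding arithmetic itself is simple: round the longest sequence length in the batch up to the next multiple. A minimal sketch (the helper name and the example lengths are illustrative, not part of this API):

    def round_up_to_multiple(length: int, multiple: int) -> int:
        """Smallest multiple of `multiple` that is >= `length`."""
        return ((length + multiple - 1) // multiple) * multiple

    # With pad_to_multiple_of=8, a batch whose longest sequence has 13 tokens
    # is padded out to 16 positions rather than 13.
    lengths = [7, 13, 10]
    assert round_up_to_multiple(max(lengths), 8) == 16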

tokenizer instance-attribute

The tokenizer used to configure padding and truncation. Important attributes are pad_token_id, padding_side, and truncation_side.
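
A hedged construction sketch, assuming the dataclass can be instantiated directly from the three documented fields; the tokenizer choice and field values are illustrative, and the import path for DataCollatorForStainedGlassSeq2Seq is whatever your installation exposes for this data_collator module:

    from transformers import AutoTokenizer

    # Hypothetical import; adjust to the package that provides this module.
    # from <package>.data_collator import DataCollatorForStainedGlassSeq2Seq

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
    tokenizer.padding_side = "left"            # read by the collator via padding_side

    collator = DataCollatorForStainedGlassSeq2Seq(
        tokenizer=tokenizer,
        max_length=512,        # a multiple of pad_to_multiple_of, as required above
        pad_to_multiple_of=8,  # align padded lengths for Tensor Cores
    )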

pad

pad(
    sequences: list[Tensor] | tuple[Tensor, ...],
    padding_value: float,
) -> torch.Tensor

Pack a list or tuple of variable-length tensors into a single 2D tensor, padding with the given value.

Parameters:

sequences (list[Tensor] | tuple[Tensor, ...], required)
    The sequences to pad.

padding_value (float, required)
    The value to use for padding.

Returns:

torch.Tensor
    A 2D tensor of padded sequences.
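
A usage sketch for pad, continuing from the (assumed) collator and tokenizer built in the construction example above; the token ids and padding value are illustrative:

    import torch

    sequences = [
        torch.tensor([101, 7592, 2088, 102]),  # length 4
        torch.tensor([101, 2074, 102]),        # length 3
    ]

    padded = collator.pad(sequences, padding_value=tokenizer.pad_token_id)
    # padded is a 2D tensor with one row per input sequence; shorter rows are
    # filled with the padding value, so padded.shape[0] == len(sequences) and
    # padded.shape[1] is at least the longest sequence length (4 here).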

FeatureType

Bases: TypedDict

A dictionary that contains the tokenized input IDs, noise mask, and loss mask.

InputType

Bases: TypedDict

A dictionary that contains the tokenized input IDs, attention mask, noise mask, and loss mask.
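
A hedged sketch of what these two TypedDicts plausibly look like, inferred only from the one-line descriptions above; the key names and value types are assumptions, not taken from this page:

    from typing import TypedDict

    import torch

    class FeatureType(TypedDict):
        # Assumed keys, from "tokenized input IDs, noise mask, and loss mask".
        input_ids: list[int]
        noise_mask: list[bool]
        loss_mask: list[bool]

    class InputType(TypedDict):
        # Assumed keys; the collated batch additionally carries an attention mask.
        input_ids: torch.Tensor
        attention_mask: torch.Tensor
        noise_mask: torch.Tensor
        loss_mask: torch.Tensor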