
data_collator

Classes:

DataCollatorForStainedGlassSeq2Seq
    Collates batches of sequences for training a sequence-to-sequence model with StainedGlass.

DataCollatorForStainedGlassSeq2Seq dataclass

Bases: Generic[FeatureTypes_contra, InputTypes_co]

Collates batches of sequences for training a sequence-to-sequence model with StainedGlass.

Added in version 0.84.0.

Methods:

pad
    Pack a list or tuple of variable-length tensors into a single 2D tensor, padding with the given value.

Attributes:

max_length (int | None)
    The length to truncate the sequences to. If None, pads to the length of the longest sequence. If pad_to_multiple_of is set, must be a multiple of that value.

pad_to_multiple_of (int | None)
    If set, pads the sequences to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).

tokenizer (PreTrainedTokenizerBase)
    The tokenizer used to configure padding and truncation. Important attributes are pad_token_id, padding_side, and truncation_side.

max_length class-attribute instance-attribute

max_length: int | None = None

The length to truncate the sequences to. If None, the sequences are padded to the length of the longest sequence in the batch. If pad_to_multiple_of is set, max_length must be a multiple of that value.

pad_to_multiple_of class-attribute instance-attribute

pad_to_multiple_of: int | None = None

If set, will pad the sequences to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).

Workloads must use mixed precision to take advantage of Tensor Cores. Due to their design, Tensor Cores have shape constraints on their inputs. In practice, for mixed precision training, NVIDIA's recommendations are:

  1. Choose the mini-batch size to be a multiple of 8
  2. Choose linear layer dimensions to be a multiple of 8
  3. Choose convolution layer channel counts to be a multiple of 8
  4. For classification problems, pad the vocabulary size to be a multiple of 8
  5. For sequence problems, pad the sequence length to be a multiple of 8

See: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensor-core-shape.
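
The padding arithmetic itself is simple: round the longest sequence length in the batch up to the next multiple. A minimal sketch (the helper name and the example lengths are illustrative, not part of this API):

    def round_up_to_multiple(length: int, multiple: int) -> int:
        """Smallest multiple of `multiple` that is >= `length`."""
        return ((length + multiple - 1) // multiple) * multiple

    # With pad_to_multiple_of=8, a batch whose longest sequence has 13 tokens
    # is padded out to 16 positions rather than 13.
    lengths = [7, 13, 10]
    assert round_up_to_multiple(max(lengths), 8) == 16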

tokenizer instance-attribute

The tokenizer used to configure padding and truncation. Important attributes are pad_token_id, padding_side, and truncation_side.
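
A hedged construction sketch, assuming the dataclass can be instantiated directly from the three documented fields; the tokenizer choice and field values are illustrative, and the import path for DataCollatorForStainedGlassSeq2Seq is whatever your installation exposes for this data_collator module:

    from transformers import AutoTokenizer

    # Hypothetical import; adjust to the package that provides this module.
    # from <package>.data_collator import DataCollatorForStainedGlassSeq2Seq

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
    tokenizer.padding_side = "left"            # read by the collator via padding_side

    collator = DataCollatorForStainedGlassSeq2Seq(
        tokenizer=tokenizer,
        max_length=512,        # a multiple of pad_to_multiple_of, as required above
        pad_to_multiple_of=8,  # align padded lengths for Tensor Cores
    )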

pad

pad(
    sequences: list[Tensor] | tuple[Tensor, ...],
    padding_value: float,
) -> torch.Tensor

Pack a list or tuple of variable-length tensors into a single 2D tensor, padding with the given value.

Parameters:

sequences (list[Tensor] | tuple[Tensor, ...], required)
    The sequences to pad.

padding_value (float, required)
    The value to use for padding.

Returns:

torch.Tensor
    A 2D tensor of padded sequences.
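
A usage sketch for pad, continuing from the (assumed) collator and tokenizer built in the construction example above; the token ids and padding value are illustrative:

    import torch

    sequences = [
        torch.tensor([101, 7592, 2088, 102]),  # length 4
        torch.tensor([101, 2074, 102]),        # length 3
    ]

    padded = collator.pad(sequences, padding_value=tokenizer.pad_token_id)
    # padded is a 2D tensor with one row per input sequence; shorter rows are
    # filled with the padding value, so padded.shape[0] == len(sequences) and
    # padded.shape[1] is at least the longest sequence length (4 here).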

FeatureType

Bases: TypedDict

A dictionary that contains the tokenized input IDs, noise mask, and loss mask.

InputType

Bases: TypedDict

A dictionary that contains the tokenized input IDs, attention mask, noise mask, and loss mask.
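
A hedged sketch of what these two TypedDicts plausibly look like, inferred only from the one-line descriptions above; the key names and value types are assumptions, not taken from this page:

    from typing import TypedDict

    import torch

    class FeatureType(TypedDict):
        # Assumed keys, from "tokenized input IDs, noise mask, and loss mask".
        input_ids: list[int]
        noise_mask: list[bool]
        loss_mask: list[bool]

    class InputType(TypedDict):
        # Assumed keys; the collated batch additionally carries an attention mask.
        input_ids: torch.Tensor
        attention_mask: torch.Tensor
        noise_mask: torch.Tensor
        loss_mask: torch.Tensor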