data_collator
Classes:
Name | Description |
---|---|
DataCollatorForStainedGlassSeq2Seq | Collates batches of sequences for training a sequence-to-sequence model with StainedGlass. |
DataCollatorForStainedGlassSeq2Seq
dataclass
Bases: Generic[FeatureTypes_contra, InputTypes_co]
Collates batches of sequences for training a sequence-to-sequence model with StainedGlass.
Added in version 0.84.0.
Methods:
Name | Description |
---|---|
pad | Pack a list or tuple of variable length tensors into a single 2D tensor, padding with the given value. |
Attributes:
Name | Type | Description |
---|---|---|
max_length | int \| None | The length to truncate the sequences to. |
pad_to_multiple_of | int \| None | If set, will pad the sequences to a multiple of the provided value. |
tokenizer | PreTrainedTokenizerBase | The tokenizer to use to configure padding and truncation. |
max_length
class-attribute
instance-attribute
max_length: int | None = None
The length to truncate the sequences to. If None, sequences are padded to the length of the longest sequence. If pad_to_multiple_of is set, max_length must be a multiple of that value.
pad_to_multiple_of
class-attribute
instance-attribute
pad_to_multiple_of: int | None = None
If set, pads the sequences to a multiple of the provided value. This is especially useful for enabling Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
Workloads must use mixed precision to take advantage of Tensor Cores. Due to their design, Tensor Cores have shape constraints on their inputs. In practice, for mixed precision training, NVIDIA's recommendations are:
- Choose mini-batch to be a multiple of 8
- Choose linear layer dimensions to be a multiple of 8
- Choose convolution layer channel counts to be a multiple of 8
- For classification problems, pad vocabulary to be a multiple of 8
- For sequence problems, pad the sequence length to be a multiple of 8
See: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensor-core-shape.
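The interaction between max_length and pad_to_multiple_of can be sketched in plain Python. This is an illustrative reading of the attribute descriptions, not the collator's actual implementation: the helper names `round_up_to_multiple` and `target_length` are hypothetical, and the order of operations (clip to max_length, then round up) is an assumption.

```python
from __future__ import annotations


def round_up_to_multiple(length: int, multiple: int) -> int:
    """Return the smallest multiple of `multiple` that is >= `length`."""
    # Ceiling division written with floor division on the negated value.
    return -(-length // multiple) * multiple


def target_length(
    lengths: list[int],
    max_length: int | None = None,
    pad_to_multiple_of: int | None = None,
) -> int:
    """Hypothetical helper: the padded length a batch would end up with."""
    if max_length is not None and pad_to_multiple_of is not None:
        # Mirrors the documented constraint on max_length.
        if max_length % pad_to_multiple_of != 0:
            raise ValueError("max_length must be a multiple of pad_to_multiple_of")

    longest = max(lengths)
    if max_length is not None:
        # Assumption: longer sequences are truncated to max_length.
        longest = min(longest, max_length)
    if pad_to_multiple_of is not None:
        longest = round_up_to_multiple(longest, pad_to_multiple_of)
    return longest


print(target_length([5, 13, 9], pad_to_multiple_of=8))
print(target_length([5, 13, 40], max_length=32, pad_to_multiple_of=8))
```

Under these assumptions, requiring max_length to be a multiple of pad_to_multiple_of guarantees that rounding up never pushes the padded length past max_length.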
tokenizer
instance-attribute
tokenizer: PreTrainedTokenizerBase
The tokenizer to use to configure padding and truncation. Important attributes are pad_token_id, padding_side, and truncation_side.
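To illustrate what these three tokenizer attributes control, here is a minimal sketch that fits a single list of token ids to a fixed length. The function `fit_to_length` is hypothetical and works on plain lists; it follows the usual Hugging Face meaning of "left" and "right" and is not taken from the collator's source.

```python
def fit_to_length(
    ids: list[int],
    length: int,
    pad_token_id: int,
    padding_side: str = "right",
    truncation_side: str = "right",
) -> list[int]:
    """Hypothetical helper: truncate or pad `ids` to exactly `length` tokens."""
    if length < 1:
        raise ValueError("length must be at least 1")
    for side in (padding_side, truncation_side):
        if side not in ("left", "right"):
            raise ValueError(f"unknown side: {side!r}")

    if len(ids) > length:
        # truncation_side names the end from which tokens are dropped.
        ids = ids[:length] if truncation_side == "right" else ids[-length:]

    # padding_side names the end on which pad tokens are added.
    padding = [pad_token_id] * (length - len(ids))
    return ids + padding if padding_side == "right" else padding + ids


print(fit_to_length([1, 2, 3], 5, pad_token_id=0, padding_side="left"))
print(fit_to_length([1, 2, 3, 4, 5], 3, pad_token_id=0, truncation_side="left"))
```

Left padding is common for decoder-style generation, where the real tokens should sit next to the position being generated; which side this collator expects depends on the tokenizer passed in.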
pad
Pack a list or tuple of variable length tensors into a single 2D tensor, padding with the given value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| | list[Tensor] \| tuple[Tensor, ...] | The sequences to pad. | required |
| | float | The value to use for padding. | required |
Returns:
Type | Description |
---|---|
torch.Tensor | A 2D tensor of padded sequences. |
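The packing behavior described above can be sketched with nested Python lists standing in for tensors. The function `pad_rows` is a hypothetical illustration that always pads on the right; the real method operates on torch tensors and may honor the tokenizer's padding_side.

```python
from typing import Sequence


def pad_rows(sequences: Sequence[Sequence[float]], value: float) -> list[list[float]]:
    """Hypothetical helper: pack ragged rows into a rectangular 2D list."""
    # Every row is extended to the length of the longest row.
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (width - len(seq)) for seq in sequences]


batch = pad_rows([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]], value=0.0)
for row in batch:
    print(row)
```

For actual tensors, `torch.nn.utils.rnn.pad_sequence(sequences, batch_first=True, padding_value=value)` produces a comparable right-padded 2D result.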