# noisy_transformer_masking_model

## NoiseAttentionNoisyModelOutput `dataclass`

Bases: `NoisyModelOutput[T]`

A `NoisyModelOutput` with the `noise_attention_mask` used to apply the noise.
Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.
### `__init_subclass__`

Register subclasses as pytree nodes.

This is necessary to synchronize gradients when using `torch.nn.parallel.DistributedDataParallel(static_graph=True)` with modules that output `ModelOutput` subclasses.

See: https://github.com/pytorch/pytorch/issues/106690.
### to_tuple

Convert `self` to a tuple containing all the attributes/keys that are not `None`.

Returns:

| Type | Description |
|---|---|
| `tuple[Any, ...]` | A tuple of all attributes/keys that are not `None`. |
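For example, a minimal sketch, assuming `output` is a `NoiseAttentionNoisyModelOutput` returned by a forward pass of a noisy model:

>>> values = output.to_tuple()
>>> all(value is not None for value in values)
True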
## NoiseMaskedNoisyTransformerModel

Bases: `NoisyTransformerModel[CausalModelT, NoiseLayerP, NoiseLayerT]`

A `NoisyTransformerModel` that adds noise to a portion of the inputs, excluding any special tokens.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `noise_layer_class` | `NoiseLayerConstructor[NoiseLayerP, NoiseLayerT]` | The type of noise that is added to the given model. | *required* |
| `base_model` | `CausalModelT` | The model to add noise to. | *required* |
| `input_shape` | `tuple[int, ...]` | The shape of the model input; used to infer the shape of the noise layer. | *required* |
| `target_layer` | `str` | Name of the layer whose output noise will be added to. A submodule of the model may be specified by providing its dot-delimited attribute path. | `'input'` |
| `target_parameter` | `str \| None` | If the target layer is the input, the keyword parameter to which noise is added (default: `None`). By default, noise is added to the first positional parameter of the model's `forward` method. | `None` |
| `truncated_layer_index` | `int \| None` | The layer index to truncate the model at. | `None` |
| `*args` | `args` | Positional arguments to the noise layer constructor. | *required* |
| `**kwargs` | `kwargs` | Keyword arguments to the noise layer constructor. | *required* |
### config `property`

`config: PretrainedConfig`

Return the config of the base model.

Returns:

| Type | Description |
|---|---|
| `PretrainedConfig` | The config of the base model. |
### is_truncated_and_offloaded `property`

`is_truncated_and_offloaded: bool`

Whether the model decoder layers are currently truncated.
### single_precision_input_embeddings `cached property`

`single_precision_input_embeddings: Embedding`

A single-precision copy of the input embeddings.
### target_parameter `property`

`target_parameter: str | None`

The `base_model.forward` parameter to which noise is added.
### target_parameter_index `cached property`

`target_parameter_index: int`

The index of the `base_model.forward` parameter to which noise is added.
### `__init__`

`__init__(noise_layer_class: NoiseLayerConstructor[NoiseLayerP, NoiseLayerT], base_model: CausalModelT, input_shape: tuple[int, ...], target_layer: str = 'input', target_parameter: str | None = None, truncated_layer_index: int | None = None, *args: args, **kwargs: kwargs) -> None`
Changed in version 0.74.0: The `noise_token_mask` was renamed to `noise_mask` to create a uniform interface everywhere.
### distillation_context

Prepare the base model to facilitate distillation training by applying losses over the transformed and non-transformed activations.

Returns:

| Type | Description |
|---|---|
| `contextlib.ExitStack` | A context manager that detaches the hooks when exited. |
Added in version 0.55.0.
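A usage sketch, assuming the `noisy_model`, `input_ids`, `attention_mask`, and `noise_mask` constructed in the `generate` example below (passing `noise_mask` to the forward call is assumed to follow the same interface as `generate`); the hooks are detached when the context exits:

>>> with noisy_model.distillation_context():
...     output = noisy_model(
...         input_ids=input_ids,
...         attention_mask=attention_mask,
...         noise_mask=noise_mask,
...     )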
### forward

Delegate calls to the base model.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `args` | `Any` | Inputs to the base model. | *required* |
| `kwargs` | `Any` | Keyword arguments to the base model. | *required* |

Returns:

| Type | Description |
|---|---|
| `NoiseAttentionNoisyModelOutput` | The result of the underlying model with noise added to the output of the base model's target layer. |
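A call sketch, assuming the `noisy_model`, `input_ids`, `attention_mask`, and `noise_mask` constructed in the `generate` example below (the `noise_mask` keyword is assumed to follow the same interface as `generate`):

>>> output = noisy_model(
...     input_ids=input_ids,
...     attention_mask=attention_mask,
...     noise_mask=noise_mask,
... )
>>> isinstance(output, NoiseAttentionNoisyModelOutput)
True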
### from_pretrained `classmethod`

`from_pretrained(save_directory: str | Path, base_model_directory: str | Path | None = None, **kwargs: Any) -> Self`

Load the model from a `save_pretrained` directory, and optionally load the base model from a different directory.

Mirrors the `from_pretrained` method of the Hugging Face transformers models so as to be compatible with their API calls.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `save_directory` | `str \| Path` | The path to the saved model. | *required* |
| `base_model_directory` | `str \| Path \| None` | The path to the saved base model, if not the same as `save_directory`. | `None` |
| `**kwargs` | `Any` | Keyword arguments to pass to the base model's `from_pretrained` method. | *required* |

Returns:

| Type | Description |
|---|---|
| `Self` | The loaded model. |
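A round-trip sketch, assuming a `noisy_model` as constructed in the examples below and a hypothetical output directory:

>>> noisy_model.save_pretrained("/tmp/noisy_model")
>>> reloaded_model = NoiseMaskedNoisyTransformerModel.from_pretrained("/tmp/noisy_model")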
### generate

`generate(inputs: Tensor, *args: Any, noise_mask: Tensor, return_transformed_embeddings: bool = False, **kwargs: Any) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]`

Generate sequences of token ids using transformed embeddings.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `inputs` | `Tensor` | The sequences of input tokens to use as a prompt for generation. | *required* |
| `*args` | `Any` | Additional positional arguments to the base model's `generate` method. | *required* |
| `noise_mask` | `Tensor` | The mask that selects the elements of `inputs` to which the transform is applied. | *required* |
| `return_transformed_embeddings` | `bool` | Whether to return the transformed embeddings. Transformed embeddings can be used with `reconstruct_ids_from_embeddings` to recover the corresponding token ids. | `False` |
| `**kwargs` | `Any` | Additional keyword arguments to the base model's `generate` method. | *required* |

Returns:

| Type | Description |
|---|---|
| `torch.Tensor \| tuple[torch.Tensor, torch.Tensor]` | The generated token ids and, optionally, the transformed embeddings. |
Examples:

>>> import torch
>>> import transformers
>>> from stainedglass_core import metrics as sg_metrics
>>> from stainedglass_core.huggingface import generation as sg_generation
>>> pretrained_model_name_or_path = (
... "tests/resources/tokenizers/mini-Mistral-7B-Instruct-v0.2"
... )
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(
... pretrained_model_name_or_path
... )
>>> config = transformers.AutoConfig.from_pretrained(pretrained_model_name_or_path)
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(
... pretrained_model_name_or_path,
... torch_dtype=torch.bfloat16,
... )
>>> noisy_model = NoiseMaskedNoisyTransformerModel(
... transformer_cloak.TransformerCloak,
... base_model,
... input_shape=(-1, config.hidden_size),
... target_layer="model.embed_tokens",
... scale=(1e-8, 1.0),
... transformer_type=transformers.MistralModel,
... config_path=pretrained_model_name_or_path,
... )
>>> batch_size, seq_length = 1, 10
>>> input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_length))
>>> attention_mask = torch.hstack(
... [
... torch.zeros((batch_size, 2), dtype=torch.bool),
... torch.ones((batch_size, seq_length - 2), dtype=torch.bool),
... ]
... )
>>> noise_mask = torch.randint(0, 2, (batch_size, seq_length, 1), dtype=torch.bool)
>>> generation_config = sg_generation.StainedGlassGenerationConfig.from_tokenizer(
... tokenizer, max_length=seq_length + 1
... )
Generation without Stained Glass Transform:
>>> generated_ids = noisy_model.base_model.generate(
... inputs=input_ids,
... generation_config=generation_config,
... attention_mask=attention_mask,
... use_cache=True,
... )
Generation with Stained Glass Transform:
>>> generated_ids_from_transformed_embeddings = noisy_model.generate(
... inputs=input_ids,
... generation_config=generation_config,
... attention_mask=attention_mask,
... use_cache=True,
... noise_mask=noise_mask,
... )
Decoding the generated ids into text:
>>> generated_text_from_transformed_embeddings = tokenizer.batch_decode(
... generated_ids_from_transformed_embeddings[:, input_ids.shape[-1] :],
... skip_special_tokens=True,
... )
Using `return_transformed_embeddings=True` to compare the reconstructed input ids with the original input ids:
>>> generated_ids_from_transformed_embeddings, transformed_embeddings = (
... noisy_model.generate(
... inputs=input_ids,
... generation_config=generation_config,
... attention_mask=attention_mask,
... use_cache=True,
... noise_mask=noise_mask,
... return_transformed_embeddings=True,
... )
... )
>>> reconstructed_input_ids = noisy_model.reconstruct_ids_from_embeddings(
... transformed_embeddings
... )
>>> reconstructed_input_text = tokenizer.batch_decode(
... reconstructed_input_ids, skip_special_tokens=True
... )
>>> percentage_changed_input_ids = sg_metrics.percentage_changed_ids(
... input_ids, reconstructed_input_ids, noise_mask
... )
Added in version 0.86.0: to support generation with noisy models.
### get_extra_state

`get_extra_state() -> NoisyTransformerModelExtraState[PreTrainedModelT, noisy_model.NLP, noisy_model.NL]`

Return the extra state of the model.

Returns:

| Type | Description |
|---|---|
| `NoisyTransformerModelExtraState[PreTrainedModelT, noisy_model.NLP, noisy_model.NL]` | The extra state of the model. |
### gradient_checkpointing_enable

Enable gradient checkpointing on the base model.
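For example, assuming the `noisy_model` constructed in the examples above:

>>> noisy_model.gradient_checkpointing_enable()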
### noise_loss_wrapper

`noise_loss_wrapper(criterion: Callable[Concatenate[T, CriterionP], Tensor | dict[str, Tensor]], alpha: float | None, grad_scaler: GradScaler | None = None, backward_wrapper: BackwardWrapper | None = None) -> Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]`

Wrap the given criterion with a criterion that optimizes the noise layer.
This method has two modes:

- If `alpha` is a `float` between `0.0` and `1.0`, the returned criterion interpolates between the original criterion and a noise loss term, with `0.0` devolving to the original criterion and `1.0` devolving to the noise loss term (see the sketch below).
- If `alpha` is `None`, the returned criterion adaptively calculates the noise layer parameter gradient update using the gradients of the original criterion and the noise loss term, optimizing whichever is larger, using only the components of the larger gradient tensor that are orthogonal to the smaller gradient tensor. The loss returned is the original criterion loss, differentiable but detached from the graph, since the wrapped criterion calls `backward()` itself.
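As a rough sketch of the interpolated mode (a plausible reading of the boundary behavior above, not necessarily the exact internal weighting), the composite loss behaves like a convex combination: `composite_loss ≈ (1 - alpha) * model_loss + alpha * noise_loss`.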
Note

`criterion` must return either a `torch.Tensor` or a `dict` of `torch.Tensor` values that includes the key `'model_loss'`.
Note

The noise layer must return a loss tensor in order to optimize the noise layer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `criterion` | `Callable[Concatenate[T, CriterionP], Tensor \| dict[str, Tensor]]` | The original loss function. | *required* |
| `alpha` | `float \| None` | Interpolation factor between the original criterion (`0.0`) and the noise loss term (`1.0`). Higher values mean that noise is learned more quickly and that more noise can be added. This hyperparameter depends on the model, task, and loss function; in practice it can range anywhere from 0.0001 to 0.9999, so without prior knowledge you will need to perform a grid search over different alphas to find the best one for your model and task. Alternatively, if `alpha` is `None`, the adaptive (alphaless) mode described above is used. | *required* |
| `grad_scaler` | `GradScaler \| None` | A gradient scaler to use when training with automatic mixed precision. | `None` |
| `backward_wrapper` | `BackwardWrapper \| None` | A managed wrapper around the `backward()` call. | `None` |
Returns:

| Type | Description |
|---|---|
| `Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]` | A criterion that optimizes the noise layer using the wrapped criterion and the noise layer loss. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If |
| `ValueError` | If |
Examples:

>>> import torch
>>> from torch import nn
>>> from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
>>> model = nn.Linear(2, 2)
>>> model1 = sg_model.NoisyModel(
... sg_noise_layer.CloakNoiseLayer1, model, input_shape=(-1, 2)
... )
>>> model2 = sg_model.NoisyModel(
... sg_noise_layer.CloakNoiseLayer2,
... model,
... input_shape=(-1, 2),
... percent_to_mask=0.42,
... )
>>> criterion = nn.functional.mse_loss
>>> input = torch.rand(2, 2)
>>> labels = torch.randint(0, 2, (2, 2), dtype=torch.float32)
Alpha
>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
Alphaless
>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()
Alphaless with AMP
>>> import torch.cuda.amp
>>> grad_scaler = torch.cuda.amp.GradScaler()
>>> stainedglass_loss = model1.noise_loss_wrapper(
... criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(
... criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()
Changed in version 0.76.1: Added `composite_loss` key to the returned losses dictionary when specifying `alpha=None` to maintain a consistent interface between alpha and alphaless training.
### reconstruct_ids_from_embeddings

`reconstruct_ids_from_embeddings(embeddings: Tensor) -> torch.Tensor`

Reconstruct token ids from embeddings using L2 similarity search on the input embedding layer.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embeddings` | `Tensor` | The embeddings of shape `(batch_size, seq_length, embedding_dim)`. | *required* |

Returns:

| Type | Description |
|---|---|
| `torch.Tensor` | The token ids of shape `(batch_size, seq_length)`. |
Examples:

>>> import torch
>>> import transformers
>>> pretrained_model_name_or_path = (
... "tests/resources/tokenizers/mini-Mistral-7B-Instruct-v0.2"
... )
>>> config = transformers.AutoConfig.from_pretrained(pretrained_model_name_or_path)
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(
... pretrained_model_name_or_path,
... torch_dtype=torch.bfloat16,
... )
>>> noisy_model = NoiseMaskedNoisyTransformerModel(
... transformer_cloak.TransformerCloak,
... base_model,
... input_shape=(-1, config.hidden_size),
... target_layer="model.embed_tokens",
... scale=(1e-8, 1.0),
... transformer_type=transformers.MistralModel,
... config_path=pretrained_model_name_or_path,
... )
>>> batch_size, seq_length = 1, 10
>>> input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_length))
>>> attention_mask = torch.hstack(
... [
... torch.zeros((batch_size, 2), dtype=torch.bool),
... torch.ones((batch_size, seq_length - 2), dtype=torch.bool),
... ]
... )
>>> noise_mask = torch.randint(0, 2, (batch_size, seq_length, 1), dtype=torch.bool)
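A completion sketch using the setup above, mirroring the `generate` example (forwarding `attention_mask` to the noise layer is an assumption based on that example):

>>> transformed_embeddings = noisy_model.sample_transformed_embeddings(
... input_ids,
... noise_mask=noise_mask,
... attention_mask=attention_mask,
... )
>>> reconstructed_input_ids = noisy_model.reconstruct_ids_from_embeddings(
... transformed_embeddings
... )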
### restore_and_load

Restore the final decoder layers and final normalization layer and move them back to their original devices.

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the |
### sample_transformed_embeddings

Sample transformed embeddings for the given input token ids.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_ids` | `Tensor` | The sequences of input tokens to transform. | *required* |
| `noise_mask` | `Tensor` | The mask that selects the elements of `input_ids` to transform. | *required* |
| `**kwargs` | `Any` | Additional keyword arguments to the noise layer's forward call. | *required* |

Returns:

| Type | Description |
|---|---|
| `torch.Tensor` | Sampled transformed embeddings. |
Examples:

>>> import torch
>>> import transformers
>>> pretrained_model_name_or_path = (
... "tests/resources/tokenizers/mini-Mistral-7B-Instruct-v0.2"
... )
>>> config = transformers.AutoConfig.from_pretrained(pretrained_model_name_or_path)
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(
... pretrained_model_name_or_path,
... torch_dtype=torch.bfloat16,
... )
>>> noisy_model = NoiseMaskedNoisyTransformerModel(
... transformer_cloak.TransformerCloak,
... base_model,
... input_shape=(-1, config.hidden_size),
... target_layer="model.embed_tokens",
... scale=(1e-8, 1.0),
... transformer_type=transformers.MistralModel,
... config_path=pretrained_model_name_or_path,
... )
>>> batch_size, seq_length = 1, 10
>>> input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_length))
>>> attention_mask = torch.hstack(
... [
... torch.zeros((batch_size, 2), dtype=torch.bool),
... torch.ones((batch_size, seq_length - 2), dtype=torch.bool),
... ]
... )
>>> noise_mask = torch.randint(0, 2, (batch_size, seq_length, 1), dtype=torch.bool)
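A completion sketch using the setup above (forwarding `attention_mask` to the noise layer is an assumption based on the `generate` example):

>>> transformed_embeddings = noisy_model.sample_transformed_embeddings(
... input_ids,
... noise_mask=noise_mask,
... attention_mask=attention_mask,
... )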
### save_pretrained

Save the model to a directory.

Mirrors the `save_pretrained` method of the Hugging Face transformers models so as to be compatible with their API calls.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `save_directory` | `str \| Path` | The directory to save the model to. | *required* |
| `only_noise_layer` | `bool` | Whether to only save the noise layer, or also the base model. | `False` |
| `**kwargs` | `Any` | Keyword arguments to pass to the base model's `save_pretrained` method. | *required* |
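A sketch, assuming the `noisy_model` from the examples above and a hypothetical output directory:

>>> noisy_model.save_pretrained("/tmp/noisy_model", only_noise_layer=True)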
### set_extra_state

`set_extra_state(state: NoisyTransformerModelExtraState[PreTrainedModelT, NLP, NL]) -> None`

Set the extra state contained in the loaded `state_dict`.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `state` | `NoisyTransformerModelExtraState[PreTrainedModelT, NLP, NL]` | The extra state, returned by `get_extra_state`. | *required* |
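These hooks are invoked automatically through `state_dict()` and `load_state_dict()`; a round-trip sketch, assuming the `noisy_model` from the examples above:

>>> state_dict = noisy_model.state_dict()
>>> noisy_model.load_state_dict(state_dict)
<All keys matched successfully>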
### truncate_and_offload

Remove the decoder layers after `truncated_layer_index` and the final normalization layer from the model and move them to the CPU.

Decoder layer truncation improves runtime performance and lowers memory usage, but removes access to the logits layer and thereby sacrifices the computability of metrics such as perplexity.

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the |
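A usage sketch, assuming the `noisy_model` from the examples above was constructed with a `truncated_layer_index`:

>>> noisy_model.truncate_and_offload()
>>> noisy_model.is_truncated_and_offloaded
True
>>> noisy_model.restore_and_load()
>>> noisy_model.is_truncated_and_offloaded
False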