# model

## NoisyModel

Bases: `SGModel[M]`, `Generic[M, NLP, NL]`

Wrapper class that adds noise to the output of an arbitrary layer of the base model.
Parameters:

Name | Type | Description | Default
---|---|---|---
`noise_layer_class` | `NoiseLayerConstructor[NLP, NL]` | The type of noise that is added to the given model. | required
`base_model` | `M` | The model to add noise to. | required
`input_shape` | `tuple[int, ...]` | The shape of the model input; used to infer the shape of the noise layer. | required
`target_layer` | `str` | Name of the layer to whose output noise will be added. A submodule of the model may be specified by providing its fully qualified name. | `'input'`
`target_parameter` | `str \| None` | If the target layer is the input, the keyword parameter to which noise is added. By default, noise is added to the first positional parameter of the model's forward method. | `None`
`*args` | `args` | Positional arguments to the noise layer constructor. | `()`
`**kwargs` | `kwargs` | Keyword arguments to the noise layer constructor. | `{}`
Raises:

Type | Description
---|---
`AttributeError` | If the `target_layer` does not exist, or if the target layer already has a `noise_layer` attribute.
`ValueError` | If the `target_layer` is not called from `model.forward()` and its size cannot be determined.
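A minimal construction sketch for orientation; the submodule name passed to `target_layer` is an assumption based on the parameter description above (by default noise is added to the model input):

```python
import torch
from torch import nn
from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer

base_model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Add noise to the output of the first Linear layer; "0" is that submodule's
# name inside the Sequential (hypothetical target_layer value).
noisy_model = sg_model.NoisyModel(
    sg_noise_layer.CloakNoiseLayer1,
    base_model,
    input_shape=(-1, 4),
    target_layer="0",
)
```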
### target_parameter *(property)*

`target_parameter: str | None`

The `base_model.forward` parameter to which noise is added.

### target_parameter_index *(cached property)*

`target_parameter_index: int`

The index of the `base_model.forward` parameter to which noise is added.
### forward

Delegate calls to the base model.

Parameters:

Name | Type | Description | Default
---|---|---|---
`args` | `Any` | Inputs to the base model. | required
`kwargs` | `Any` | Keyword arguments to the base model. | required

Returns:

Type | Description
---|---
`NoisyModelOutput[Any]` | The result of the underlying model with noise added to the output of the base model's target layer.
### noise_loss_wrapper

`noise_loss_wrapper(criterion: Callable[Concatenate[T, CriterionP], Tensor | dict[str, Tensor]], alpha: float | None, grad_scaler: GradScaler | None = None, backward_wrapper: BackwardWrapper | None = None) -> Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]`
Wrap the given criterion with a criterion that optimizes the noise layer.
This method has two modes:

- If `alpha` is a `float` between `0.0` and `1.0`, the returned criterion interpolates between the original criterion and a noise loss term, with `0.0` devolving to the original criterion and `1.0` devolving to the noise loss term.
- If `alpha` is `None`, the returned criterion adaptively calculates the noise layer parameter gradient update using the gradients of the original criterion and the noise loss term, optimizing whichever is larger, using only the components of the larger gradient tensor that are orthogonal to the smaller gradient tensor. The returned loss is the original criterion loss, differentiable but detached from the graph, since the wrapped criterion calls `backward()` itself.
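In the interpolating mode, the composite loss is presumably of the form `(1 - alpha) * model_loss + alpha * noise_loss`. The alphaless projection step can be sketched as follows; this is an illustrative reconstruction of the scheme described above (the function name is hypothetical), not the library's actual implementation:

```python
import torch

def combine_gradients(grad_model: torch.Tensor, grad_noise: torch.Tensor) -> torch.Tensor:
    """Follow the larger of the two gradients, keeping only its components
    orthogonal to the smaller gradient (sketch of the alphaless mode)."""
    if grad_model.norm() >= grad_noise.norm():
        larger, smaller = grad_model, grad_noise
    else:
        larger, smaller = grad_noise, grad_model
    # Remove the component of the larger gradient parallel to the smaller one,
    # leaving only the orthogonal component as the parameter update direction.
    parallel_coeff = (larger.flatten() @ smaller.flatten()) / (
        smaller.flatten().norm() ** 2 + 1e-12
    )
    return larger - parallel_coeff * smaller
```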
Note

`criterion` must either return a `torch.Tensor`, or a `dict` of `torch.Tensor` values that includes the key `'model_loss'`.
Note
The noise layer must return a loss tensor in order to optimize the noise layer.
Parameters:

Name | Type | Description | Default
---|---|---|---
`criterion` | `Callable[Concatenate[T, CriterionP], Tensor \| dict[str, Tensor]]` | The original loss function. | required
`alpha` | `float \| None` | Interpolation factor between the original criterion (`0.0`) and the noise loss term (`1.0`). Higher values mean that noise is learned more quickly and that more noise can be added. This hyperparameter depends on the model, task, and loss function; in practice, useful values range anywhere from 0.0001 to 0.9999. Without prior knowledge, you will need to perform a grid search over different alphas to find the best one for your model and task. Alternatively, if `None`, the gradient update is calculated adaptively (see above). | required
`grad_scaler` | `GradScaler \| None` | A `GradScaler` to use when training with automatic mixed precision. | `None`
`backward_wrapper` | `BackwardWrapper \| None` | A `BackwardWrapper` that manages the `backward()` call. | `None`
Returns:

Type | Description
---|---
`Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]` | A criterion that optimizes the noise layer using the wrapped criterion and the noise layer loss.
Raises:

Type | Description
---|---
`ValueError` | If `alpha` is not `None` and does not fall between `0.0` and `1.0`.
`ValueError` | If the noise layer did not return a loss tensor.
Examples:

>>> import torch
>>> from torch import nn
>>> from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
>>> model = nn.Linear(2, 2)
>>> model1 = sg_model.NoisyModel(
... sg_noise_layer.CloakNoiseLayer1, model, input_shape=(-1, 2)
... )
>>> model2 = sg_model.NoisyModel(
... sg_noise_layer.CloakNoiseLayer2,
... model,
... input_shape=(-1, 2),
... percent_to_mask=0.42,
... )
>>> criterion = nn.functional.mse_loss
>>> input = torch.rand(2, 2)
>>> labels = torch.randint(0, 2, (2, 2), dtype=torch.float32)
**Alpha**
>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
**Alphaless**
>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()
**Alphaless with AMP**
>>> import torch.cuda.amp
>>> grad_scaler = torch.cuda.amp.GradScaler()
>>> stainedglass_loss = model1.noise_loss_wrapper(
... criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(
... criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()
Changed in version 0.76.1: Added `composite_loss` key to the returned losses dictionary when specifying `alpha=None` to maintain a consistent interface between alpha and alphaless training.
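Putting the pieces above together, a sketch of one full training step in the interpolating mode; only names shown in the examples above are used, plus standard PyTorch optimizer boilerplate:

```python
import torch
from torch import nn
from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer

base_model = nn.Linear(2, 2)
noisy_model = sg_model.NoisyModel(
    sg_noise_layer.CloakNoiseLayer1, base_model, input_shape=(-1, 2)
)
optimizer = torch.optim.SGD(noisy_model.parameters(), lr=1e-3)
stainedglass_loss = noisy_model.noise_loss_wrapper(
    nn.functional.mse_loss, alpha=0.8
)

input, labels = torch.rand(8, 2), torch.rand(8, 2)
optimizer.zero_grad()
losses = stainedglass_loss(noisy_model(input), labels)
# With a float alpha we call backward() ourselves; in the alphaless mode
# (alpha=None) the wrapped criterion calls backward() itself.
losses["composite_loss"].backward()
optimizer.step()
```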
## NoisyModelDataParallel

Bases: `DataParallel`, `Generic[M, NLP, NL]`

Implements multi-GPU support for `NoisyModel` by updating `NoisyModel` submodule references in the replicated modules.

Access to `NoisyModel` submodules is granted to the model it wraps by inserting references into the `__dict__` objects of certain wrapped model submodules. When the `NoisyModel` is replicated across multiple GPUs, these references become stale and must be updated to refer to the replicated `NoisyModel` submodules.
Parameters:

Name | Type | Description | Default
---|---|---|---
`module` | `NoisyModel[M, NLP, NL]` | The `NoisyModel` to parallelize. | required
`device_ids` | `Sequence[int \| device] \| None` | The CUDA devices to use (default: all devices). | `None`
`output_device` | `int \| device \| None` | Device location of output (default: `device_ids[0]`). | `None`
`dim` | `int` | The dimension along which to split the input across the devices (default: 0). | `0`
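A usage sketch, assuming `NoisyModelDataParallel` is exported from the same module as `NoisyModel` and at least one CUDA device is available:

```python
import torch
from torch import nn
from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer

base_model = nn.Linear(2, 2)
noisy_model = sg_model.NoisyModel(
    sg_noise_layer.CloakNoiseLayer1, base_model, input_shape=(-1, 2)
).cuda()

# Replicates across all visible GPUs by default; batches split along dim 0.
parallel_model = sg_model.NoisyModelDataParallel(noisy_model)
output = parallel_model(torch.rand(8, 2).cuda())
```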
### forward

Aggregate the noise layer loss across all GPUs.

Parameters:

Name | Type | Description | Default
---|---|---|---
`*args` | `Any` | Variable length argument list. | required
`**kwargs` | `Any` | Arbitrary keyword arguments. | required

Returns:

Type | Description
---|---
`noisy_model.NoisyModelOutput[Any]` | The `NoisyModelOutput` with the noise layer loss aggregated across all GPUs.
### replicate

`replicate(module: NoisyModel[M, NLP, NL], device_ids: Sequence[int | device]) -> list[noisy_model.NoisyModel[M, NLP, NL]]`

Update the forward hooks to use replicas. This is necessary since the forward hooks are methods bound to the original `NoisyModel`.
## NoisyModelOutput *(dataclass)*

Bases: `SGModelOutput[T]`

The output of `NoisyModel.forward()`.
### __init_subclass__

Register subclasses as pytree nodes.

This is necessary to synchronize gradients when using `torch.nn.parallel.DistributedDataParallel(static_graph=True)` with modules that output `ModelOutput` subclasses.

See: https://github.com/pytorch/pytorch/issues/106690.
### to_tuple

Convert self to a tuple containing all the attributes/keys that are not `None`.

Returns:

Type | Description
---|---
`tuple[Any, ...]` | A tuple of all attributes/keys that are not `None`.
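For example, with a `NoisyModel` like `model1` from the examples above (the exact fields of `NoisyModelOutput` are not listed here, so this only illustrates the call shape):

```python
output = model1(torch.rand(2, 2))
values = output.to_tuple()  # positional view, skipping attributes that are None
```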
## NoisyTransformerModel

Bases: `NoisyModel[PreTrainedModelT, NLP, NL]`

Overloads `NoisyModel` methods to enable adding noise correctly to tensors batched with sequences, specifically Transformers.
### config *(property)*

`config: PretrainedConfig`

Return the config of the base model.

Returns:

Type | Description
---|---
`PretrainedConfig` | The config of the base model.
### target_parameter *(property)*

`target_parameter: str | None`

The `base_model.forward` parameter to which noise is added.

### target_parameter_index *(cached property)*

`target_parameter_index: int`

The index of the `base_model.forward` parameter to which noise is added.
### forward

Delegate calls to the base model.

Parameters:

Name | Type | Description | Default
---|---|---|---
`args` | `Any` | Inputs to the base model. | required
`kwargs` | `Any` | Keyword arguments to the base model. | required

Returns:

Type | Description
---|---
`NoisyModelOutput[Any]` | The result of the underlying model with noise added to the output of the base model's target layer.
### from_pretrained *(classmethod)*

`from_pretrained(save_directory: str | Path, base_model_directory: str | Path | None = None, **kwargs: Any) -> Self`

Load the model from a `save_pretrained` directory, and optionally load the base model from a different directory.

Mirrors the `from_pretrained` method of the Hugging Face transformers models so as to be compatible with their API calls.
Parameters:

Name | Type | Description | Default
---|---|---|---
`save_directory` | `str \| Path` | The path to the saved model. | required
`base_model_directory` | `str \| Path \| None` | The path to the saved base model, if not the same as `save_directory`. | `None`
`**kwargs` | `Any` | Keyword arguments to pass to the base model's `from_pretrained` method. | required
Returns:

Type | Description
---|---
`Self` | The loaded model.
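A loading sketch; the paths are hypothetical:

```python
from stainedglass_core import model as sg_model

noisy_model = sg_model.NoisyTransformerModel.from_pretrained(
    "checkpoints/noisy-model",
    base_model_directory="checkpoints/base-model",  # omit if co-located
)
```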
### get_extra_state

`get_extra_state() -> NoisyTransformerModelExtraState[PreTrainedModelT, noisy_model.NLP, noisy_model.NL]`

Return the extra state of the model.

Returns:

Type | Description
---|---
`NoisyTransformerModelExtraState[PreTrainedModelT, noisy_model.NLP, noisy_model.NL]` | The extra state of the model.
### gradient_checkpointing_enable

Enable gradient checkpointing on the base model.
### noise_loss_wrapper

`noise_loss_wrapper(criterion: Callable[Concatenate[T, CriterionP], Tensor | dict[str, Tensor]], alpha: float | None, grad_scaler: GradScaler | None = None, backward_wrapper: BackwardWrapper | None = None) -> Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]`
Wrap the given criterion with a criterion that optimizes the noise layer.
This method has two modes:

- If `alpha` is a `float` between `0.0` and `1.0`, the returned criterion interpolates between the original criterion and a noise loss term, with `0.0` devolving to the original criterion and `1.0` devolving to the noise loss term.
- If `alpha` is `None`, the returned criterion adaptively calculates the noise layer parameter gradient update using the gradients of the original criterion and the noise loss term, optimizing whichever is larger, using only the components of the larger gradient tensor that are orthogonal to the smaller gradient tensor. The returned loss is the original criterion loss, differentiable but detached from the graph, since the wrapped criterion calls `backward()` itself.
Note

`criterion` must either return a `torch.Tensor`, or a `dict` of `torch.Tensor` values that includes the key `'model_loss'`.
Note
The noise layer must return a loss tensor in order to optimize the noise layer.
Parameters:

Name | Type | Description | Default
---|---|---|---
`criterion` | `Callable[Concatenate[T, CriterionP], Tensor \| dict[str, Tensor]]` | The original loss function. | required
`alpha` | `float \| None` | Interpolation factor between the original criterion (`0.0`) and the noise loss term (`1.0`). Higher values mean that noise is learned more quickly and that more noise can be added. This hyperparameter depends on the model, task, and loss function; in practice, useful values range anywhere from 0.0001 to 0.9999. Without prior knowledge, you will need to perform a grid search over different alphas to find the best one for your model and task. Alternatively, if `None`, the gradient update is calculated adaptively (see above). | required
`grad_scaler` | `GradScaler \| None` | A `GradScaler` to use when training with automatic mixed precision. | `None`
`backward_wrapper` | `BackwardWrapper \| None` | A `BackwardWrapper` that manages the `backward()` call. | `None`
Returns:

Type | Description
---|---
`Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]` | A criterion that optimizes the noise layer using the wrapped criterion and the noise layer loss.
Raises:

Type | Description
---|---
`ValueError` | If `alpha` is not `None` and does not fall between `0.0` and `1.0`.
`ValueError` | If the noise layer did not return a loss tensor.
Examples:

>>> import torch
>>> from torch import nn
>>> from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
>>> model = nn.Linear(2, 2)
>>> model1 = sg_model.NoisyModel(
... sg_noise_layer.CloakNoiseLayer1, model, input_shape=(-1, 2)
... )
>>> model2 = sg_model.NoisyModel(
... sg_noise_layer.CloakNoiseLayer2,
... model,
... input_shape=(-1, 2),
... percent_to_mask=0.42,
... )
>>> criterion = nn.functional.mse_loss
>>> input = torch.rand(2, 2)
>>> labels = torch.randint(0, 2, (2, 2), dtype=torch.float32)
**Alpha**
>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
**Alphaless**
>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()
**Alphaless with AMP**
>>> import torch.cuda.amp
>>> grad_scaler = torch.cuda.amp.GradScaler()
>>> stainedglass_loss = model1.noise_loss_wrapper(
... criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(
... criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()
Changed in version 0.76.1: Added `composite_loss` key to the returned losses dictionary when specifying `alpha=None` to maintain a consistent interface between alpha and alphaless training.
### save_pretrained

Save the model to a directory.

Mirrors the `save_pretrained` method of the Hugging Face transformers models so as to be compatible with their API calls.
Parameters:

Name | Type | Description | Default
---|---|---|---
`save_directory` | `str \| Path` | The directory to save the model to. | required
`only_noise_layer` | `bool` | Whether to only save the noise layer, or also the base model. | `False`
`**kwargs` | `Any` | Keyword arguments to pass to the base model's `save_pretrained` method. | required
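A saving sketch pairing with `from_pretrained` above; the path is hypothetical and `noisy_model` is an existing `NoisyTransformerModel` instance:

```python
# Save everything, or pass only_noise_layer=True to skip the base model weights.
noisy_model.save_pretrained("checkpoints/noisy-model")
noisy_model.save_pretrained("checkpoints/noisy-model", only_noise_layer=True)
```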
### set_extra_state

`set_extra_state(state: NoisyTransformerModelExtraState[PreTrainedModelT, NLP, NL]) -> None`

Set the extra state contained in the loaded state_dict.

Parameters:

Name | Type | Description | Default
---|---|---|---
`state` | `NoisyTransformerModelExtraState[PreTrainedModelT, NLP, NL]` | The extra state, returned by `get_extra_state()`. | required
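Note that `get_extra_state`/`set_extra_state` are the standard `torch.nn.Module` extra-state hooks, so the extra state round-trips through `state_dict` automatically:

```python
state = noisy_model.state_dict()    # includes the extra state via get_extra_state()
noisy_model.load_state_dict(state)  # restores it via set_extra_state()
```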
## SGModel

Base class for all Stained Glass models.
### __init__
### forward

Delegate calls to the base model.

Parameters:

Name | Type | Description | Default
---|---|---|---
`args` | `Any` | Inputs to the base model. | required
`kwargs` | `Dict[str, Any]` | Keyword arguments to the base model. | required

Returns:

Type | Description
---|---
`SGModelOutput[Any]` | The result of the underlying model with noise added to the output of the base model's target layer.
## SGModelOutput *(dataclass)*

Bases: `ModelOutput`, `Generic[T]`

The output of `SGModel.forward()`.

### __init_subclass__

Register subclasses as pytree nodes.

This is necessary to synchronize gradients when using `torch.nn.parallel.DistributedDataParallel(static_graph=True)` with modules that output `ModelOutput` subclasses.

See: https://github.com/pytorch/pytorch/issues/106690.
### to_tuple

Convert self to a tuple containing all the attributes/keys that are not `None`.

Returns:

Type | Description
---|---
`tuple[Any, ...]` | A tuple of all attributes/keys that are not `None`.
## TruncatedModule

Bases: `Module`, `Generic[ModuleT]`

A module that wraps another module and interrupts the forward pass when a specified truncation point is reached.

This truncation happens by temporarily adding a hook to the truncation point that raises a `TruncationExecutionFinished` exception, which is then caught by the `TruncatedModule` forward, and the output of the truncation point is returned.
Examples:

Instantiating a `TruncatedModule` with a binary classification model and a truncation point:
>>> model = torch.nn.Sequential(
... torch.nn.Linear(10, 20),
... torch.nn.ReLU(),
... torch.nn.Linear(20, 30),
... torch.nn.ReLU(),
... torch.nn.Linear(30, 40),
... torch.nn.ReLU(),
... torch.nn.Linear(40, 2),
... )
>>> truncation_layer = model[1]
>>> truncated_model = TruncatedModule(model, truncation_layer)
Using the `TruncatedModule` to get the output of the truncation point:
>>> input = torch.randn(1, 10)
>>> output = truncated_model(input)
>>> # Note that the output has the shape of the truncation point's output, not the full model's
>>> assert output.shape == (1, 20)
The base model of the `TruncatedModule` is completely unaffected by the truncation:
>>> base_output = model(input)
>>> assert base_output.shape == (1, 2) # Binary classification output shape
The base model is also accessible directly through the `module` attribute of the `TruncatedModule`:
>>> base_output = truncated_model.module(input)
>>> assert base_output.shape == (1, 2) # Binary classification output shape
Added in version 0.59.0.
### __init__

`__init__(module: ModuleT, truncation_point: Module) -> None`

Initialize the `TruncatedModule` with the provided module and truncation point.
Parameters:

Name | Type | Description | Default
---|---|---|---
`module` | `ModuleT` | The module to wrap. | required
`truncation_point` | `Module` | The submodule of the provided module at which to interrupt the forward pass. | required
Raises:

Type | Description
---|---
`ValueError` | If the truncation point is not a submodule of the provided module.
### forward

Forward pass of the `TruncatedModule` that interrupts execution when the truncation point is reached.
Parameters:

Name | Type | Description | Default
---|---|---|---
`*args` | `Any` | The positional arguments to pass to the wrapped module. | required
`**kwargs` | `Any` | The keyword arguments to pass to the wrapped module. | required
Returns:

Type | Description
---|---
`Any` | The output of the truncation point submodule.
Raises:

Type | Description
---|---
`HookNotCalledError` | If the truncation hook is not called, meaning the truncation point was not reached.
### lazy_register_truncation_hook

Create a prehook that will be added to the truncation point to interrupt the forward pass when the truncation point is reached.

Returns:

Type | Description
---|---
`_HandlerWrapper` | A handler wrapper that contains the hook that was added to the truncation point.
### truncation_hook *(staticmethod)*

Intercept the output of the truncation point and raise a `TruncationExecutionFinished` exception containing that output.
Parameters:

Name | Type | Description | Default
---|---|---|---
`truncation_point` | `Module` | The truncation point submodule. Unused. | required
`args` | `Any` | The arguments passed to the truncation point. Unused. | required
`output` | `Tensor` | The output of the truncation point. This is the output that will be returned by the `TruncatedModule` forward. | required
Raises:

Type | Description
---|---
`TruncationExecutionFinished` | Always, in order to interrupt the wrapped model's forward pass.