model

NoisyModel

Bases: SGModel[M], Generic[M, NLP, NL]

Wrapper class that adds noise to the output of an arbitrary layer of the base model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `noise_layer_class` | `NoiseLayerConstructor[NLP, NL]` | The type of noise that is added to the given model. | required |
| `base_model` | `M` | The model to add noise to. | required |
| `input_shape` | `tuple[int, ...]` | The shape of the model input; used to infer the shape of the noise layer. | required |
| `target_layer` | `str` | Name of the layer to whose output noise is added. A submodule of the model may be specified by its `.`-delimited name, e.g. `features.0.conv.1.2`. | `'input'` |
| `target_parameter` | `str \| None` | If the target layer is the input, the keyword parameter to which noise is added. By default, noise is added to the first positional parameter of the model's forward method. | `None` |
| `*args` | `args` | Positional arguments to the `noise_layer_class`. | `()` |
| `**kwargs` | `kwargs` | Keyword arguments to the `noise_layer_class`. | `{}` |

Raises:

| Type | Description |
| --- | --- |
| `AttributeError` | If the `target_layer` does not exist, or if the target layer already has a `noise_layer` attribute. |
| `ValueError` | If the `target_layer` is not called from `model.forward()` and its size cannot be determined. |
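For concreteness, a minimal construction sketch is below. It assumes the constructor signature documented above and reuses `CloakNoiseLayer1` from the examples later on this page; the `target_layer` value (`"0"`, the dotted name of the first submodule of the `Sequential`) is illustrative.

```python
import torch
from torch import nn

from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer

# Wrap a small model and add noise to the output of its first Linear layer,
# addressed by its dotted submodule name ("0" within the Sequential).
base = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
noisy = sg_model.NoisyModel(
    sg_noise_layer.CloakNoiseLayer1,
    base,
    input_shape=(-1, 4),
    target_layer="0",
)
output = noisy(torch.rand(2, 4))  # a NoisyModelOutput
```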

input_shape property

input_shape: tuple[int, ...]

The expected shape of the input to the base model.

target_layer property

target_layer: Module

The base_model layer to which noise is added.

target_parameter property

target_parameter: str | None

The base_model.forward parameter to which noise is added.

target_parameter_index cached property

target_parameter_index: int

The index of the base_model.forward parameter to which noise is added.

forward

forward(*args: Any, **kwargs: Any) -> NoisyModelOutput[Any]

Delegate calls to the base model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `args` | `Any` | Inputs to the base model. | required |
| `kwargs` | `Any` | Keyword arguments to the base model. | required |

Returns:

| Type | Description |
| --- | --- |
| `NoisyModelOutput[Any]` | The result of the underlying model with noise added to the output of the base model's target layer. |

noise_loss_wrapper

noise_loss_wrapper(criterion: Callable[Concatenate[T, CriterionP], Tensor | dict[str, Tensor]], alpha: float | None, grad_scaler: GradScaler | None = None, backward_wrapper: BackwardWrapper | None = None) -> Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]

Wrap the given criterion with a criterion that optimizes the noise layer.

This method has two modes:
  1. If alpha is a float between 0.0 and 1.0, the returned criterion interpolates between the original criterion and a noise loss term, with 0.0 devolving to the original criterion and 1.0 devolving to the noise loss term.
  2. If alpha is None, the returned criterion adaptively calculates the noise layer parameter gradient update from the gradients of the original criterion and the noise loss term: it optimizes whichever gradient is larger, using only the components of the larger gradient tensor that are orthogonal to the smaller gradient tensor (see the sketch below). The returned loss is the original criterion loss, detached from the graph, since the wrapped criterion calls backward() itself.
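As an illustration of the alphaless update in mode 2, the sketch below shows one way to combine two gradient tensors as described: keep the larger gradient (by norm), but only its component orthogonal to the smaller one. This is a hand-written illustration of the idea, not the library's internal implementation.

```python
import torch

def alphaless_update(g_model: torch.Tensor, g_noise: torch.Tensor) -> torch.Tensor:
    """Illustrative only: combine gradients as described in mode 2."""
    # Optimize whichever gradient is larger...
    if g_model.norm() >= g_noise.norm():
        larger, smaller = g_model, g_noise
    else:
        larger, smaller = g_noise, g_model
    # ...keeping only the components orthogonal to the smaller gradient.
    unit = smaller / (smaller.norm() + 1e-12)
    projection = (larger * unit).sum() * unit
    return larger - projection
```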
Note

criterion must return either a torch.Tensor or a dict of torch.Tensor values that includes the key 'model_loss'.

Note

The noise layer must return a loss tensor in order to optimize the noise layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `criterion` | `Callable[Concatenate[T, CriterionP], Tensor \| dict[str, Tensor]]` | The original loss function. | required |
| `alpha` | `float \| None` | Interpolation factor between the original criterion (0.0) and the noise loss term (1.0). Higher values learn noise more quickly and allow more noise to be added. This hyperparameter depends on the model, task, and loss function; in practice it can range anywhere from 0.0001 to 0.9999, so without prior knowledge you will need to grid search over alpha to find the best value for your model and task. If `None`, the original criterion loss and the noise loss term are optimized adaptively. | required |
| `grad_scaler` | `GradScaler \| None` | A `GradScaler` used to scale the alphaless loss gradients when using automatic mixed precision (AMP). | `None` |
| `backward_wrapper` | `BackwardWrapper \| None` | A wrapper that manages `backward()` and gradient scaling, e.g. `accelerate.Accelerator` or `lightning.fabric.fabric.Fabric`, used to scale the alphaless loss gradients when using automatic mixed precision (AMP). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]` | A criterion that optimizes the noise layer using the wrapped criterion and the noise layer loss. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `grad_scaler` and `backward_wrapper` are both specified. |
| `ValueError` | If `alpha` is not `None` and is not between 0.0 and 1.0 exclusive. |

Examples:

>>> import torch
>>> from torch import nn
>>> from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
>>> model = nn.Linear(2, 2)
>>> model1 = sg_model.NoisyModel(
...     sg_noise_layer.CloakNoiseLayer1, model, input_shape=(-1, 2)
... )
>>> model2 = sg_model.NoisyModel(
...     sg_noise_layer.CloakNoiseLayer2,
...     model,
...     input_shape=(-1, 2),
...     percent_to_mask=0.42,
... )
>>> criterion = nn.functional.mse_loss
>>> input = torch.rand(2, 2)
>>> labels = torch.randint(0, 2, (2, 2), dtype=torch.float32)

Alpha

>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()

Alphaless

>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()

Alphaless with AMP

>>> import torch.cuda.amp
>>> grad_scaler = torch.cuda.amp.GradScaler()
>>> stainedglass_loss = model1.noise_loss_wrapper(
...     criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(
...     criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()

Changed in version 0.76.1: Added `composite_loss` key to the returned losses dictionary when specifying `alpha=None` to maintain a consistent interface between alpha and alphaless training.
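Putting the pieces together, a minimal training step might look like the sketch below, reusing `model1`, `criterion`, `input`, and `labels` from the examples above. The optimizer choice and the fixed-batch loop are assumptions for illustration, not requirements of the API.

```python
import torch

optimizer = torch.optim.Adam(model1.parameters(), lr=1e-3)
stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=0.8)

for _ in range(10):  # toy loop over a single fixed batch
    optimizer.zero_grad()
    losses = stainedglass_loss(model1(input), labels)
    losses["composite_loss"].backward()
    optimizer.step()
```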

NoisyModelDataParallel

Bases: DataParallel, Generic[M, NLP, NL]

Implements multi-GPU support for NoisyModel by updating NoisyModel submodule references in the replicated modules.

Access to NoisyModel submodules is granted to the model it wraps by inserting references into the __dict__ objects of certain wrapped model submodules. When the NoisyModel is replicated across multiple GPUs, these references become stale and must be updated to refer to the replicated NoisyModel submodules.
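Usage mirrors torch.nn.DataParallel; a sketch, assuming a NoisyModel constructed as in the examples above and at least two CUDA devices:

```python
# noisy_model and batch are placeholders for a NoisyModel and an input
# tensor on the default CUDA device.
parallel_model = NoisyModelDataParallel(noisy_model, device_ids=[0, 1])
output = parallel_model(batch)  # NoisyModelOutput; noise loss averaged over GPUs
```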

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `module` | `NoisyModel[M, NLP, NL]` | The `NoisyModel` to be parallelized. | required |
| `device_ids` | `Sequence[int \| device] \| None` | The CUDA devices to use (default: all devices). | `None` |
| `output_device` | `int \| device \| None` | Device location of output (default: `device_ids[0]`). | `None` |
| `dim` | `int` | The dimension along which to split the input across the devices. | `0` |

forward

forward(*args: Any, **kwargs: Any) -> noisy_model.NoisyModelOutput[Any]

Aggregate the noise layer loss across all GPUs.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `*args` | `Any` | Variable length argument list. | required |
| `**kwargs` | `Any` | Arbitrary keyword arguments. | required |

Returns:

| Type | Description |
| --- | --- |
| `noisy_model.NoisyModelOutput[Any]` | The `NoisyModelOutput`, with the `noise_layer_loss` field averaged across all GPUs. |

replicate

replicate(module: NoisyModel[M, NLP, NL], device_ids: Sequence[int | device]) -> list[noisy_model.NoisyModel[M, NLP, NL]]

Update the forward hooks to use replicas. This is necessary since the forward hooks are methods bound to the original NoisyModel.

NoisyModelOutput dataclass

Bases: SGModelOutput[T]

The output of NoisyModel.forward().

__init_subclass__

__init_subclass__() -> None

Register subclasses as pytree nodes.

This is necessary to synchronize gradients when using torch.nn.parallel.DistributedDataParallel(static_graph=True) with modules that output ModelOutput subclasses.

See: https://github.com/pytorch/pytorch/issues/106690.

to_tuple

to_tuple() -> tuple[Any, ...]

Convert self to a tuple containing all the attributes/keys that are not None.

Returns:

| Type | Description |
| --- | --- |
| `tuple[Any, ...]` | A tuple of all attributes/keys that are not `None`. |

NoisyTransformerModel

Bases: NoisyModel[PreTrainedModelT, NLP, NL]

Overloads NoisyModel methods to enable adding noise correctly to tensors batched with sequences, specifically Transformers.

config property

Return the config of the base model.

Returns:

| Type | Description |
| --- | --- |
| `PretrainedConfig` | The config of the base model. |

input_shape property

input_shape: tuple[int, ...]

The expected shape of the input to the base model.

target_layer property

target_layer: Module

The base_model layer to which noise is added.

target_parameter property

target_parameter: str | None

The base_model.forward parameter to which noise is added.

target_parameter_index cached property

target_parameter_index: int

The base_model.forward parameter to which noise is added.

forward

forward(*args: Any, **kwargs: Any) -> NoisyModelOutput[Any]

Delegate calls to the base model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `args` | `Any` | Inputs to the base model. | required |
| `kwargs` | `Any` | Keyword arguments to the base model. | required |

Returns:

| Type | Description |
| --- | --- |
| `NoisyModelOutput[Any]` | The result of the underlying model with noise added to the output of the base model's target layer. |

from_pretrained classmethod

from_pretrained(save_directory: str | Path, base_model_directory: str | Path | None = None, **kwargs: Any) -> Self

Load the model from save_pretrained directory, and optionally load the base model from a different directory.

Mirrors the from_pretrained method of the Hugging Face transformers models so as to be compatible with their API calls.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `save_directory` | `str \| Path` | The path to the saved model. | required |
| `base_model_directory` | `str \| Path \| None` | The path to the saved base model, if not the same as `save_directory`. | `None` |
| `**kwargs` | `Any` | Keyword arguments to pass to the base model's `from_pretrained` method. | required |

Returns:

| Type | Description |
| --- | --- |
| `Self` | The loaded model. |
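A usage sketch with hypothetical checkpoint directories, assuming the model was previously saved with save_pretrained() (documented below):

```python
# "checkpoints/..." paths are placeholders, not real directories.
model = NoisyTransformerModel.from_pretrained(
    "checkpoints/noisy-model",
    base_model_directory="checkpoints/base-model",  # only if saved separately
)
```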

get_extra_state

get_extra_state() -> NoisyTransformerModelExtraState[PreTrainedModelT, noisy_model.NLP, noisy_model.NL]

Return the extra state of the model.

Returns:

| Type | Description |
| --- | --- |
| `NoisyTransformerModelExtraState[PreTrainedModelT, noisy_model.NLP, noisy_model.NL]` | The extra state of the model. |

gradient_checkpointing_enable

gradient_checkpointing_enable() -> None

Enable gradient checkpointing on the base model.

noise_loss_wrapper

noise_loss_wrapper(criterion: Callable[Concatenate[T, CriterionP], Tensor | dict[str, Tensor]], alpha: float | None, grad_scaler: GradScaler | None = None, backward_wrapper: BackwardWrapper | None = None) -> Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]

Wrap the given criterion with a criterion that optimizes the noise layer.

This method has two modes:
  1. If alpha is a float between 0.0 and 1.0, the returned criterion interpolates between the original criterion and a noise loss term, with 0.0 devolving to the original criterion and 1.0 devolving to the noise loss term.
  2. If alpha is None, the returned criterion adaptively calculates the noise layer parameter gradient update from the gradients of the original criterion and the noise loss term: it optimizes whichever gradient is larger, using only the components of the larger gradient tensor that are orthogonal to the smaller gradient tensor. The returned loss is the original criterion loss, detached from the graph, since the wrapped criterion calls backward() itself.
Note

criterion must return either a torch.Tensor or a dict of torch.Tensor values that includes the key 'model_loss'.

Note

The noise layer must return a loss tensor in order to optimize the noise layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `criterion` | `Callable[Concatenate[T, CriterionP], Tensor \| dict[str, Tensor]]` | The original loss function. | required |
| `alpha` | `float \| None` | Interpolation factor between the original criterion (0.0) and the noise loss term (1.0). Higher values learn noise more quickly and allow more noise to be added. This hyperparameter depends on the model, task, and loss function; in practice it can range anywhere from 0.0001 to 0.9999, so without prior knowledge you will need to grid search over alpha to find the best value for your model and task. If `None`, the original criterion loss and the noise loss term are optimized adaptively. | required |
| `grad_scaler` | `GradScaler \| None` | A `GradScaler` used to scale the alphaless loss gradients when using automatic mixed precision (AMP). | `None` |
| `backward_wrapper` | `BackwardWrapper \| None` | A wrapper that manages `backward()` and gradient scaling, e.g. `accelerate.Accelerator` or `lightning.fabric.fabric.Fabric`, used to scale the alphaless loss gradients when using automatic mixed precision (AMP). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `Callable[Concatenate[NoisyModelOutput[T], CriterionP], dict[str, torch.Tensor]]` | A criterion that optimizes the noise layer using the wrapped criterion and the noise layer loss. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `grad_scaler` and `backward_wrapper` are both specified. |
| `ValueError` | If `alpha` is not `None` and is not between 0.0 and 1.0 exclusive. |

Examples:

>>> import torch
>>> from torch import nn
>>> from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
>>> model = nn.Linear(2, 2)
>>> model1 = sg_model.NoisyModel(
...     sg_noise_layer.CloakNoiseLayer1, model, input_shape=(-1, 2)
... )
>>> model2 = sg_model.NoisyModel(
...     sg_noise_layer.CloakNoiseLayer2,
...     model,
...     input_shape=(-1, 2),
...     percent_to_mask=0.42,
... )
>>> criterion = nn.functional.mse_loss
>>> input = torch.rand(2, 2)
>>> labels = torch.randint(0, 2, (2, 2), dtype=torch.float32)

Alpha

>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=0.8)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'noise_loss': tensor(...), 'composite_loss': tensor(...)}
>>> losses["composite_loss"].backward()

Alphaless

>>> stainedglass_loss = model1.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(criterion, alpha=None)
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()

Alphaless with AMP

>>> import torch.cuda.amp
>>> grad_scaler = torch.cuda.amp.GradScaler()
>>> stainedglass_loss = model1.noise_loss_wrapper(
...     criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model1(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...), 'alpha (std_estimator.module.weight)': tensor(...), 'scaling factor (std_estimator.module.weight)': tensor(...)}
>>> losses["composite_loss"].backward()
>>> stainedglass_loss = model2.noise_loss_wrapper(
...     criterion, alpha=None, grad_scaler=grad_scaler
... )
>>> losses = stainedglass_loss(model2(input), labels)
>>> losses
{'model_loss': tensor(...), 'composite_loss': tensor(...), 'noise_loss': tensor(...)}
>>> losses["composite_loss"].backward()

Changed in version 0.76.1: Added `composite_loss` key to the returned losses dictionary when specifying `alpha=None` to maintain a consistent interface between alpha and alphaless training.

save_pretrained

save_pretrained(save_directory: str | Path, only_noise_layer: bool = False, **kwargs: Any) -> None

Save the model to a directory.

Mirrors the save_pretrained method of the Hugging Face transformers models so as to be compatible with their API calls.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `save_directory` | `str \| Path` | The directory to save the model to. | required |
| `only_noise_layer` | `bool` | Whether to save only the noise layer, or the base model as well. | `False` |
| `**kwargs` | `Any` | Keyword arguments to pass to the base model's `save_pretrained` method. | required |
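A usage sketch with a hypothetical directory, saving only the noise layer:

```python
# "checkpoints/noisy-model" is a placeholder path; pair with
# from_pretrained() above to reload the saved model.
model.save_pretrained("checkpoints/noisy-model", only_noise_layer=True)
```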

set_extra_state

set_extra_state(state: NoisyTransformerModelExtraState[PreTrainedModelT, NLP, NL]) -> None

Set the extra state contained in the loaded state_dict.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `state` | `NoisyTransformerModelExtraState[PreTrainedModelT, NLP, NL]` | The extra state, returned by `get_extra_state`. | required |
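These two methods are invoked automatically by the standard PyTorch state_dict machinery, so the extra state travels with an ordinary checkpoint round trip; a sketch, where `fresh_model` is a placeholder for an identically constructed NoisyTransformerModel:

```python
state = model.state_dict()          # calls get_extra_state() internally
fresh_model.load_state_dict(state)  # calls set_extra_state() internally
```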

SGModel

Bases: Module, Generic[M]

Base class for all stained glass models.

input_shape property

input_shape: tuple[int, ...]

The expected shape of the input to the base model.

__init__

__init__(base_model: M, input_shape: tuple[int, ...]) -> None

Initialize the model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `base_model` | `M` | The base model. | required |
| `input_shape` | `tuple[int, ...]` | The expected shape of the input to the base model. | required |

forward

forward(*args: Any, **kwargs: Any) -> SGModelOutput[Any]

Delegate calls to the base model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `args` | `Any` | Inputs to the base model. | required |
| `kwargs` | `Dict[str, Any]` | Keyword arguments to the base model. | required |

Returns:

| Type | Description |
| --- | --- |
| `SGModelOutput[Any]` | The result of the underlying model with noise added to the output of the base model's target layer. |

SGModelOutput dataclass

Bases: ModelOutput, Generic[T]

The output of SGModel.forward().

__init_subclass__

__init_subclass__() -> None

Register subclasses as pytree nodes.

This is necessary to synchronize gradients when using torch.nn.parallel.DistributedDataParallel(static_graph=True) with modules that output ModelOutput subclasses.

See: https://github.com/pytorch/pytorch/issues/106690.

to_tuple

to_tuple() -> tuple[Any, ...]

Convert self to a tuple containing all the attributes/keys that are not None.

Returns:

| Type | Description |
| --- | --- |
| `tuple[Any, ...]` | A tuple of all attributes/keys that are not `None`. |

TruncatedModule

Bases: Module, Generic[ModuleT]

A module that wraps another module and interrupts the forward pass when a specified truncation point is reached.

Truncation works by temporarily adding a hook to the truncation point that raises a TruncationExecutionFinished exception; the TruncatedModule forward catches this exception and returns the output of the truncation point.

Examples:

Instantiating a TruncatedModule with a binary classification model and a truncation point:

>>> import torch
>>> model = torch.nn.Sequential(
...     torch.nn.Linear(10, 20),
...     torch.nn.ReLU(),
...     torch.nn.Linear(20, 30),
...     torch.nn.ReLU(),
...     torch.nn.Linear(30, 40),
...     torch.nn.ReLU(),
...     torch.nn.Linear(40, 2),
... )
>>> truncation_layer = model[1]
>>> truncated_model = TruncatedModule(model, truncation_layer)

Using the TruncatedModule to get the output of the truncation point:

>>> input = torch.randn(1, 10)
>>> output = truncated_model(input)
>>> # Note that shape of the output has the output_shape of the truncation point, not the full model
>>> assert output.shape == (1, 20)

The base model of the TruncatedModule is completely unaffected by the truncation:

>>> base_output = model(input)
>>> assert base_output.shape == (1, 2)  # Binary classification output shape

The base model is also accessible directly through the module attribute of the TruncatedModule:

>>> base_output = truncated_model.module(input)
>>> assert base_output.shape == (1, 2)  # Binary classification output shape

Added in version 0.59.0.

__init__

__init__(module: ModuleT, truncation_point: Module) -> None

Initialize the TruncatedModule with the provided module and truncation point.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `module` | `ModuleT` | The module to wrap. | required |
| `truncation_point` | `Module` | The submodule of the provided module at which to interrupt the forward pass. | required |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the truncation point is not a submodule of the provided module. |

forward

forward(*args: Any, **kwargs: Any) -> Any

Forward pass of the TruncatedModule that interrupts the forward pass when the truncation point is reached.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `*args` | `Any` | The positional arguments to pass to the wrapped module. | required |
| `**kwargs` | `Any` | The keyword arguments to pass to the wrapped module. | required |

Returns:

| Type | Description |
| --- | --- |
| `Any` | The output of the truncation point submodule. |

Raises:

| Type | Description |
| --- | --- |
| `HookNotCalledError` | If the truncation hook is not called, meaning the truncation point was not reached. |

lazy_register_truncation_hook

lazy_register_truncation_hook() -> _HandlerWrapper

Create a hook that will be added to the truncation point to interrupt the forward pass when the truncation point is reached.

Returns:

| Type | Description |
| --- | --- |
| `_HandlerWrapper` | A handler wrapper that contains the hook that was added to the truncation point. |

truncation_hook staticmethod

truncation_hook(truncation_point: Module, args: Any, output: Tensor) -> NoReturn

Intercept the output of the truncation point and raise a TruncationExecutionFinished exception containing that output.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `truncation_point` | `Module` | The truncation point submodule. Unused. | required |
| `args` | `Any` | The arguments passed to the truncation point. Unused. | required |
| `output` | `Tensor` | The output of the truncation point. This is the output that will be returned by the TruncatedModule. | required |

Raises:

| Type | Description |
| --- | --- |
| `TruncationExecutionFinished` | Always, in order to interrupt the wrapped model's forward method. |
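The same interrupt-and-catch pattern can be reproduced with plain PyTorch hooks; a minimal sketch, with a hypothetical StopForward exception standing in for TruncationExecutionFinished:

```python
import torch
from torch import nn

class StopForward(Exception):
    """Hypothetical stand-in for TruncationExecutionFinished."""
    def __init__(self, output):
        self.output = output

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
truncation_point = model[1]

def truncation_hook(module, args, output):
    # Intercept the truncation point's output and abort the forward pass.
    raise StopForward(output)

handle = truncation_point.register_forward_hook(truncation_hook)
try:
    model(torch.randn(1, 10))
except StopForward as e:
    truncated_output = e.output  # shape (1, 20), not the full model's (1, 2)
finally:
    handle.remove()  # leave the base model unaffected
```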