transformer_cloak

Module for Transformer Cloak noise layers.

Classes:

| Name | Description |
| --- | --- |
| `TransformerCloak` | Applies a stochastic transformation to a causal language model embedding Tensor using `TransformerBlockEstimator`. |
| `TransformerBlockEstimator` | Estimates components of sequence-dependent noise using a single-layer transformer model. |

TransformerBlockEstimator

Bases: Module, Generic[TransformerT]

Estimates components of sequence-dependent noise using a single-layer transformer model.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `transformer_type` | `type[TransformerT]` | The type of transformer model to build a single-layer estimator of, e.g. `transformers.LlamaModel` or `transformers.MistralModel`. | *required* |
| `config` | `PretrainedConfig` | Transformers config. | *required* |
| `initial` | `float` | Initial value of the final `Linear` bias. | `0.0` |
| `dropout` | `float` | Dropout probability of the transformer model output. | `0.1` |
| `initialization_scale` | `float` | The scale factor to multiply the initial values of `linear.weight` by. | `FIVE_PERCENT` |
| `dtype` | `dtype` | The torch dtype to initialize the transformer block estimator with. | `float32` |
| `num_hidden_layers` | `int` | The number of hidden layers to use in the transformer of the `TransformerBlockEstimator`. | `1` |
| `attn_implementation` | `SupportedAttentionImplementationsType` | The attention implementation to use. Supported values are `"transformers_default"`, `"sdpa"`, `"flash_attention_2"`, `"flex_attention"`, and `None`. If `None`, the flex and flash attention implementations are attempted, with sdpa as a fallback. If `"transformers_default"` is specified, the attention implementation defined by the transformer config is used. | `None` |
| `trust_remote_code` | `bool` | Whether to trust remote code when loading from the Hugging Face Hub. | `False` |

Changed in version v1.3.0: Added support for multilayer estimators.

Changed in version v1.13.0: Added support for SGT4Text explicitly setting attention implementation.

Changed in version v3.1.0: Added `trust_remote_code` parameter to allow deserialization of third party tokenizers and models.
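The fallback behavior described for `attn_implementation` can be sketched as follows. This is an illustrative stand-in, not the library's actual selection code: `resolve_attention` and `is_available` are hypothetical names, and the order in which flash and flex attention are tried is an assumption.

```python
def resolve_attention(requested, config_impl, is_available):
    """Sketch of the documented attn_implementation resolution rules."""
    if requested == "transformers_default":
        # Use whatever the transformers config specifies.
        return config_impl
    if requested is not None:
        # An explicit implementation is used as given.
        return requested
    # requested is None: try the preferred implementations, fall back to sdpa.
    for candidate in ("flash_attention_2", "flex_attention"):
        if is_available(candidate):
            return candidate
    return "sdpa"
```

The key property is that `None` never fails outright: `"sdpa"` is always available as the final fallback.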

Methods:

| Name | Description |
| --- | --- |
| `forward` | Compose the transformer block with a dropout and a linear adapter layer. |
| `reset_parameters` | Reinitialize parameters and buffers. |
| `tensor_parallel` | Tensor parallelize the model across the given device mesh. |

forward

forward(*args: Any, **kwargs: Any) -> torch.Tensor

Compose the transformer block with a dropout and a linear adapter layer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `*args` | `Any` | Positional arguments to the transformer model. | *required* |
| `**kwargs` | `Any` | Keyword arguments to the transformer model. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `torch.Tensor` | The output of the transformer parameter model. |
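Schematically, `forward` composes the wrapped transformer block with dropout and a linear adapter. The stdlib sketch below uses scalar stand-ins for tensors to show that composition; the default `weight=0.05` echoes the `FIVE_PERCENT` initialization scale but is otherwise an assumption.

```python
import random

def linear(x, weight, bias):
    # Stand-in for the final Linear adapter layer.
    return weight * x + bias

def dropout(x, p, rng, training=True):
    # Inverted dropout: zero with probability p, rescale otherwise.
    if not training or p == 0.0:
        return x
    return 0.0 if rng.random() < p else x / (1.0 - p)

def estimator_forward(transformer_output, p=0.1, weight=0.05, bias=0.0, rng=None):
    # forward(...) is schematically linear(dropout(transformer(...))).
    rng = rng or random.Random(0)
    return linear(dropout(transformer_output, p, rng), weight, bias)
```

With `p=0.0` the dropout is the identity, so the output is just the affine adapter applied to the transformer output.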

reset_parameters

reset_parameters() -> None

Reinitialize parameters and buffers.

This method is useful for initializing tensors created on the meta device.

tensor_parallel

tensor_parallel(mesh: DeviceMesh) -> None

Tensor parallelize the model across the given device mesh.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `mesh` | `DeviceMesh` | The tensor parallel device mesh. | *required* |

Raises:

| Type | Description |
| --- | --- |
| `NotImplementedError` | If the transformer does not support tensor parallelism. |

TransformerCloak

Bases: BaseNoiseLayer[TransformerBlockEstimator[TransformerT], CloakStandardDeviationParameterization | DirectStandardDeviationParameterization, PercentMasker]

Applies a stochastic transformation to a causal language model embedding Tensor using TransformerBlockEstimator, with standard deviations parameterized by either CloakStandardDeviationParameterization or DirectStandardDeviationParameterization, and optional standard deviation-based input masking using PercentMasker.
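To illustrate the `percent_to_mask` idea, the sketch below masks a fixed fraction of positions based on their sampled standard deviations. The actual selection rule belongs to `PercentMasker`; ranking by largest standard deviation here is an assumption made for illustration only.

```python
def percent_mask(stds, percent_to_mask):
    """Hypothetical sketch: mark the `percent_to_mask` fraction of
    positions with the largest standard deviations for masking."""
    k = int(round(len(stds) * percent_to_mask))
    # Indices sorted by descending standard deviation.
    ranked = sorted(range(len(stds)), key=lambda i: stds[i], reverse=True)
    masked = set(ranked[:k])
    return [i in masked for i in range(len(stds))]
```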

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `scale` | `tuple[float, float]` | The range of standard deviations of the noise. | *required* |
| `transformer_type` | `type[TransformerT]` | The type of the transformer to build a single-layer estimator from. | *required* |
| `config` | `PretrainedConfig \| str \| None` | A `PretrainedConfig` or a filepath to one that can be loaded via `PretrainedConfig.from_pretrained`. | `None` |
| `config_path` | `str \| None` | A filepath that can be loaded via `PretrainedConfig.from_pretrained`. Deprecated: use `config` instead, which also accepts a filepath. This argument will be removed in a future version. | `None` |
| `percent_to_mask` | `float \| None` | The percentage of the input to mask. | `None` |
| `shallow` | `float` | A fixed temperature-like parameter which alters the scale of the standard deviation of the noise. | `1.0` |
| `seed` | `int \| None` | Seed for the random number generator used to generate noise. | `None` |
| `rho_init` | `float` | Initial value for rhos. | `-3.0` |
| `std_dropout` | `float` | Dropout ratio for the std parameter model. | `0.0` |
| `mean_dropout` | `float` | Dropout ratio for the mean parameter model. | `0.0` |
| `directly_learn_stds` | `bool` | Whether the estimator learns rhos (values in R) or standard deviations directly (values in R^+). | `False` |
| `noise_layer_dtype` | `dtype \| None` | The dtype of the noise layer. | `None` |
| `num_hidden_layers` | `int` | The number of hidden layers to use in the transformer model of the `TransformerBlockEstimator`. | `1` |
| `noise_layer_attention` | `SupportedAttentionImplementationsType` | The attention implementation to use. Supported values are `"transformers_default"`, `"sdpa"`, `"flash_attention_2"`, `"flex_attention"`, and `None`. If `None`, the flex and flash attention implementations are attempted, with sdpa as a fallback. If `"transformers_default"` is specified, the attention implementation defined by the transformer config is used. | `None` |
| `trust_remote_code` | `bool` | Whether to trust remote code when loading from the Hugging Face Hub. | `False` |
| `kwargs` | `Any` | Keyword arguments used to define the transformer parameter models. Ignored if `config` is an initialized transformers config. | *required* |
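`directly_learn_stds` distinguishes learning rhos (unconstrained reals) from learning standard deviations (positive reals). A common mapping from R to R^+ is softplus; whether `CloakStandardDeviationParameterization` uses softplus specifically is an assumption here, shown only to make the rho/std distinction concrete.

```python
import math

def softplus(rho):
    # Maps any real rho to a strictly positive value: log(1 + e^rho).
    return math.log1p(math.exp(rho))

# With directly_learn_stds=False, the estimator outputs rhos and a
# parameterization maps them into R^+; with directly_learn_stds=True,
# the estimator's outputs are treated as standard deviations directly.
std_from_rho = softplus(-3.0)  # the rho_init default of -3.0 gives a small std
```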

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `shallow` is not `1.0` when `directly_learn_stds` is `True`. |
| `ValueError` | If `rho_init` is not `0.0` when `directly_learn_stds` is `True`. |
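The two `ValueError` conditions can be expressed directly. This is a sketch of the documented constraints, not the library's actual validation code; the function name is hypothetical.

```python
def validate_direct_std_args(shallow, rho_init, directly_learn_stds):
    """Mirror the documented TransformerCloak constructor constraints."""
    if directly_learn_stds and shallow != 1.0:
        raise ValueError("shallow must be 1.0 when directly_learn_stds is True")
    if directly_learn_stds and rho_init != 0.0:
        raise ValueError("rho_init must be 0.0 when directly_learn_stds is True")
```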

Methods:

| Name | Description |
| --- | --- |
| `__call__` | Transform the input data. |
| `__getstate__` | Prepare a JSON-serializable copy of the noise layer's state. |
| `__init__` | |
| `__setstate__` | Set the state of the object. |
| `forward` | Transform the input data. |
| `get_applied_transform_components_factory` | Create a function that returns the elements of the transform components ('mean' and 'std') applied during the most recent forward pass. |
| `get_transformed_output_factory` | Create a function that returns the transformed output from the most recent forward pass. |
| `initial_seed` | Return the initial seed of the CPU device's random number generator. |
| `manual_seed` | Seed each of the random number generators. |
| `reset_parameters` | Reinitialize parameters and buffers. |
| `seed` | Seed each of the random number generators using a non-deterministic random number. |
| `tensor_parallel` | Tensor parallelize the model across the given device mesh. |

__call__

__call__(
    input: Tensor,
    noise_mask: Tensor | None = None,
    **kwargs: Any
) -> torch.Tensor

Transform the input data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input` | `Tensor` | The input to transform. | *required* |
| `noise_mask` | `Tensor \| None` | An optional mask that selects the elements of `input` to transform. Where the mask is `False`, the original input value is returned. Also used to select the elements of the sampled standard deviations to use to mask the input. If `None`, the entire input is transformed. | `None` |
| `**kwargs` | `Any` | Additional keyword arguments to the estimator modules. | *required* |
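The `noise_mask` semantics (where the mask is `False`, the original input value passes through unchanged) can be illustrated with plain Python lists standing in for tensors:

```python
def apply_with_mask(inputs, transformed, noise_mask=None):
    """Keep the original value wherever the mask is False;
    a mask of None transforms the entire input."""
    if noise_mask is None:
        return list(transformed)
    return [t if m else x for x, t, m in zip(inputs, transformed, noise_mask)]
```

This is the same element-selection behavior one would get from `torch.where(noise_mask, transformed, input)` on real tensors.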

__getstate__

__getstate__() -> dict[str, Any]

Prepare a JSON-serializable copy of the noise layer's state.

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Any]` | A dictionary containing the configuration of the noise layer, including its type string, the state dict, and the generator states if they exist. |

Changed in version v1.13.0: Added support for SGT4Text explicitly setting attention implementation.

Changed in version v3.15.0: Added serialization support for all noise layers.
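A sketch of the JSON round-trip this enables. The exact key names (`"type"`, `"state_dict"`, `"_generators"`) and value shapes are assumptions inferred from the description above, not the library's actual serialization format.

```python
import json

# Hypothetical shape of the serialized state described above.
state = {
    "type": "TransformerCloak",            # the noise layer's type string
    "state_dict": {"linear.bias": [0.0]},  # tensors flattened to lists
    "_generators": {"cpu": 42},            # generator states, if they exist
}

# JSON-serializable means a lossless dumps/loads round trip.
restored = json.loads(json.dumps(state))
```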

__init__

__init__(
    scale: tuple[float, float],
    transformer_type: type[TransformerT],
    config: PretrainedConfig | str | None = None,
    config_path: str | None = None,
    percent_to_mask: float | None = None,
    shallow: float = 1.0,
    seed: int | None = None,
    rho_init: float = -3.0,
    std_dropout: float = 0.0,
    mean_dropout: float = 0.0,
    directly_learn_stds: bool = False,
    noise_layer_dtype: dtype | None = None,
    num_hidden_layers: int = 1,
    noise_layer_attention: SupportedAttentionImplementationsType = None,
    trust_remote_code: bool = False,
    **kwargs: Any
) -> None

Changed in version v1.3.0: Added support for multilayer estimators.

Changed in version v1.13.0: Added support for SGT4Text explicitly setting attention implementation.

Changed in version v3.1.0: Added `trust_remote_code` parameter to allow deserialization of third party tokenizers and models.

__setstate__

__setstate__(
    state: dict[str, Any],
    trust_remote_code: bool = False,
    third_party_model_path: (
        str | PathLike[str] | None
    ) = None,
) -> None

Set the state of the object.

state_dict and _generators are both optional keys, and will be restored if they exist in the state.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `state` | `dict[str, Any]` | The state to set. | *required* |
| `trust_remote_code` | `bool` | Whether to trust remote code when loading from the Hugging Face Hub. | `False` |
| `third_party_model_path` | `str \| PathLike[str] \| None` | The path or Hugging Face reference to a third-party model to load. This is useful when loading SGTs whose internal structure depends on transformers which are not importable directly through `transformers`, but are present on the Hugging Face Hub. | `None` |

Changed in version v3.15.0: Added serialization support for all noise layers.

forward

forward(
    input: Tensor,
    noise_mask: Tensor | None = None,
    **kwargs: Any
) -> torch.Tensor

Transform the input data.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input` | `Tensor` | The input to transform. | *required* |
| `noise_mask` | `Tensor \| None` | A mask that selects the elements of `input` to transform. Where the mask is `False`, the original input value is returned. Also used to select the elements of the sampled standard deviations to use to mask the input. | `None` |
| `**kwargs` | `Any` | Additional keyword arguments to the estimator modules. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `torch.Tensor` | The transformed input data. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `noise_mask` is `None`. |

get_applied_transform_components_factory

get_applied_transform_components_factory() -> (
    Callable[[], dict[str, torch.Tensor]]
)

Create a function that returns the elements of the transform components ('mean' and 'std') applied during the most recent forward pass.

Specifically, the applied elements are those selected by the noise mask (if supplied) and standard deviation mask (if std_estimator.masker is not None). If no masks are used, all elements are returned.

The applied transform components are returned flattened.

This function is intended to be used to log histograms of the transform components.

Returns:

| Type | Description |
| --- | --- |
| `Callable[[], dict[str, torch.Tensor]]` | A function that returns the elements of the transform components applied during the most recent forward pass. |

Examples:

>>> import torch
>>> from torch import nn
>>> from stainedglass_core import model as sg_model, noise_layer as sg_noise_layer
>>> base_model = nn.Linear(20, 2)
>>> noisy_model = sg_model.NoisyModel(
...     sg_noise_layer.CloakNoiseLayer1,
...     base_model,
...     target_parameter="input",
... )
>>> get_applied_transform_components = (
...     noisy_model.noise_layer.get_applied_transform_components_factory()
... )
>>> input = torch.ones(1, 20)
>>> noise_mask = torch.tensor(5 * [False] + 15 * [True])
>>> output = noisy_model(input, noise_mask=noise_mask)
>>> applied_transform_components = get_applied_transform_components()
>>> applied_transform_components
{'mean': tensor(...), 'std': tensor(...)}
>>> {
...     component_name: component.shape
...     for component_name, component in applied_transform_components.items()
... }
{'mean': torch.Size([15]), 'std': torch.Size([15])}

get_transformed_output_factory

get_transformed_output_factory() -> (
    Callable[[], torch.Tensor]
)

Create a function that returns the transformed output from the most recent forward pass.

If super batching is active, only the transformed half of the super batch output is returned.

Returns:

| Type | Description |
| --- | --- |
| `Callable[[], torch.Tensor]` | A function that returns the transformed output from the most recent forward pass. |

Examples:

>>> import torch
>>> from stainedglass_core import noise_layer as sg_noise_layer
>>> noise_layer = sg_noise_layer.CloakNoiseLayer1()
>>> get_transformed_output = noise_layer.get_transformed_output_factory()
>>> input = torch.ones(2, 3, 32, 32)
>>> output = noise_layer(input)
>>> transformed_output = get_transformed_output()
>>> assert output.equal(transformed_output)

initial_seed

initial_seed() -> int

Return the initial seed of the CPU device's random number generator.

manual_seed

manual_seed(
    seed: int | None, rank_dependent: bool = True
) -> None

Seed each of the random number generators.

Setting seed to None will destroy any existing generators.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `seed` | `int \| None` | The seed to set. | *required* |
| `rank_dependent` | `bool` | Whether to add the distributed rank to the seed to ensure that each process samples different noise. | `True` |
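The `rank_dependent` behavior can be sketched with stdlib generators: adding the distributed rank to the seed keeps per-process noise distinct while remaining reproducible. The function name is hypothetical; the real layer manages `torch.Generator` objects rather than `random.Random`.

```python
import random

def make_generator(seed, rank, rank_dependent=True):
    """Sketch of manual_seed semantics with a stdlib generator."""
    if seed is None:
        # Setting seed to None destroys any existing generators.
        return None
    # Offsetting by rank gives each process a different noise stream.
    effective = seed + rank if rank_dependent else seed
    return random.Random(effective)
```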

reset_parameters

reset_parameters() -> None

Reinitialize parameters and buffers.

This method is useful for initializing tensors created on the meta device.

seed

seed() -> None

Seed each of the random number generators using a non-deterministic random number.

tensor_parallel

tensor_parallel(mesh: DeviceMesh) -> None

Tensor parallelize the model across the given device mesh.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `mesh` | `DeviceMesh` | The tensor parallel device mesh. | *required* |

transformer_parameter_model

transformer_parameter_model(
    transformer_type: type[TransformerT],
    config: PretrainedConfig,
    num_hidden_layers: int = 1,
    attn_implementation: SupportedAttentionImplementationsType = None,
    trust_remote_code: bool = False,
) -> TransformerT

Create a single block of a `transformers.PreTrainedModel` and load the weights from the parameter path.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `transformer_type` | `type[TransformerT]` | The type of the transformer to use to construct the transformer parameter model. | *required* |
| `config` | `PretrainedConfig` | Transformer config. | *required* |
| `num_hidden_layers` | `int` | The number of hidden layers to use in the transformer model. | `1` |
| `attn_implementation` | `SupportedAttentionImplementationsType` | The attention implementation to use. Supported values are `"transformers_default"`, `"sdpa"`, `"flash_attention_2"`, `"flex_attention"`, and `None`. If `None`, the flex and flash attention implementations are attempted, with sdpa as a fallback. If `"transformers_default"` is specified, the attention implementation defined by the transformer config is used. | `None` |
| `trust_remote_code` | `bool` | Whether to trust remote code when loading from the Hugging Face Hub. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `TransformerT` | A transformer that can be used to estimate rhos/locs. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the attention implementation is not supported. |
| `TypeError` | If the transformer type does not match the loaded config. |

Changed in version v1.3.0: Added support for multilayer estimators.

Changed in version v1.10.0: Remove non-causal mask support via `use_causal_mask`.

Changed in version v1.13.0: Added support for SGT4Text explicitly setting attention implementation.

Changed in version v2.22.0: Removed deepspeed mixture of experts support from transformer cloak.

Changed in version v3.1.0: Added `trust_remote_code` parameter to allow deserialization of third party tokenizers and models.
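Building a "single block" of a pretrained model typically amounts to copying the config with `num_hidden_layers` reduced before instantiating the model. A stdlib sketch of that pattern; the dataclass is a stand-in for a `PretrainedConfig`, and the helper name is hypothetical:

```python
from dataclasses import dataclass, replace

@dataclass
class ToyConfig:
    # Minimal stand-in for a transformers PretrainedConfig.
    hidden_size: int = 64
    num_hidden_layers: int = 32

def single_block_config(config, num_hidden_layers=1):
    """Copy the config, shrinking only the layer count."""
    return replace(config, num_hidden_layers=num_hidden_layers)
```

The original config is left untouched; the estimator is built from the shrunken copy so every other hyperparameter (hidden size, attention heads, etc.) still matches the full model.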