noisy_transformer_masking_model

Module for a noisy transformer model with masking capabilities.

Classes:

Name	Description
`NoiseMaskedNoisyTransformerModel`	A `NoisyModel` that adds noise to a portion of the inputs, excluding any special tokens.

NoiseMaskedNoisyTransformerModel ¶

Bases: NoisyModel[CausalModelT, NoiseLayerP, NoiseLayerT_co]

A NoisyModel that adds noise to a portion of the inputs, excluding any special tokens.

Parameters:

Name	Type	Description	Default
`noise_layer_class` ¶	`Callable[NoiseLayerP, NoiseLayerT_co]`	The type of noise that is added to the given model.	required
`base_model` ¶	`CausalModelT`	The model to add noise to.	required
`target_layer` ¶	`str \| None`	Name of the layer to whose output noise is added. A submodule of the model may be specified by its `.`-delimited name, e.g. `features.0.conv.1.2`. If `None`, noise is added to the model input.	`None`
`target_parameter` ¶	`str \| None`	If the target layer is the input, the name of the keyword parameter to which noise is added. If `None`, noise is added to the first positional parameter of the model's forward method.	`None`
`truncated_layer_index` ¶	`int \| None`	The layer index to truncate the model at.	`None`
`*args` ¶	`args`	Positional arguments to the `noise_layer_class`.	`()`
`**kwargs` ¶	`kwargs`	Keyword arguments to the `noise_layer_class`.	`{}`

Methods:

Name	Description
`__getstate__`	Serialize the model to a dictionary.
`__setstate__`	Deserialize the model from a dictionary.
`distillation_context`	Prepare the base model to facilitate distillation training by applying losses over the transformed and non-transformed activations.
`forward`	Call the `base_model`, applying the `noise_layer` to the `target_parameter` or `target_layer` output.
`generate`	Generate sequences of token ids using transformed embeddings.
`reconstruct_ids_from_embeddings`	Reconstruct token ids from embeddings using L2 similarity search on the input embedding layer.
`reset_parameters`	Reinitialize parameters and buffers.
`restore_and_load`	Restore the final decoder layers and final normalization layer and move them back to their original devices.
`sample_transformed_embeddings`	Sample transformed embeddings for the given input token ids.
`truncate_and_offload`	Remove the decoder layers after `truncated_layer_index` and the final normalization layer from the model and move them to the CPU.

Attributes:

Name	Type	Description
`input_embeddings`	`Embedding`	A copy of input embeddings.
`is_truncated_and_offloaded`	`bool`	Whether the model decoder layers are currently truncated.
`single_precision_input_embeddings`	`Embedding`	A single-precision copy of input embeddings.
`target_layer`	`Module`	The `base_model` submodule whose output `Tensor` to transform.
`target_parameter`	`str \| None`	The name of the `base_model` input `Tensor` argument to transform when `target_layer` is `None`.
`target_parameter_index`	`int`	The index of the `base_model` input `Tensor` argument to transform when `target_layer` is `None`.

input_embeddings `cached` `property` ¶

input_embeddings: Embedding

A copy of input embeddings.

is_truncated_and_offloaded `property` ¶

is_truncated_and_offloaded: bool

Whether the model decoder layers are currently truncated.

single_precision_input_embeddings `cached` `property` ¶

single_precision_input_embeddings: Embedding

A single-precision copy of input embeddings.

target_layer `property` ¶

target_layer: Module

The base_model submodule whose output Tensor to transform.

Raises:

Type	Description
`ValueError`	If `_target_layer` cannot be found as a submodule of `base_model`.
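
The `.`-delimited naming follows the way PyTorch addresses nested modules. A minimal sketch using plain `torch.nn`, assuming a hypothetical helper `resolve_target_layer` (not part of this library) that turns a failed lookup into a `ValueError`:

```python
import torch.nn as nn


def resolve_target_layer(base_model: nn.Module, name: str) -> nn.Module:
    """Hypothetical helper: look up a submodule by its dot-delimited name."""
    try:
        return base_model.get_submodule(name)
    except AttributeError as error:
        raise ValueError(f"{name!r} is not a submodule of the model") from error


# Toy model; the names below are illustrative, not taken from a real checkpoint.
toy_model = nn.Sequential(
    nn.Sequential(nn.Linear(4, 4), nn.ReLU()),
    nn.Linear(4, 2),
)

# "0.0" means: first child of the outer Sequential, then its first child.
inner_linear = resolve_target_layer(toy_model, "0.0")
```

For a Hugging Face causal language model, the same lookup would be used with a name such as `model.embed_tokens`, as in the examples on this page.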

target_parameter `property` ¶

target_parameter: str | None

The name of the base_model input Tensor argument to transform when target_layer is None.

target_parameter_index `cached` `property` ¶

target_parameter_index: int

The index of the base_model input Tensor argument to transform when target_layer is None.

__getstate__ ¶

__getstate__() -> dict[str, Any]

Serialize the model to a dictionary.

Returns:

Type	Description
`dict[str, Any]`	A dictionary containing the model's state, including the base model, noise layer, and state dict.

__setstate__ ¶

__setstate__(
    state: Mapping[str, Any],
    trust_remote_code: bool = False,
    third_party_model_path: (
        str | PathLike[str] | None
    ) = None,
) -> None

Deserialize the model from a dictionary.

Warning

The `state_dict` key is optional. If it is missing or incomplete, the missing parameters are initialized on the meta device. Keeping this key optional allows the `NoiseMaskedNoisyTransformerModel` parameters to be restored as part of a larger model.

Parameters:

Name	Type	Description	Default
`state` ¶	`Mapping[str, Any]`	A dictionary containing the model's state, including the base model, noise layer, and possibly state dict.	required
`trust_remote_code` ¶	`bool`	Whether to trust remote code when loading from the Hugging Face Hub.	`False`
`third_party_model_path` ¶	`str \| PathLike[str] \| None`	The path or Hugging Face Hub reference to a third-party model to load. This is useful when loading Stained Glass Transforms (SGTs) whose internal structure depends on transformer models that cannot be imported directly through `transformers` but are available on the Hugging Face Hub.	`None`
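
The optional-key behavior described in the warning can be illustrated with the standard pickle protocol. This is a sketch only: `TinyWrapper` and its fields are hypothetical stand-ins, and the real class restores parameters on the meta device rather than leaving a dictionary empty.

```python
import pickle
from typing import Any, Mapping, Optional


class TinyWrapper:
    """Toy stand-in for a model wrapper; not the library's class."""

    def __init__(
        self, name: str, weights: Optional[Mapping[str, float]] = None
    ) -> None:
        self.name = name
        self.weights = dict(weights or {})

    def __getstate__(self) -> dict[str, Any]:
        # Serialize the configuration alongside the weights.
        return {"name": self.name, "state_dict": dict(self.weights)}

    def __setstate__(self, state: Mapping[str, Any]) -> None:
        self.name = state["name"]
        # "state_dict" is optional: a parent object may restore weights later.
        self.weights = dict(state.get("state_dict", {}))


# A full round trip through pickle calls __getstate__ and then __setstate__.
restored = pickle.loads(pickle.dumps(TinyWrapper("demo", {"w": 1.5})))

# Restoring from a state without weights leaves them empty instead of failing.
partial = TinyWrapper.__new__(TinyWrapper)
partial.__setstate__({"name": "demo"})
```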

distillation_context ¶

distillation_context() -> contextlib.ExitStack

Prepare the base model to facilitate distillation training by applying losses over the transformed and non-transformed activations.

Returns:

Type	Description
`contextlib.ExitStack`	A context manager that detaches the hooks when exited.
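
How an `ExitStack` can own several hooks and detach them all on exit is sketched below with the standard library only. The hook names and the `attach_hook` and `hooked_context` helpers are hypothetical; the real method registers hooks on the base model.

```python
import contextlib
from typing import Callable

active_hooks: list[str] = []


def attach_hook(name: str) -> Callable[[], None]:
    """Register a pretend hook and return a function that detaches it."""
    active_hooks.append(name)
    return lambda: active_hooks.remove(name)


def hooked_context() -> contextlib.ExitStack:
    """Hypothetical analogue of a context that owns several hooks."""
    stack = contextlib.ExitStack()
    for name in ("capture_clean", "capture_transformed"):
        # Each detach function runs when the stack is exited.
        stack.callback(attach_hook(name))
    return stack


with hooked_context():
    hooks_inside = list(active_hooks)  # both hooks are attached here
hooks_after = list(active_hooks)  # all hooks are detached here
```

Because the returned object is a context manager, wrapping the training step in a `with` block keeps the hooks attached for exactly that scope.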

forward ¶

forward(
    *args: Any, noise_mask: Tensor, **kwargs: Any
) -> Any

Call the base_model, applying the noise_layer to the target_parameter or target_layer output.

Parameters:

Name	Type	Description	Default
`*args` ¶	`Any`	Positional arguments to `base_model`.	required
`noise_mask` ¶	`Tensor`	A mask that selects the elements of the `target_layer` output to transform. Where the mask is `False`, the original values of the target are used.	required
`**kwargs` ¶	`Any`	Keyword arguments to `base_model`.	required

Returns:

Type	Description
`Any`	The result of `base_model` with the `noise_layer` applied to the `target_parameter` or `target_layer` output.
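
The effect of `noise_mask` can be pictured as an elementwise selection between transformed and original values. A minimal sketch with toy tensors, assuming the selection behaves like `torch.where`; the library's internals may differ.

```python
import torch

batch_size, seq_length, hidden_size = 1, 4, 3
original = torch.zeros(batch_size, seq_length, hidden_size)
transformed = torch.ones(batch_size, seq_length, hidden_size)

# True marks positions to transform; the trailing dimension of size 1
# broadcasts across the hidden dimension.
noise_mask = torch.tensor([[[False], [True], [True], [False]]])

# Where the mask is False, the original values are kept.
mixed = torch.where(noise_mask, transformed, original)
```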

generate ¶

generate(
    inputs: Tensor,
    *args: Any,
    noise_mask: Tensor,
    return_transformed_embeddings: bool = False,
    **kwargs: Any
) -> torch.Tensor | tuple[torch.Tensor, torch.Tensor]

Generate sequences of token ids using transformed embeddings.

Parameters:

Name	Type	Description	Default
`inputs` ¶	`Tensor`	The sequences of input tokens to use as a prompt for generation.	required
`*args` ¶	`Any`	Additional positional arguments to the base model's `generate` method.	required
`noise_mask` ¶	`Tensor`	The mask that selects the elements of `inputs` to transform. Where the mask is `False`, the values of `inputs` are passed through to the base model.	required
`return_transformed_embeddings` ¶	`bool`	Whether to return the transformed embeddings. Transformed embeddings can be used with `stainedglass_core.model.noisy_transformer_masking_model.NoiseMaskedNoisyTransformerModel.reconstruct_ids_from_embeddings` and compared against `inputs` to estimate transformation strength.	`False`
`**kwargs` ¶	`Any`	Additional keyword arguments to the base model's `generate` method.	required

Returns:

Type	Description
`torch.Tensor \| tuple[torch.Tensor, torch.Tensor]`	The generated token ids and optionally the transformed embeddings.

Examples:

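>>> import torch
>>> import transformers
>>> from stainedglass_core import metrics as sg_metrics
>>> from stainedglass_core.huggingface import generation as sg_generation
>>> from stainedglass_core.model.noisy_transformer_masking_model import (
...     NoiseMaskedNoisyTransformerModel,
... )
>>> from stainedglass_core.utils import huggingface as sg_huggingface_utils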
>>> pretrained_model_name_or_path = (
...     "tests/resources/tokenizers/mini-Meta-Llama-3-8B"
... )
>>> tokenizer = transformers.AutoTokenizer.from_pretrained(
...     pretrained_model_name_or_path
... )
>>> config = transformers.AutoConfig.from_pretrained(pretrained_model_name_or_path)
>>> config.dtype = torch.float32
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(
...     pretrained_model_name_or_path,
...     dtype=torch.bfloat16,
... )
>>> noisy_model = NoiseMaskedNoisyTransformerModel(
...     transformer_cloak.TransformerCloak,
...     base_model,
...     target_layer="model.embed_tokens",
...     scale=(1e-8, 1.0),
...     transformer_type=type(
...         sg_huggingface_utils.get_base_model_decoder(base_model)
...     ),
...     config=config,
... )
>>> batch_size, seq_length = 1, 10
>>> input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_length))
>>> attention_mask = torch.hstack(
...     [
...         torch.zeros((batch_size, 2), dtype=torch.bool),
...         torch.ones((batch_size, seq_length - 2), dtype=torch.bool),
...     ]
... )
>>> noise_mask = torch.randint(0, 2, (batch_size, seq_length, 1), dtype=torch.bool)
>>> generation_config = sg_generation.StainedGlassGenerationConfig.from_tokenizer(
...     tokenizer, max_length=seq_length + 1
... )

Generation without Stained Glass Transform:

>>> generated_ids = noisy_model.base_model.generate(
...     inputs=input_ids,
...     generation_config=generation_config,
...     attention_mask=attention_mask,
...     use_cache=True,
... )

Generation with Stained Glass Transform:

>>> generated_ids_from_transformed_embeddings = noisy_model.generate(
...     inputs=input_ids,
...     generation_config=generation_config,
...     attention_mask=attention_mask,
...     use_cache=True,
...     noise_mask=noise_mask,
... )

Decoding the generated ids into text:

>>> generated_text_from_transformed_embeddings = tokenizer.batch_decode(
...     generated_ids_from_transformed_embeddings[:, input_ids.shape[-1] :],
...     skip_special_tokens=True,
... )

Using return_transformed_embeddings=True to compare the reconstructed input ids with the original input ids:

>>> generated_ids_from_transformed_embeddings, transformed_embeddings = (
...     noisy_model.generate(
...         inputs=input_ids,
...         generation_config=generation_config,
...         attention_mask=attention_mask,
...         use_cache=True,
...         noise_mask=noise_mask,
...         return_transformed_embeddings=True,
...     )
... )
>>> reconstructed_input_ids = noisy_model.reconstruct_ids_from_embeddings(
...     transformed_embeddings
... )
>>> reconstructed_input_text = tokenizer.batch_decode(
...     reconstructed_input_ids, skip_special_tokens=True
... )
>>> percentage_changed_input_ids = sg_metrics.percentage_changed_ids(
...     input_ids, reconstructed_input_ids, noise_mask
... )

reconstruct_ids_from_embeddings ¶

reconstruct_ids_from_embeddings(
    embeddings: Tensor,
    max_batch_size: int | None = None,
    max_sequence_length: int | None = None,
    max_num_embeddings: int | None = None,
) -> torch.Tensor

Reconstruct token ids from embeddings using L2 similarity search on the input embedding layer.

Smaller values of max_batch_size, max_sequence_length, and max_num_embeddings require less memory to store the intermediate distance calculations but have longer runtimes.

Parameters:

Name	Type	Description	Default
`embeddings` ¶	`Tensor`	The embeddings of shape (`batch_size`, `sequence_length`, `hidden_size`) to reconstruct.	required
`max_batch_size` ¶	`int \| None`	The maximum number of batch elements over which to calculate distances.	`None`
`max_sequence_length` ¶	`int \| None`	The maximum number of sequence positions over which to calculate distances.	`None`
`max_num_embeddings` ¶	`int \| None`	The maximum number of embeddings over which to calculate distances. The results from each split are recursively merged together.	`None`

Returns:

Type	Description
`torch.Tensor`	The token ids of shape (`batch_size`, `sequence_length`) of the closest embeddings in the input embedding layer to `embeddings`.
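
The trade-off between memory and runtime can be seen in a minimal sketch of nearest-neighbor reconstruction under L2 distance. It assumes a hypothetical function `nearest_token_ids` that splits only the vocabulary and merges the splits with a running minimum; the library's own splitting and merging strategy may differ.

```python
from typing import Optional

import torch


def nearest_token_ids(
    embeddings: torch.Tensor,
    embedding_weight: torch.Tensor,
    max_num_embeddings: Optional[int] = None,
) -> torch.Tensor:
    """Hypothetical sketch: map each embedding to its closest vocabulary row."""
    batch_size, seq_length, hidden_size = embeddings.shape
    flat = embeddings.reshape(-1, hidden_size).float()
    vocab_size = embedding_weight.shape[0]
    chunk = max_num_embeddings or vocab_size

    best_dist = torch.full((flat.shape[0],), float("inf"), device=flat.device)
    best_ids = torch.zeros(flat.shape[0], dtype=torch.long, device=flat.device)
    for start in range(0, vocab_size, chunk):
        block = embedding_weight[start : start + chunk].float()
        # Only a (num_positions, chunk) distance matrix is held at a time.
        block_dist, block_idx = torch.cdist(flat, block).min(dim=1)
        better = block_dist < best_dist
        best_dist = torch.where(better, block_dist, best_dist)
        best_ids = torch.where(better, block_idx + start, best_ids)
    return best_ids.reshape(batch_size, seq_length)


torch.manual_seed(0)
weight = torch.randn(50, 8)  # toy embedding table: 50 tokens, hidden size 8
ids = torch.randint(0, 50, (2, 5))
recovered = nearest_token_ids(weight[ids], weight, max_num_embeddings=16)
```

Untransformed embeddings map back to their own token ids; transformed embeddings may map to different ids, which is what makes the comparison against the original inputs informative.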

Examples:

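>>> import torch
>>> import transformers
>>> from stainedglass_core.model.noisy_transformer_masking_model import (
...     NoiseMaskedNoisyTransformerModel,
... )
>>> from stainedglass_core.utils import huggingface as sg_huggingface_utils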
>>> pretrained_model_name_or_path = (
...     "tests/resources/tokenizers/mini-Meta-Llama-3-8B"
... )
>>> config = transformers.AutoConfig.from_pretrained(pretrained_model_name_or_path)
>>> config.dtype = torch.float32
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(
...     pretrained_model_name_or_path,
...     dtype=torch.bfloat16,
... )
>>> noisy_model = NoiseMaskedNoisyTransformerModel(
...     transformer_cloak.TransformerCloak,
...     base_model,
...     target_layer="model.embed_tokens",
...     scale=(1e-8, 1.0),
...     transformer_type=type(
...         sg_huggingface_utils.get_base_model_decoder(base_model)
...     ),
...     config=config,
... )
>>> batch_size, seq_length = 1, 10
>>> input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_length))
>>> attention_mask = torch.hstack(
...     [
...         torch.zeros((batch_size, 2), dtype=torch.bool),
...         torch.ones((batch_size, seq_length - 2), dtype=torch.bool),
...     ]
... )
>>> noise_mask = torch.randint(0, 2, (batch_size, seq_length, 1), dtype=torch.bool)

>>> transformed_embeddings = noisy_model.sample_transformed_embeddings(
...     input_ids, noise_mask, attention_mask=attention_mask, use_cache=True
... )
>>> reconstructed_ids = noisy_model.reconstruct_ids_from_embeddings(
...     transformed_embeddings
... )

reset_parameters ¶

reset_parameters() -> None

Reinitialize parameters and buffers.

This method is useful for initializing tensors created on the meta device.
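
The meta-device workflow this refers to looks roughly as follows in plain PyTorch. The sketch uses a standalone `torch.nn.Linear` rather than this class, so it only illustrates the general pattern.

```python
import torch

# Parameters on the meta device have shapes and dtypes but no storage.
layer = torch.nn.Linear(4, 2, device="meta")
was_meta = layer.weight.is_meta

# Allocate uninitialized storage on a real device, then fill it with values.
layer = layer.to_empty(device="cpu")
layer.reset_parameters()
```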

restore_and_load ¶

restore_and_load() -> None

Restore the final decoder layers and final normalization layer and move them back to their original devices.

Raises:

Type	Description
`ValueError`	If the `truncated_layer_index` is `None`.

sample_transformed_embeddings ¶

sample_transformed_embeddings(
    input_ids: Tensor, noise_mask: Tensor, **kwargs: Any
) -> torch.Tensor

Sample transformed embeddings for the given input token ids.

Parameters:

Name	Type	Description	Default
`input_ids` ¶	`Tensor`	The sequences of input tokens to transform.	required
`noise_mask` ¶	`Tensor`	The mask that selects the elements of `input_ids` to transform. Where the mask is `False`, the values of `input_ids` are passed through.	required
`**kwargs` ¶	`Any`	Additional keyword arguments to the noise layer's `forward` method.	required

Returns:

Type	Description
`torch.Tensor`	Sampled transformed embeddings.

Examples:

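>>> import torch
>>> import transformers
>>> from stainedglass_core.model.noisy_transformer_masking_model import (
...     NoiseMaskedNoisyTransformerModel,
... )
>>> from stainedglass_core.utils import huggingface as sg_huggingface_utils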
>>> pretrained_model_name_or_path = (
...     "tests/resources/tokenizers/mini-Meta-Llama-3-8B"
... )
>>> config = transformers.AutoConfig.from_pretrained(pretrained_model_name_or_path)
>>> config.dtype = torch.float32
>>> base_model = transformers.AutoModelForCausalLM.from_pretrained(
...     pretrained_model_name_or_path,
...     dtype=torch.bfloat16,
... )
>>> noisy_model = NoiseMaskedNoisyTransformerModel(
...     transformer_cloak.TransformerCloak,
...     base_model,
...     target_layer="model.embed_tokens",
...     scale=(1e-8, 1.0),
...     transformer_type=type(
...         sg_huggingface_utils.get_base_model_decoder(base_model)
...     ),
...     config=config,
... )
>>> batch_size, seq_length = 1, 10
>>> input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_length))
>>> attention_mask = torch.hstack(
...     [
...         torch.zeros((batch_size, 2), dtype=torch.bool),
...         torch.ones((batch_size, seq_length - 2), dtype=torch.bool),
...     ]
... )
>>> noise_mask = torch.randint(0, 2, (batch_size, seq_length, 1), dtype=torch.bool)

>>> transformed_embeddings = noisy_model.sample_transformed_embeddings(
...     input_ids, noise_mask, attention_mask=attention_mask, use_cache=True
... )

truncate_and_offload ¶

truncate_and_offload() -> None

Remove the decoder layers after truncated_layer_index and the final normalization layer from the model and move them to the CPU.

Decoder layer truncation improves runtime performance and lowers memory usage, but removes access to the logits layer and thereby sacrifices the computability of metrics such as perplexity.

Raises:

Type	Description
`ValueError`	If the `truncated_layer_index` is `None`.
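
The truncate-then-restore cycle can be sketched on a toy layer stack. Everything here is an assumption made for illustration: `TinyDecoder` is hypothetical, layers up to and including the index are kept, the final normalization layer is omitted, and restoring simply re-attaches the layers without tracking their original devices.

```python
from typing import Optional

import torch.nn as nn


class TinyDecoder(nn.Module):
    """Toy decoder stack; attribute names are illustrative assumptions."""

    def __init__(self, num_layers: int = 4) -> None:
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(num_layers)])
        self._offloaded: list = []

    def truncate_and_offload(self, truncated_layer_index: Optional[int]) -> None:
        if truncated_layer_index is None:
            raise ValueError("truncated_layer_index must be set to truncate")
        kept = list(self.layers[: truncated_layer_index + 1])
        removed = list(self.layers[truncated_layer_index + 1 :])
        # Park the removed layers on the CPU so they can be restored later.
        self._offloaded = [layer.cpu() for layer in removed]
        self.layers = nn.ModuleList(kept)

    def restore_and_load(self) -> None:
        # Re-attach the parked layers in their original order.
        self.layers.extend(self._offloaded)
        self._offloaded = []


decoder = TinyDecoder(num_layers=4)
decoder.truncate_and_offload(1)
layers_while_truncated = len(decoder.layers)
decoder.restore_and_load()
```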
