cosine

Module for cosine similarity and distance loss functions.

Functions:

Name	Description
`absolute_cosine_similarity`	Calculate the absolute cosine similarity between two tensors, masked by a noise mask.
`batched_normalized_cosine_dist`	Compute the normalized cosine distance between `query` and `embedding_index` pairwise.
`normalized_cosine_distance`	Calculate the cosine distance (negative cosine similarity) between two tensors, scaled and shifted into the range [0, 1].
`normalized_cosine_similarity`	Calculate the cosine similarity between two tensors, scaled and shifted into the range [0, 1].
`squared_hinge_loss`	Compute a squared-hinge penalty on a value above a target ceiling.
`vision_cosine_distillation_loss`	Compute mean cosine-distance loss between teacher (clean) and student (noisy) features.
`vision_feature_cosine_similarity`	Compute mean cosine similarity between teacher (clean) and student (noisy) features.

absolute_cosine_similarity

absolute_cosine_similarity(
    x0: Tensor, x1: Tensor, noise_mask: Tensor | None = None
) -> torch.Tensor

Calculate the absolute cosine similarity between two tensors, masked by a noise mask.

When used as a loss it encourages the two tensors to be orthogonal.

Parameters:

Name	Type	Description	Default
`x0`	`Tensor`	The first tensor.	required
`x1`	`Tensor`	The second tensor.	required
`noise_mask`	`Tensor \| None`	A boolean mask indicating which elements to include in the calculation.	`None`

Returns:

Type	Description
`torch.Tensor`	The mean absolute cosine similarity between the two tensors, masked by the noise mask.
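
The masking and the orthogonality objective can be illustrated with a minimal sketch. It is not the library implementation: `absolute_cosine_similarity_sketch` is a hypothetical stand-in, and it assumes that similarity is taken along the last axis and that `noise_mask` has the shape of the inputs without that axis.

```python
from __future__ import annotations

import torch
import torch.nn.functional as F


def absolute_cosine_similarity_sketch(
    x0: torch.Tensor,
    x1: torch.Tensor,
    noise_mask: torch.Tensor | None = None,
) -> torch.Tensor:
    """Hypothetical stand-in for illustration only."""
    # Assumption: cosine similarity is computed along the last (feature) axis.
    abs_cos = F.cosine_similarity(x0, x1, dim=-1).abs()
    if noise_mask is not None:
        # Assumption: noise_mask is boolean with shape x0.shape[:-1].
        abs_cos = abs_cos[noise_mask]
    return abs_cos.mean()


x0 = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
x1 = torch.tensor([[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]])
mask = torch.tensor([True, True, False])

# Rows: orthogonal (|cos| = 0), antiparallel (|cos| = 1), parallel (|cos| = 1).
masked = absolute_cosine_similarity_sketch(x0, x1, mask)  # mean of 0 and 1
unmasked = absolute_cosine_similarity_sketch(x0, x1)  # mean of 0, 1 and 1
```

Because the absolute value treats parallel and antiparallel pairs alike, minimizing this quantity drives the pairs toward orthogonality rather than toward opposite directions.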

batched_normalized_cosine_dist

batched_normalized_cosine_dist(
    query: Tensor, embedding_index: Tensor, p: int = 2
) -> torch.Tensor

Compute the normalized cosine distance between query and embedding_index pairwise.

Note: The implementation takes a square root so that the result is a valid distance metric. Without it, cosine distance does not satisfy the triangle inequality.

Parameters:

Name	Type	Description	Default
`query`	`Tensor`	An n-dimensional tensor of shape (*, embedding_dim).	required
`embedding_index`	`Tensor`	A tensor of shape (n_embeddings, embedding_dim).	required
`p`	`int`	The p-norm to use for normalization. Defaults to 2 for standard Euclidean normalization.	`2`

Returns:

Type	Description
`torch.Tensor`	A tensor of shape (*, n_embeddings) containing the normalized cosine distances between the input tensors.

Examples:

>>> query = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
>>> embedding_index = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
>>> batched_normalized_cosine_dist(query, embedding_index)
tensor([[0.0000, 0.7071],
        [0.7071, 0.0000]])

Added in version v2.23.0. Added batched normalized cosine distance function.
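
The documented output for orthogonal unit vectors (0.7071) is consistent with the formula `sqrt((1 - cos) / 2)`. The sketch below assumes that formula; the actual implementation may differ, and `batched_normalized_cosine_dist_sketch` is a hypothetical name used only here.

```python
import torch
import torch.nn.functional as F


def batched_normalized_cosine_dist_sketch(
    query: torch.Tensor, embedding_index: torch.Tensor, p: int = 2
) -> torch.Tensor:
    """Hypothetical stand-in for illustration only."""
    # Normalize both inputs along the embedding axis with the requested p-norm.
    q = F.normalize(query, p=p, dim=-1)
    e = F.normalize(embedding_index, p=p, dim=-1)
    # (*, embedding_dim) @ (embedding_dim, n_embeddings) -> (*, n_embeddings)
    cos = q @ e.T
    # Assumed formula: sqrt((1 - cos) / 2). The clamp guards against tiny
    # negative values caused by floating-point rounding.
    return torch.sqrt(torch.clamp((1.0 - cos) / 2.0, min=0.0))


query = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
embedding_index = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
dist = batched_normalized_cosine_dist_sketch(query, embedding_index)
# Diagonal entries are 0 (identical vectors); off-diagonal entries are
# sqrt(0.5), about 0.7071 (orthogonal vectors).

# Leading batch dimensions of `query` are preserved in the output.
batched = batched_normalized_cosine_dist_sketch(
    torch.randn(3, 5, 4), torch.randn(7, 4)
)
```

Here `batched` has shape `(3, 5, 7)`: one distance per query vector and index entry.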

normalized_cosine_distance

normalized_cosine_distance(
    x1: Tensor, x2: Tensor, dim: int = 1, eps: float = 1e-08
) -> torch.Tensor

Calculate the cosine distance (negative cosine similarity) between two tensors, scaled and shifted into the range [0, 1].

Parameters:

Name	Type	Description	Default
`x1`	`Tensor`	The first tensor.	required
`x2`	`Tensor`	The second tensor.	required
`dim`	`int`	The dimension along which cosine distance is computed.	`1`
`eps`	`float`	A small value to prevent division by zero.	`1e-08`

Returns:

Type	Description
`torch.Tensor`	The cosine distance of the tensors, scaled and shifted to between 0 and 1.
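
One way to read "scaled and shifted" is the mapping `(1 - cos) / 2`, which sends identical directions to 0 and opposite directions to 1. The sketch below assumes that mapping and uses `torch.nn.functional.cosine_similarity`; the function name is hypothetical and the library may compute the value differently.

```python
import torch
import torch.nn.functional as F


def normalized_cosine_distance_sketch(
    x1: torch.Tensor, x2: torch.Tensor, dim: int = 1, eps: float = 1e-8
) -> torch.Tensor:
    """Hypothetical stand-in for illustration only."""
    cos = F.cosine_similarity(x1, x2, dim=dim, eps=eps)
    # Negate the similarity, then shift and scale from [-1, 1] into [0, 1].
    return (1.0 - cos) / 2.0


a = torch.tensor([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
b = torch.tensor([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# Rows: identical -> 0.0, orthogonal -> 0.5, opposite -> 1.0
dist = normalized_cosine_distance_sketch(a, b)
```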

normalized_cosine_similarity

normalized_cosine_similarity(
    x1: Tensor, x2: Tensor, dim: int = 1, eps: float = 1e-08
) -> torch.Tensor

Calculate the cosine similarity between two tensors, scaled and shifted into the range [0, 1].

Parameters:

Name	Type	Description	Default
`x1`	`Tensor`	The first tensor.	required
`x2`	`Tensor`	The second tensor.	required
`dim`	`int`	The dimension along which cosine similarity is computed.	`1`
`eps`	`float`	A small value to prevent division by zero.	`1e-08`

Returns:

Type	Description
`torch.Tensor`	The cosine similarity of the tensors, scaled and shifted to between 0 and 1.
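
The effect of `dim` on the output shape is easiest to see on a small example. The sketch below assumes the mapping `(cos + 1) / 2`, which is one natural reading of the description; `normalized_cosine_similarity_sketch` is a hypothetical name, not the library function.

```python
import torch
import torch.nn.functional as F


def normalized_cosine_similarity_sketch(
    x1: torch.Tensor, x2: torch.Tensor, dim: int = 1, eps: float = 1e-8
) -> torch.Tensor:
    """Hypothetical stand-in for illustration only."""
    cos = F.cosine_similarity(x1, x2, dim=dim, eps=eps)
    # Shift and scale from [-1, 1] into [0, 1].
    return (cos + 1.0) / 2.0


torch.manual_seed(0)

# Default dim=1: inputs of shape (batch, features) give one value per row.
rows = normalized_cosine_similarity_sketch(torch.randn(4, 8), torch.randn(4, 8))

# dim=-1: inputs of shape (batch, tokens, features) give one value per token.
x = torch.randn(2, 4, 8)
tokens = normalized_cosine_similarity_sketch(x, torch.randn(2, 4, 8), dim=-1)

# A tensor compared with itself sits at the top of the range.
self_sim = normalized_cosine_similarity_sketch(x, x, dim=-1)
```

The reduced dimension disappears from the output, so `rows` has shape `(4,)` and `tokens` has shape `(2, 4)`.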

squared_hinge_loss

squared_hinge_loss(
    value: Tensor, *, target: float
) -> torch.Tensor

Compute a squared-hinge penalty on a value above a target ceiling.

The loss is `relu(value - target) ** 2`: zero once `value <= target`, and quadratic in the margin above it. It is used to push a clean/noisy feature similarity (e.g. from `vision_feature_cosine_similarity`) below a privacy target without forcing antipodal collapse: the penalty vanishes once the features have diverged past `target`, so it does not reward over-divergence.

Privacy note: gradients flow through value into whatever produced it (typically the noisy/student branch of a Stained Glass Transform). A higher target is a weaker privacy constraint.

Parameters:

Name	Type	Description	Default
`value`	`Tensor`	Value to penalize, typically a scalar. Element-wise hinge is applied for non-scalar inputs.	required
`target`	`float`	Ceiling above which the squared margin is penalized.	required

Returns:

Type	Description
`torch.Tensor`	The squared-hinge penalty, matching the shape of `value`.

Example

>>> # At or below the target the penalty is zero.
>>> squared_hinge_loss(torch.tensor(0.3), target=0.5).item()
0.0
>>> # Above the target the penalty is the squared margin: (0.8 - 0.5) ** 2 == 0.09.
>>> round(squared_hinge_loss(torch.tensor(0.8), target=0.5).item(), 4)
0.09

Added in version v3.49.0. Squared-hinge penalty on a value above a target ceiling. Zero once the value has fallen to or below the target; quadratic above it.
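
The claim that the penalty does not reward over-divergence can be checked through the gradient. The sketch below re-implements the documented formula `relu(value - target) ** 2` under a hypothetical name so that it runs on its own, then compares gradients on either side of the ceiling.

```python
import torch


def squared_hinge_loss_sketch(value: torch.Tensor, *, target: float) -> torch.Tensor:
    """Stand-in that follows the documented formula relu(value - target) ** 2."""
    return torch.relu(value - target) ** 2


below = torch.tensor(0.3, requires_grad=True)
above = torch.tensor(0.8, requires_grad=True)

loss_below = squared_hinge_loss_sketch(below, target=0.5)
loss_above = squared_hinge_loss_sketch(above, target=0.5)
loss_below.backward()
loss_above.backward()

# Under the ceiling the gradient is zero, so nothing pushes the value lower.
# Above the ceiling the gradient is 2 * (value - target), here about 0.6.
grad_below = below.grad.item()
grad_above = above.grad.item()
```

A plain `value - target` penalty would keep a constant gradient of 1 everywhere, which is the over-divergence pressure the hinge avoids.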

vision_cosine_distillation_loss

vision_cosine_distillation_loss(
    teacher: Tensor, student: Tensor
) -> torch.Tensor

Compute mean cosine-distance loss between teacher (clean) and student (noisy) features.

Both tensors must share leading shape. Cosine similarity is taken along the last (feature) axis, then mean-reduced over all preceding axes. The teacher tensor is detached so gradients flow only through the student path; both tensors are promoted to float32 to keep the cosine numerically stable when the upstream model runs in bfloat16.

Typical use case: when training a vision-tower-aware Stained Glass Transform, the post-merger output of the vision encoder is the per-image-token feature stream spliced into the LLM's input embeddings. Pulling the noisy branch's stream toward the clean branch's adds a self-distillation signal that complements the language-head cross-entropy / KL, while privacy bounds (MS-SSIM, std-log) keep the cloak from collapsing to identity.

Privacy note: student carries gradients into the Stained Glass Transform parameters. teacher is detached and contributes only as a target.

Parameters:

Name	Type	Description	Default
`teacher`	`Tensor`	Clean-branch features of shape `(..., feature_dim)`. Detached internally.	required
`student`	`Tensor`	Noisy-branch features of the same shape as `teacher`.	required

Returns:

Type	Description
`torch.Tensor`	Scalar mean cosine distance `1 - cos(teacher, student)` in `[0, 2]`.

Example

>>> teacher = torch.randn(2, 4, 8)
>>> # Identical features yield a distance at the float32 precision floor.
>>> vision_cosine_distillation_loss(teacher, teacher.clone()).item() < 1e-6
True

Added in version v3.41.0. Self-distillation loss between a detached teacher and a student feature stream. Computes in float32 for bf16 stability and averages over all dimensions preceding the feature axis.
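
The detach and float32 promotion described above can be sketched in a few lines. This is an illustration under a hypothetical name, not the library code: it casts with `.float()` where the library may use its own precision helper.

```python
import torch
import torch.nn.functional as F


def vision_cosine_distillation_loss_sketch(
    teacher: torch.Tensor, student: torch.Tensor
) -> torch.Tensor:
    """Hypothetical stand-in for illustration only."""
    # Detach the teacher so it acts purely as a target, and promote both
    # tensors to float32 before taking the cosine.
    t = teacher.detach().float()
    s = student.float()
    # Cosine along the feature axis, then mean over all leading axes.
    return (1.0 - F.cosine_similarity(t, s, dim=-1)).mean()


torch.manual_seed(0)
teacher = torch.randn(2, 4, 8, requires_grad=True)
# A bfloat16 student mimics features from a model running in bf16.
student = torch.randn(2, 4, 8).to(torch.bfloat16).requires_grad_()

loss = vision_cosine_distillation_loss_sketch(teacher, student)
loss.backward()

# Gradients reach the student only; the teacher receives none.
teacher_has_grad = teacher.grad is not None
student_has_grad = student.grad is not None
```

After `backward()`, `teacher.grad` remains `None` even though `teacher` was created with `requires_grad=True`, which is the behavior the privacy note relies on.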

vision_feature_cosine_similarity

vision_feature_cosine_similarity(
    teacher: Tensor, student: Tensor
) -> torch.Tensor

Compute mean cosine similarity between teacher (clean) and student (noisy) features.

Both tensors must share leading shape. Cosine similarity is taken along the last (feature) axis, then mean-reduced over all preceding axes. The teacher tensor is detached so gradients flow only through the student path; the computation runs in float32 (via compute_in_precision) to keep the cosine numerically stable when the upstream model runs in bfloat16.

This is the diagnostic complement of vision_cosine_distillation_loss and equals 1 - vision_cosine_distillation_loss(teacher, student). When training a vision-tower-aware Stained Glass Transform, it measures how aligned the noisy branch's per-image-token features remain with the clean branch's — a privacy diagnostic that pairs with squared_hinge_loss to penalize features that stay too similar.

Privacy note: student carries gradients into the Stained Glass Transform parameters. teacher is detached and contributes only as a target.

Parameters:

Name	Type	Description	Default
`teacher`	`Tensor`	Clean-branch features of shape `(..., feature_dim)`. Detached internally.	required
`student`	`Tensor`	Noisy-branch features of the same shape as `teacher`.	required

Returns:

Type	Description
`torch.Tensor`	Scalar mean cosine similarity `cos(teacher, student)` in `[-1, 1]`.

Example

>>> teacher = torch.randn(2, 4, 8)
>>> # Identical features yield a similarity at the float32 precision ceiling.
>>> vision_feature_cosine_similarity(teacher, teacher.clone()).item() > 1.0 - 1e-6
True

Added in version v3.49.0. Mean cosine similarity between a detached teacher (clean) and a student (noisy) feature stream. Companion metric to `vision_cosine_distillation_loss`; computes in float32 for bf16 stability and averages over all dimensions preceding the feature axis.
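
The pairing with `squared_hinge_loss` might look like the sketch below. Both functions are re-implemented here under hypothetical names so the example runs standalone, and the ceiling of 0.5 and the noise scale of 0.1 are arbitrary illustration values, not recommended settings.

```python
import torch
import torch.nn.functional as F


def vision_feature_cosine_similarity_sketch(
    teacher: torch.Tensor, student: torch.Tensor
) -> torch.Tensor:
    """Hypothetical stand-in for illustration only."""
    t = teacher.detach().float()
    s = student.float()
    return F.cosine_similarity(t, s, dim=-1).mean()


def squared_hinge_loss_sketch(value: torch.Tensor, *, target: float) -> torch.Tensor:
    """Stand-in that follows the documented formula relu(value - target) ** 2."""
    return torch.relu(value - target) ** 2


torch.manual_seed(0)
teacher = torch.randn(2, 4, 8)
# A lightly perturbed copy stands in for the noisy branch's features.
student = (teacher + 0.1 * torch.randn(2, 4, 8)).requires_grad_()

similarity = vision_feature_cosine_similarity_sketch(teacher, student)
# Penalize only the part of the similarity that exceeds the ceiling.
penalty = squared_hinge_loss_sketch(similarity, target=0.5)
penalty.backward()

# The complementary distance, computed directly for comparison.
distance = (1.0 - F.cosine_similarity(teacher, student.detach(), dim=-1)).mean()
```

With such a small perturbation the similarity stays well above the ceiling, so the penalty is positive and its gradient reaches `student`. Once the similarity falls to the ceiling or below, the penalty and its gradient are both zero.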
