
Troubleshooting guide

Training Stained Glass Transform

| Issue | Cause | Solution |
| --- | --- | --- |
| Stained Glass Transform is not applied. | The base model was not wrapped in NoisyModel. | Wrap the base model in NoisyModel. |
| Stained Glass Transform does not appear to be applied after training, even though the base model is wrapped in a NoisyModel. | Alpha is too low, causing the base loss to dominate. | Increase alpha and retrain. |
| Stained Glass Transform is weak after training. | Alpha is too low, causing the base loss to dominate. | Increase alpha and retrain (see the sketch after this table). |
| Stained Glass Transform is too strong, hurting model performance during training. | Alpha is too high. | Decrease alpha and retrain. |
| Stained Glass Transform does not appear to train. | The base loss has not been wrapped with a transform loss wrapper. | Wrap the base loss in a transform loss wrapper. |
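
Several of the rows above come down to how alpha balances the base task loss against the transform loss: if alpha is too low, the base loss dominates and the transform barely trains; if it is too high, the transform term dominates and model performance suffers. The exact weighting is defined by the transform loss wrapper; the snippet below is only a conceptual, pure-PyTorch sketch, assuming a simple convex combination, and is not the wrapper's actual formulation.

```python
import torch


def composite_loss(
    base_loss: torch.Tensor, transform_loss: torch.Tensor, alpha: float
) -> torch.Tensor:
    """Conceptual illustration only: assumes the transform loss wrapper blends
    the base loss and the transform loss with a convex combination controlled
    by alpha. The real wrapper may use a different formulation.
    """
    return (1.0 - alpha) * base_loss + alpha * transform_loss


# With a small alpha the base loss dominates and the transform barely trains;
# with a large alpha the transform term dominates and task performance drops.
base_loss = torch.tensor(2.3)       # e.g. a cross-entropy value
transform_loss = torch.tensor(0.9)  # e.g. a transform strength/variance term
for alpha in (0.05, 0.5, 0.95):
    total = composite_loss(base_loss, transform_loss, alpha)
    print(f"alpha={alpha}: total loss {total.item():.3f}")
```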

Training using composite distillation loss

In addition to the common issues listed above, the following issues may arise when training using composite distillation loss.

| Issue | Cause | Solution |
| --- | --- | --- |
| Stained Glass Transform fails to converge after choosing a shallow distillation layer. | The depth of the chosen distillation layer can affect the performance of the Stained Glass Transform during training. | Search for the optimal distillation layer as if it were a hyperparameter. |
| Base model performance degrades or diverges after choosing a shallow distillation layer. | The depth of the chosen distillation layer can also affect base model performance during training. | Search for the optimal distillation layer as if it were a hyperparameter. |
| Choosing a distillation layer deep in the network causes the Stained Glass Transform to train too slowly, while model performance is largely unaffected. | Deeper distillation layers may require a higher alpha. | Increase alpha and retrain. |
| The loss used to train the base model (computed from base model outputs) does not work with composite distillation loss. | Composite distillation loss compares the activations of a chosen "distillation layer" in the base model with and without transformed inputs. Base model outputs are not considered and may not even be computed. | Use a loss function that compares the activations of the chosen distillation layer. One example that often works well in practice is a scaled "vector cosine distance" between the input tensors, equivalent to 1 - normalized_cosine_similarity (see the sketch after this table). |
| Model loss distillation diverges (or converges to a maximum, if bounded from above) within a few batches (often within one batch). | The model loss distillation criterion function is formulated incorrectly. | The criterion function should attain its minimum (with loss == 0) when its two input tensors are identical. If the loss is bounded from above, its maximum/supremum should correspond to the two input tensors being as dissimilar as possible. One example that often works well in practice is a scaled "vector cosine distance" between the input tensors, equivalent to 1 - normalized_cosine_similarity. |
| Model loss distillation diverges (or converges to a maximum, if bounded from above) within a few batches (often within one batch). | Numerical precision issues in the model loss criterion function cause vanishing or exploding gradients in the Stained Glass Transform. | If the model loss criterion function is calculated in bfloat16 or fp16, try calculating it in single precision (fp32) instead. |
| Multiple Stained Glass Transform outputs (for a fixed input) do not vary sufficiently. | The transform variance priority (or alpha) is too low. | Increase the transform variance priority (and potentially alpha) and retrain. |
| Transformed embeddings do not vary sufficiently from the input embeddings. | The transform strength priority (or alpha) is too low. | Increase the transform strength priority (and potentially alpha) and retrain. |
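
Several rows above reference the "vector cosine distance" criterion and the precision at which it is computed. Below is a plain-PyTorch sketch of such a criterion, written to satisfy the properties described in the table: it is zero when the two activation tensors are identical, bounded from above, and computed in fp32. It illustrates the idea and is not the criterion shipped with the library.

```python
import torch
import torch.nn.functional as F


def cosine_distance_criterion(
    transformed_activations: torch.Tensor, clean_activations: torch.Tensor
) -> torch.Tensor:
    """Mean "vector cosine distance" (1 - cosine similarity) between the
    distillation-layer activations with and without the transform applied.

    Equals 0 when the activations are identical and is bounded above by 2
    (reached when the activation vectors point in opposite directions).
    Illustrative sketch only.
    """
    # Compute in fp32 even if the activations are bf16/fp16, to avoid
    # vanishing or exploding gradients caused by reduced precision.
    transformed_activations = transformed_activations.float()
    clean_activations = clean_activations.float()

    # Flatten everything except the batch dimension so each sample is one vector.
    transformed_flat = transformed_activations.flatten(start_dim=1)
    clean_flat = clean_activations.flatten(start_dim=1)

    cosine_similarity = F.cosine_similarity(transformed_flat, clean_flat, dim=-1)
    return (1.0 - cosine_similarity).mean()
```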

Inference using Stained Glass Transform

Large Language Models

| Issue | Cause | Solution |
| --- | --- | --- |
| Embeddings passed to the model are not transformed. | The Stained Glass Transform was not applied to the embeddings. | Apply the Stained Glass Transform to the embeddings. We recommend using a StainedGlassTransformForText module for inference (see the sketch after this table). |
| Generation devolves into nonsense after 1 or 2 tokens. | The Stained Glass Transform hook was not removed from the base model, so the transform is automatically re-applied to the input after every newly generated token. | We recommend using a StainedGlassTransformForText module for inference. |
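
The recommended flow is to apply the transform to the prompt embeddings exactly once, outside the base model, and then generate from the already-transformed embeddings so nothing re-applies the transform on subsequent tokens. The sketch below shows that general shape. The construction and call signature of the StainedGlassTransformForText module are assumptions for illustration (only the Hugging Face generate call with inputs_embeds is a known API); consult the module's documentation for the actual interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint name, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("my-org/my-base-model")
model = AutoModelForCausalLM.from_pretrained("my-org/my-base-model")

# A StainedGlassTransformForText instance; construction omitted here because
# it depends on the trained transform (see the library documentation).
stained_glass_transform = ...

prompt = "Summarize the quarterly report."
inputs = tokenizer(prompt, return_tensors="pt")

# Apply the Stained Glass Transform exactly once, to the prompt embeddings,
# outside the base model, so no hook remains attached that would re-transform
# the input after every newly generated token. The call signature below is an
# assumption for illustration.
transformed_embeddings = stained_glass_transform(inputs["input_ids"])

with torch.no_grad():
    output_ids = model.generate(
        inputs_embeds=transformed_embeddings,
        attention_mask=inputs["attention_mask"],
        max_new_tokens=64,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```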