| Stained Glass Transform fails to converge after choosing a shallow distillation layer. | The depth of the chosen distillation layer can affect the performance of Stained Glass Transform during training. | Search for the optimal distillation layer as if it were a hyperparameter. | 
| Base model performance degrades or diverges after choosing a shallow distillation layer. | The depth of the chosen distillation layer can affect the performance of Stained Glass Transform during training. | Search for the optimal distillation layer as if it were a hyperparameter. | 
| Choosing a distillation layer deep in the network causes the Stained Glass Transform to train too slowly, but model performance is largely unaffected. | Deeper distillation layers may require a higher alpha. | Increase alpha and retrain. | 
| The loss used to train the base model, which operates on base model outputs, does not work with composite distillation loss. | Composite distillation loss uses a loss function that compares the activations of a chosen "distillation layer" in the base model with and without transformed inputs. Base model outputs are not considered, and may not even be calculated. | Use a loss function that compares the activations of the chosen distillation layer. One example that often works well in practice is a scaled "vector cosine distance" between the input tensors, equivalent to `1 - normalized_cosine_similarity`. | 
| Model loss distillation diverges (or converges to a maximum, if bounded from above) within a few batches (often within one batch). | The model loss distillation criterion function is formulated incorrectly. | The model loss distillation criterion function should be at its minimum (with `loss == 0`) when its two input tensors are identical. Additionally, if the loss function is bounded from above, its maximum/supremum should correspond to the case where the two input tensors are as dissimilar as possible. One example that often works well in practice is a scaled "vector cosine distance" between the input tensors, equivalent to `1 - normalized_cosine_similarity`. | 
| Model loss distillation diverges (or converges to a maximum, if bounded from above) within a few batches (often within one batch). | Numerical precision issues in the model loss criterion function cause vanishing or exploding gradients in the Stained Glass Transform. | If calculating the model loss criterion function in `bfloat16` or `fp16`, try calculating it in single precision (`fp32`) instead. | 
| Multiple Stained Glass Transform outputs (for a fixed input) do not vary sufficiently. | The transform variance priority (or alpha) is too low. | Increase the transform variance priority (and potentially increase alpha) and retrain. | 
| Transformed embeddings do not vary sufficiently from the input embeddings. | The transform strength priority (or alpha) is too low. | Increase the transform strength priority (and potentially increase alpha) and retrain. |
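
The criterion guidance in the table (zero loss for identical activations, a bounded maximum for maximally dissimilar ones, and single-precision arithmetic) can be sketched in PyTorch as follows. This is a minimal illustration, not the library's implementation: the function name `cosine_distance_loss`, its `scale` parameter, and the choice to average over all leading dimensions are assumptions, and "normalized cosine similarity" is read here as the standard cosine similarity over the feature dimension.

```python
import torch
import torch.nn.functional as F


def cosine_distance_loss(
    clean: torch.Tensor, transformed: torch.Tensor, scale: float = 1.0
) -> torch.Tensor:
    """Scaled cosine distance between two activation tensors.

    Hypothetical helper for illustration; not part of any library API.
    """
    # Upcast so the dot product and norms are computed in single precision,
    # even when the activations arrive as bfloat16 or fp16.
    clean = clean.float()
    transformed = transformed.float()
    # Cosine similarity over the last (feature) dimension, in [-1, 1].
    similarity = F.cosine_similarity(clean, transformed, dim=-1, eps=1e-8)
    # Distance is 0 when directions match and 2 when they are opposite,
    # so the loss is bounded in [0, 2 * scale].
    return scale * (1.0 - similarity).mean()
```

With this formulation, passing the same tensor twice yields a loss near zero, and passing a tensor and its negation yields a loss near `2 * scale`, which matches the minimum and supremum conditions described in the table. Note that the loss is also zero for tensors that differ only by a positive scale factor, which may or may not be acceptable for a given distillation layer.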