| Problem | Possible Cause | Suggested Solution |
| --- | --- | --- |
| Stained Glass Transform fails to converge after choosing a shallow distillation layer. | The depth of the chosen distillation layer can affect the performance of the Stained Glass Transform during training. | Search for the optimal distillation layer as if it were a hyperparameter. |
| Base model performance degrades or diverges after choosing a shallow distillation layer. | The depth of the chosen distillation layer can affect the performance of the Stained Glass Transform during training. | Search for the optimal distillation layer as if it were a hyperparameter. |
| Choosing a distillation layer deep in the network causes the Stained Glass Transform to train too slowly, but model performance is largely unaffected. | Deeper distillation layers may require a higher alpha. | Increase alpha and retrain. |
| The loss used to train the base model, which uses base model outputs, does not work with the composite distillation loss. | The composite distillation loss compares the activations of a chosen "distillation layer" in the base model with and without transformed inputs. Base model outputs are not considered and may not even be calculated. | Use a loss function that compares the activations of the chosen distillation layer. One example that often works well in practice is a scaled "vector cosine distance" between the input tensors, equivalent to `1 - normalized_cosine_similarity` (see the first sketch after this table). |
| Model loss distillation diverges (or converges to a maximum, if bounded from above) within a few batches (often within one batch). | The model loss distillation criterion function is formulated incorrectly. | The model loss distillation criterion function should be at its minimum (with a loss of 0) when its two input tensors are identical. Additionally, if the loss function is bounded from above, its maximum/supremum should correspond to the two input tensors being as dissimilar as possible. One example that often works well in practice is a scaled "vector cosine distance" between the input tensors, equivalent to `1 - normalized_cosine_similarity` (see the first sketch after this table). |
| Model loss distillation diverges (or converges to a maximum, if bounded from above) within a few batches (often within one batch). | Numerical precision issues in the model loss criterion function cause vanishing or exploding gradients in the Stained Glass Transform. | If calculating the model loss criterion function in `bfloat16` or `fp16`, try calculating it in single precision (`fp32`) instead (see the second sketch after this table). |
| Multiple Stained Glass Transform outputs (for a fixed input) do not vary sufficiently. | The transform variance priority (or alpha) is too low. | Increase the transform variance priority (and potentially increase alpha) and retrain. |
| Transformed embeddings do not vary sufficiently from the input embeddings. | The transform strength priority (or alpha) is too low. | Increase the transform strength priority (and potentially increase alpha) and retrain. |
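
The scaled "vector cosine distance" mentioned above can be written as a standalone criterion. The sketch below is a minimal PyTorch formulation, assuming the criterion receives the distillation-layer activations with and without the transform applied; the function name and the flattening of non-batch dimensions are illustrative choices, not part of the Stained Glass Transform API.

```python
import torch
import torch.nn.functional as F


def cosine_distance_criterion(
    transformed_activations: torch.Tensor,
    original_activations: torch.Tensor,
) -> torch.Tensor:
    """Scaled vector cosine distance, i.e. 1 - normalized_cosine_similarity.

    The loss is 0 when the two activation tensors are identical and reaches
    its maximum (2) when they point in opposite directions, matching the
    criterion requirements described in the table above.
    """
    # Flatten everything except the batch dimension so each sample is
    # compared as a single vector.
    similarity = F.cosine_similarity(
        transformed_activations.flatten(start_dim=1),
        original_activations.flatten(start_dim=1),
        dim=-1,
    )
    return (1.0 - similarity).mean()
```

A quick sanity check: `cosine_distance_criterion(x, x)` should return 0 (up to floating-point error) for any activation tensor `x`.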
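
For the numerical precision row, one common mitigation is to upcast the activations to single precision just for the criterion computation, even when the rest of training runs in `bfloat16` or `fp16`. The sketch below shows that pattern, reusing the illustrative criterion defined above; it is an assumption about how you might structure the fix, not a documented requirement of the library.

```python
def cosine_distance_criterion_fp32(
    transformed_activations: torch.Tensor,
    original_activations: torch.Tensor,
) -> torch.Tensor:
    """Compute the criterion in fp32 to avoid bfloat16/fp16 rounding issues."""
    # Upcasting only the criterion inputs keeps the memory and speed benefits
    # of mixed precision for the rest of the forward and backward passes.
    return cosine_distance_criterion(
        transformed_activations.float(),
        original_activations.float(),
    )
```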