Evaluating Stained Glass Transform For LLMs¶
We can evaluate a Stained Glass Transform for LLMs against a wide variety of benchmarks using sglm_eval, Stained Glass Core's extension of EleutherAI's lm_eval harness.
sglm_eval¶
sglm_eval is a CLI utility for evaluating Stained Glass Transforms for LLMs.
Installation¶
sglm_eval is bundled with the stainedglass_core package. Please refer to the installation steps.
Usage¶
sglm_eval --model=sghflm \
--model_args=parallelize=True,apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,peft_id=<peft_adapter_directory>,peft_adapter_name=<adapter_name>,max_length=8192,dtype=bfloat16 \
--tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
--device=cuda \
--batch_size=20 \
--trust_remote_code \
--wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
--num_fewshot=0 \
--output_path /path/to/output/file
Command Breakdown¶
More details about the lm_eval CLI can be found here.
- --model=sghflm: Specifies the evaluation model class. Currently only sghflm is supported, which extends the lm_eval.models.huggingface.HFLM class.
- --model_args:
    - parallelize=True: Enables pipeline parallelization.
    - apply_stainedglass=True: Applies the Stained Glass Transform to the evaluation class. Set this to False to evaluate baseline models (see the example after this list).
    - transform_model_path: Path to your Stained Glass Transform file.
    - pretrained=<base_model_directory>: Path to the base pretrained model weights.
    - peft_id=<peft_adapter_directory>: Path to the PEFT adapters, used when evaluating PEFT models.
    - max_length: Sets the maximum token length for processing.
    - dtype: Configures the PyTorch tensor dtype.
    - seed: Sets the seed specifically for StainedGlassTransformForText.
- --tasks: List of evaluation tasks to run. More details about --tasks can be found here.
    - arc_challenge: AI2 Reasoning Challenge (challenge set).
    - arc_easy: AI2 Reasoning Challenge (easy set).
    - openbookqa: Open Book Question Answering.
    - piqa: Physical Interaction Question Answering.
    - truthfulqa_mc2: TruthfulQA (multiple-choice format, version 2).
    - hellaswag: Benchmark for commonsense reasoning and next-step prediction.
    - mmlu: Massive Multitask Language Understanding (MMLU), which evaluates a model's ability to handle tasks across a wide variety of subjects.
- --device: Specifies the device to be used for evaluation.
- --batch_size: Specifies the batch size for model evaluation.
- --trust_remote_code: Allows execution of remote code, required for certain models or extensions.
- --wandb_args: Configures Weights & Biases (W&B) for tracking evaluation runs. More details about --wandb_args can be found here.
    - project: Name of the W&B project.
    - entity: W&B entity (team or user).
    - resume: Manages resumption of an interrupted run.
    - name: Custom W&B run name.
    - job_type: Specifies the W&B job type.
- --num_fewshot=0: Sets the number of few-shot examples to place in context. Must be an integer.
- --output_path: Path to save the evaluation results.
- --apply_chat_template: If True, applies the chat template to the prompt. Currently only True is supported, which is also the default.
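For example, to evaluate the baseline model with the transform disabled, set apply_stainedglass=False. The command below is a sketch built from the flags described above; it assumes transform_model_path and the PEFT arguments can be omitted when the transform is not applied, and the placeholder paths and task list should be adjusted for your setup.

sglm_eval --model=sghflm \
--model_args=parallelize=True,apply_stainedglass=False,pretrained=<base_model_directory>,max_length=8192,dtype=bfloat16 \
--tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
--device=cuda \
--batch_size=20 \
--num_fewshot=0 \
--output_path /path/to/baseline/output/file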
Parallelization¶
The lm_eval harness supports both data parallelism and model parallelism (see details here). For multi-GPU evaluation, use the standard lm_eval parallelization approaches described in the lm_eval documentation.
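As an illustration, the lm_eval documentation describes launching data-parallel evaluation (one model replica per GPU) through Hugging Face accelerate. The sketch below shows that standard pattern with the stock hf model class; whether sglm_eval's sghflm model can be launched the same way is an assumption you should verify against your installation.

# Data-parallel evaluation with one model replica per GPU, following the
# standard lm_eval pattern (substitute sghflm and its model_args if supported):
accelerate launch -m lm_eval \
--model hf \
--model_args pretrained=<base_model_directory>,dtype=bfloat16 \
--tasks arc_challenge,arc_easy \
--batch_size 16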