# Evaluating Stained Glass Transform For LLMs
We can evaluate a Stained Glass Transform for LLMs against a wide variety of benchmarks using `sglm_eval`, Stained Glass Core's extension of EleutherAI's `lm_eval` harness.
## sglm_eval

`sglm_eval` is a CLI utility for evaluating Stained Glass Transforms for LLMs.
### Installation

`sglm_eval` comes bundled with the `stainedglass_core` package. Please refer to the installation steps.
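The linked installation steps are authoritative; assuming the package is available on your configured Python package index, installation is typically along these lines:

```bash
# Hypothetical install command; the package index and any extras are
# deployment-specific, so follow the official installation steps.
pip install stainedglass_core

# Confirm the bundled CLI is available.
sglm_eval --help
```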
### Usage
```bash
sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --system_instruction "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n" \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
    --num_fewshot=0 \
    --output_path /path/to/output/file
```
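For comparison against the unprotected baseline, the same command can be run with `apply_stainedglass=False` (described in the breakdown below). A minimal sketch, assuming all other arguments stay as above and that `transform_model_path` may be omitted once the transform is disabled:

```bash
# Baseline evaluation: identical setup, but with the Stained Glass
# Transform disabled so the unmodified base model is scored.
sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=False,pretrained=<base_model_directory>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --num_fewshot=0 \
    --output_path /path/to/output/file
```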
### Command Breakdown

More details about the `lm_eval` CLI can be found here.
- `--model=sghflm`: Specifies the evaluation model class. (Currently only `sghflm`, which extends the `lm_eval.models.huggingface.HFLM` class, is supported.)
- `--model_args`:
    - `parallelize=True`: Enables model parallelization for efficient computation.
    - `apply_stainedglass=True`: Applies the Stained Glass Transform to the evaluation class. Set this to `False` to evaluate baseline models, as in the baseline sketch in the Usage section above.
    - `transform_model_path`: Sets the path to your Stained Glass Transform file.
    - `pretrained=<base_model_directory>`: Sets the path to the base pretrained model weights.
    - `max_length`: Sets the maximum token length for processing.
    - `dtype`: Configures the PyTorch tensor `dtype`.
    - `seed`: Sets the seed specifically for `StainedGlassTransformForText`.
- `--tasks`: List of evaluation tasks to run (a task-discovery sketch follows this breakdown). More details about `--tasks` can be found here:
    - `arc_challenge`: AI2 Reasoning Challenge (hard).
    - `arc_easy`: AI2 Reasoning Challenge (easy).
    - `openbookqa`: Open Book Question Answering.
    - `piqa`: Physical Interaction Question Answering.
    - `truthfulqa_mc2`: TruthfulQA (multiple-choice format, version 2).
    - `hellaswag`: Benchmark for commonsense reasoning and next-step prediction.
    - `mmlu`: Massive Multitask Language Understanding (MMLU), which evaluates a model's ability to handle tasks across a wide variety of subjects.
- `--system_instruction`: Specifies a system instruction string to prepend to the prompt. Currently only supported for the Llama 3 family of models.
- `--device`: Specifies the device to be used for evaluation.
- `--batch_size`: Specifies the batch size for model evaluation.
- `--trust_remote_code`: Allows execution of remote code, required for certain models or extensions.
- `--wandb_args`: Configures Weights & Biases (W&B) for tracking evaluation runs. More details about `--wandb_args` can be found here:
    - `project`: Name of the W&B project.
    - `entity`: W&B entity (team or user).
    - `resume`: Manages resumption of an interrupted run.
    - `name`: Custom W&B run name.
    - `job_type`: Specifies the W&B job type.
- `--num_fewshot=0`: Sets the number of few-shot examples to place in context. Must be an integer.
- `--output_path`: Path to save the evaluation results.
- `--apply_chat_template`: If `True`, applies the chat template to the prompt. Currently only `True`, which is also the default, is supported.
- `--seed`: Sets the seed for Python's `random`, NumPy, PyTorch, and fewshot sampling (see the precedence sketch after this list). Accepts a comma-separated list of four values for Python's `random`, NumPy, PyTorch, and fewshot seeds, respectively, or a single integer to set the same seed for all four. Each value is either an integer or `None` to leave that seed unset. The default is `0,1234,1234,1234` (for backward compatibility). E.g., `--seed 0,None,8` sets `random.seed(0)` and `torch.manual_seed(8)`; NumPy's seed is not set since the second value is `None`. E.g., `--seed 42` sets all four seeds to 42.

> [!NOTE]
> The torch seed value from the `--seed` argument will be used as the `seed` value in `--model_args` unless it is explicitly set.
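Upstream `lm_eval` can enumerate all registered task names via `--tasks list`; assuming `sglm_eval` inherits this behavior from the harness, available benchmarks can be discovered with:

```bash
# Print every task name registered with the harness.
# (Assumes sglm_eval inherits lm_eval's `--tasks list` behavior.)
sglm_eval --tasks list
```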
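To illustrate the note above, here is a sketch of the seed precedence; both invocations are hypothetical, and `seed=` in `--model_args` is the `StainedGlassTransformForText` seed documented earlier:

```bash
# No seed= in --model_args: the torch seed (8) from --seed is reused
# as the StainedGlassTransformForText seed.
sglm_eval --model=sghflm \
    --model_args=apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory> \
    --tasks=arc_easy \
    --seed 0,None,8

# An explicit seed= in --model_args takes precedence over the torch seed.
sglm_eval --model=sghflm \
    --model_args=apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,seed=1337 \
    --tasks=arc_easy \
    --seed 0,None,8
```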