
Evaluating Stained Glass Transform For LLMs

A Stained Glass Transform for LLMs can be evaluated against a wide variety of benchmarks using sglm_eval, Stained Glass Core's extension of EleutherAI's lm_eval harness.

sglm_eval

sglm_eval is a CLI utility for evaluating Stained Glass Transforms for LLMs.

Installation

sglm_eval comes bundled with the stainedglass_core package. Please refer to the installation steps.
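
A minimal sketch of what installation might look like, assuming stainedglass_core is published to a package index your environment can reach; the actual channel and any extras are an assumption here, so defer to the installation steps above:

# Assumption: stainedglass_core is installable from your configured (possibly private) package index.
pip install stainedglass_core

# The sglm_eval entry point should then be available on your PATH.
sglm_eval --help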

Usage

sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --system_instruction "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n" \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
    --num_fewshot=0 \
    --output_path /path/to/output/file
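
To score the corresponding baseline model with no transform applied, the same invocation can be reused with apply_stainedglass set to False, as described under Command Breakdown below. The following is a sketch rather than a prescribed command: it only flips that flag, keeps transform_model_path in place (depending on the implementation it may be omissible when the transform is disabled), and drops the logging and system-instruction flags for brevity.

sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=False,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --num_fewshot=0 \
    --output_path /path/to/output/file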

Command Breakdown

More details about the lm_eval CLI here.

  • --model=sghflm: Specifies the evaluation model class. Currently only sghflm is supported, which extends the lm_eval.models.huggingface.HFLM class.
  • --model_args:
    • parallelize=True: Enables model parallelization for efficient computation.
    • apply_stainedglass=True: Applies the Stained Glass Transform during evaluation. Set this to False to evaluate the baseline model (see the baseline example in the Usage section above).
    • transform_model_path: Path to the Stained Glass Transform file.
    • pretrained=<base_model_directory>: Path to the base pretrained model weights.
    • max_length: Sets the maximum token length for processing.
    • dtype: Sets the PyTorch tensor dtype used for evaluation.
    • seed: Sets the seed specifically for StainedGlassTransformForText (see the note on seeding at the end of this list).
  • --tasks: List of evaluation tasks to run. More details about --tasks here:
    • arc_challenge: AI2 Reasoning Challenge (ARC), challenge set.
    • arc_easy: AI2 Reasoning Challenge (ARC), easy set.
    • openbookqa: Open Book Question Answering.
    • piqa: Physical Interaction Question Answering.
    • truthfulqa_mc2: TruthfulQA (multiple-choice format, version 2).
    • hellaswag: Benchmark for commonsense reasoning and next-step prediction.
    • mmlu: Massive Multitask Language Understanding (MMLU) evaluates a model's ability to handle tasks across a wide variety of subjects.
  • --system_instruction: Specifies a system instruction string to prepend to the prompt. Currently only supported for the llama3 family of models.
  • --device: Specifies the device to use for evaluation.
  • --batch_size: Specifies the batch size for evaluation.
  • --trust_remote_code: Allows execution of remote code, required for certain models or extensions.
  • --wandb_args: Configures Weights & Biases (W&B) for tracking evaluation runs. More details about --wandb_args here:
    • project: Name of the W&B project.
    • entity: W&B entity (team or user).
    • resume: Manage resumption of an interrupted run.
    • name: Custom W&B run name.
    • job_type: Specifies the W&B job type.
  • --num_fewshot=0: Sets the number of few-shot examples to place in context. Must be an integer.
  • --output_path: Path to save the evaluation results.
  • --apply_chat_template: If True, applies the chat template to the prompt. Currently only True is supported, which is also the default.
  • --seed: Sets the seed for Python's random, NumPy, torch, and few-shot sampling. Accepts a comma-separated list of 4 values (for Python's random, NumPy, torch, and few-shot sampling seeds, respectively) or a single integer to set the same seed for all four. Each value is either an integer or 'None' to leave that seed unset. Default is 0,1234,1234,1234 (for backward compatibility). E.g. --seed 0,None,8,52 sets random.seed(0), torch.manual_seed(8), and the few-shot sampling seed to 52; NumPy's seed is not set since the second value is None. E.g. --seed 42 sets all four seeds to 42.

    [!NOTE] The torch seed value from the --seed argument will be used as the seed in --model_args unless it is explicitly set there (see the sketch below).
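
A short sketch of the seeding behavior described above, trimmed to the seed-relevant arguments and using the same placeholders as the Usage section. In the first invocation the transform inherits the torch seed (8) from --seed; in the second, the explicit seed=123 inside --model_args takes precedence over the torch seed (42):

# Transform inherits torch seed 8; random.seed(0), NumPy left unset, few-shot sampling seeded with 52.
sglm_eval --model=sghflm \
    --model_args=apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory> \
    --tasks=arc_easy \
    --seed 0,None,8,52 \
    --output_path /path/to/output/file

# Explicit transform seed: seed=123 in --model_args overrides the inherited torch seed (42).
sglm_eval --model=sghflm \
    --model_args=apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,seed=123 \
    --tasks=arc_easy \
    --seed 42 \
    --output_path /path/to/output/file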