Evaluating Stained Glass Transform For LLMs

We can evaluate a Stained Glass Transform for LLMs against a wide variety of benchmarks using sglm_eval, Stained Glass Core's extension of EleutherAI's lm_eval harness.

sglm_eval

sglm_eval is a CLI utility for evaluating Stained Glass Transforms for LLMs.

Installation

sglm_eval comes bundled with the stainedglass_core package. Please refer to the installation steps.
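
Once installed, you can verify that the entry point is available (a quick sanity check; like lm_eval, which sglm_eval extends, the CLI prints its supported flags with --help):

sglm_eval --help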

Usage

sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --system_instruction "Cutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\n" \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
    --num_fewshot=0 \
    --output_path /path/to/output/file
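
For example, a run against a locally stored Llama 3 8B checkpoint might look like the following (all paths and names below are hypothetical placeholders for illustration):

sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=True,transform_model_path=/models/sgt/llama3-8b-sgt,pretrained=/models/llama3-8b,max_length=8192,dtype=bfloat16 \
    --tasks=arc_easy,piqa \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --num_fewshot=0 \
    --output_path /results/llama3-8b-sgt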

Command Breakdown

More details about the lm_eval CLI can be found here.

  • --model=sghflm: Specifies the evaluation model class (currently only sghflm is supported, which extends the lm_eval.models.huggingface.HFLM class).
  • --model_args:
    • parallelize=True: Enables pipeline parallelization.
    • apply_stainedglass=True: Applies the Stained Glass Transform during evaluation. Set this to False to evaluate baseline models (see the baseline sketch after this list).
    • transform_model_path: Path to your Stained Glass Transform file.
    • pretrained=<base_model_directory>: Path to the base pretrained model weights.
    • max_length: Sets the maximum token length for processing.
    • dtype: Configures the PyTorch tensor dtype.
    • seed: Sets the seed specifically for StainedGlassTransformForText.
  • --tasks: Comma-separated list of evaluation tasks to run. More details about --tasks can be found here:
    • arc_challenge: AI2 Reasoning Challenge (challenge set).
    • arc_easy: AI2 Reasoning Challenge (easy set).
    • openbookqa: Open Book Question Answering.
    • piqa: Physical Interaction Question Answering.
    • truthfulqa_mc2: TruthfulQA (multiple-choice format, version 2).
    • hellaswag: Benchmark for commonsense reasoning and next-step prediction.
    • mmlu: Massive Multitask Language Understanding (MMLU) evaluates a model's ability to handle tasks across a wide variety of subjects.
  • --system_instruction: Specifies a system instruction string to prepend to the prompt. Currently only supported for the Llama 3 family of models.
  • --device: Specifies the device to use for evaluation.
  • --batch_size: Specifies the batch size for model evaluation.
  • --trust_remote_code: Allows execution of remote code, required for certain models or extensions.
  • --wandb_args: Configures Weights & Biases (W&B) tracking for evaluation runs. More details about --wandb_args can be found here:
    • project: Name of the W&B project.
    • entity: W&B entity (team or user).
    • resume: Manage resumption of an interrupted run.
    • name: Custom W&B run name.
    • job_type: Specifies the W&B job type.
  • --num_fewshot=0: Sets the number of few-shot examples to place in context. Must be an integer.
  • --output_path: Path to save the evaluation results.
  • --apply_chat_template: If True, applies the chat template to the prompt. Currently only True is supported, which is also the default.
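
As noted for apply_stainedglass above, the same command can be used to score the unprotected base model for comparison. A minimal baseline sketch (paths are placeholders; whether transform_model_path may be omitted when the transform is disabled is an assumption):

sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=False,pretrained=<base_model_directory>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --device=cuda \
    --batch_size=20 \
    --num_fewshot=0 \
    --output_path /path/to/baseline/output

Comparing these results against the transformed run quantifies any utility change introduced by the Stained Glass Transform.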

Parallelization

The lm_eval harness supports two types of parallelization: data parallelism and model parallelism (see details here). To make evaluation of very large models possible and more efficient, we added support for tensor parallelism and Fully Sharded Data Parallelism (FSDP), using torch.distributed.tensor.parallel and torch.distributed._composable.fsdp, respectively. These are enabled by setting tensor_parallel=True and/or fsdp=True in the model_args parameter and using torchrun to launch the evaluation. Setting either tensor_parallel or fsdp to True also overrides the parallelize parameter to False.

  1. For small models that can fit on a single GPU, use data parallelism as described here.
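
As a sketch, assuming sglm_eval follows lm_eval's convention of launching data-parallel replicas with accelerate (the exact launcher integration is an assumption):

accelerate launch -m stainedglass_core.integrations.lm_eval \
    --model=sghflm \
    --model_args=apply_stainedglass=True ...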

  2. For models that are too large to fit on a single GPU but can fit on a single node, tensor parallelism is the most efficient approach.

torchrun \
    --nproc_per_node=<num_gpus> \
    -m \
    stainedglass_core.integrations.lm_eval \
    --model=sghflm \
    --model_args=tensor_parallel=True,apply_stainedglass=True ...
  3. For models that are too large to fit on a single node, use both tensor parallelism and FSDP.
torchrun \
    --nproc_per_node=<num_gpus> \
    --nnodes=<num_nodes> \
    -m \
    stainedglass_core.integrations.lm_eval \
    --model=sghflm \
    --model_args=fsdp=True,tensor_parallel=True,apply_stainedglass=True ...

Currently, tensor parallelism and FSDP are only supported for Llama models.