Skip to content

Evaluating Stained Glass Transform For LLMs

We can evaluate a Stained Glass Transform for LLMs against a wide variety of benchmarks using Stained Glass Core's extension of Eleuther AI's lm_eval harness, i.e., sglm_eval.

sglm_eval

The sglm_eval is a CLI utility for evaluating Stained Glass Transforms for LLMs.

Installation

sglm_eval comes bundled within the stainedglass_core package. Please refer to installation steps.

Usage

sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,peft_id=<peft_adapter_directory>,peft_adapter_name=<adapter_name>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
    --num_fewshot=0 \
    --output_path /path/to/output/file

Command Breakdown

More details about lm_eval CLI here. In general sglm_eval extends the options available through lm_eval.

  • --model=sghflm: Specify the evaluation model class (Currently only supports sghflm which extends the lm_eval.models.huggingface.HFLM class).
  • --model_args:
  • parallelize=True: Enables pipeline parallelization.
  • apply_stainedglass=True: Applies Stained Glass Transform to evaluation class. Set this to False to evaluate baseline models.
  • transform_model_path: Set path to your Stained Glass Transform file.
  • pretrained=<base_model_directory>: Set path to the base pretrained model weights.
  • max_length: Set the maximum token length for processing.
  • dtype: Configures PyTorch tensor dtype.
  • seed: Set seed specifically for StainedGlassTransformForText.
  • peft: The path to the peft adapters to evaluate peft models.
  • --tasks: List of evaluation tasks to run. More details about --tasks here:
  • arc_challenge: Advanced Reasoning Challenge (hard).
  • arc_easy: Advanced Reasoning Challenge (easy).
  • openbookqa: Open Book Question Answering.
  • piqa: Physical Interaction Question Answering.
  • truthfulqa_mc2: TruthfulQA (multiple-choice format, version 2).
  • hellaswag: Benchmark for commonsense reasoning and next-step prediction.
  • mmlu: Massive Multitask Language Understanding (MMLU) evaluates a model's ability to handle tasks across a wide variety of subjects.
  • --device: Specifies device to be used for evaluation.
  • --batch_size: Specify batch size for model evaluation.
  • --trust_remote_code: Allows execution of remote code, required for certain models or extensions.
  • --wandb_args: Configures Weights & Biases (W&B) for tracking evaluation runs. More details about --wandb_args here:
  • project: Name of the W&B project.
  • entity: W&B entity (team or user).
  • resume: Manage resumption of an interrupted run.
  • name: Custom W&B run name.
  • job_type: Specifies W&B job type."
  • --num_fewshot=0: Sets the number of few-shot examples to place in context. Must be an integer.
  • --output_path: Path to save the evaluation results.
  • --apply_chat_template: If True, apply chat template to the prompt. Currently only supports True which is also the default.

Parallelization

lm_eval harness supports data parallelism and model parallelism (see details here). For multi-GPU evaluation, use the standard lm_eval parallelization approaches as described in the lm_eval documentation.

Evaluating via Stained Glass Proxy

Instead of loading the model and SGT transform directly inside sglm_eval, you can evaluate against a running Stained Glass proxy server that exposes an OpenAI-compatible chat completions endpoint. This is the recommended approach when:

  • The model is served via vLLM (better throughput, continuous batching).
  • You want to decouple the inference server from the evaluation harness.
  • The SGT transform is applied server-side and the harness only needs to send requests.

Prerequisites

Start the Stained Glass proxy before running evaluation. The proxy must be reachable at a known base_url (default url in Stained Glass proxy: http://localhost:8600/v1/chat/completions) and must be configured with the desired base model, SGT transform, and PEFT adapter.

Example Usage

sglm_eval --model=local-chat-completions \
    --model_args=model=<proxy_model_name>,base_url=http://localhost:<port>/v1/chat/completions \
    --tasks=<tasks_to_evaluate>\
    --batch_size=auto \
    --trust_remote_code \
    --output_path ./eval_results \
    --log_samples \
    --wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
    --gen_kwargs=<generation_arguments>,continue_final_message=True,add_generation_prompt=False

Command Breakdown

  • --model=local-chat-completions: Tells sglm_eval to send requests to an OpenAI-compatible chat completions endpoint rather than loading a model locally.
  • --model_args:
  • model=<proxy_model_name>: The model identifier registered in the proxy server (e.g., feasible-capybara-57). Must match the deployment name the proxy expects.
  • base_url=http://localhost:<port>/v1/chat/completions: Full URL of the proxy's chat completions endpoint.
  • --tasks: Same task identifiers as the direct sghflm approach (also see Task list above).
  • --batch_size: Use auto to let the harness determine batch size, or an integer. With a proxy backend, this controls how many requests are in-flight concurrently.
  • --gen_kwargs: Generation parameters forwarded to the proxy server:
  • temperature, top_p, top_k, min_p: Sampling parameters.
  • continue_final_message=True: Whether to continue generating from the last assistant turn. Set this to True if your query ends with an assistant turn and you want to complete this assistant turn.
  • add_generation_prompt=False: Whether to append an additional generation prompt token. Set to False if your query ends with an assistant turn and you want to complete this assistant turn.
  • Do not set continue_final_message and add_generation_prompt to both True or False simultaneously since they are contradicting settings.
  • --output_path: Directory where evaluation results are saved.
  • --log_samples: Save individual sample predictions alongside aggregate metrics.
  • --wandb_args: Same W&B tracking options as the direct approach.

Note

The --device flag is not required when using a proxy backend — the model runs on the server, not locally. Similarly, --num_fewshot defaults to auto; override with an integer if the task requires a fixed number of few-shot examples.