Evaluating Stained Glass Transform For LLMs¶

We can evaluate a Stained Glass Transform for LLMs against a wide variety of benchmarks using Stained Glass Core's extension of Eleuther AI's lm_eval harness, i.e., sglm_eval.

`sglm_eval`¶

The sglm_eval is a CLI utility for evaluating Stained Glass Transforms for LLMs.

Installation¶

sglm_eval comes bundled within the stainedglass_core package. Please refer to installation steps.

Usage¶

sglm_eval --model=sghflm \
    --model_args=parallelize=True,apply_stainedglass=True,transform_model_path=<enter_sgt_for_text_file_path>,pretrained=<base_model_directory>,peft_id=<peft_adapter_directory>,peft_adapter_name=<adapter_name>,max_length=8192,dtype=bfloat16 \
    --tasks=arc_challenge,arc_easy,openbookqa,piqa,truthfulqa_mc2 \
    --device=cuda \
    --batch_size=20 \
    --trust_remote_code \
    --wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
    --num_fewshot=0 \
    --output_path /path/to/output/file

Command Breakdown¶

More details about lm_eval CLI here. In general sglm_eval extends the options available through lm_eval.

--model=sghflm: Specify the evaluation model class (Currently only supports sghflm which extends the lm_eval.models.huggingface.HFLM class).
--model_args:
parallelize=True: Enables pipeline parallelization.
apply_stainedglass=True: Applies Stained Glass Transform to evaluation class. Set this to False to evaluate baseline models.
transform_model_path: Set path to your Stained Glass Transform file.
pretrained=<base_model_directory>: Set path to the base pretrained model weights.
max_length: Set the maximum token length for processing.
dtype: Configures PyTorch tensor dtype.
seed: Set seed specifically for StainedGlassTransformForText.
peft: The path to the peft adapters to evaluate peft models.
--tasks: List of evaluation tasks to run. More details about --tasks here:
arc_challenge: Advanced Reasoning Challenge (hard).
arc_easy: Advanced Reasoning Challenge (easy).
openbookqa: Open Book Question Answering.
piqa: Physical Interaction Question Answering.
truthfulqa_mc2: TruthfulQA (multiple-choice format, version 2).
hellaswag: Benchmark for commonsense reasoning and next-step prediction.
mmlu: Massive Multitask Language Understanding (MMLU) evaluates a model's ability to handle tasks across a wide variety of subjects.
--device: Specifies device to be used for evaluation.
--batch_size: Specify batch size for model evaluation.
--trust_remote_code: Allows execution of remote code, required for certain models or extensions.
--wandb_args: Configures Weights & Biases (W&B) for tracking evaluation runs. More details about --wandb_args here:
project: Name of the W&B project.
entity: W&B entity (team or user).
resume: Manage resumption of an interrupted run.
name: Custom W&B run name.
job_type: Specifies W&B job type."
--num_fewshot=0: Sets the number of few-shot examples to place in context. Must be an integer.
--output_path: Path to save the evaluation results.
--apply_chat_template: If True, apply chat template to the prompt. Currently only supports True which is also the default.

Parallelization¶

lm_eval harness supports data parallelism and model parallelism (see details here). For multi-GPU evaluation, use the standard lm_eval parallelization approaches as described in the lm_eval documentation.

Evaluating via Stained Glass Proxy¶

Instead of loading the model and SGT transform directly inside sglm_eval, you can evaluate against a running Stained Glass proxy server that exposes an OpenAI-compatible chat completions endpoint. This is the recommended approach when:

The model is served via vLLM (better throughput, continuous batching).
You want to decouple the inference server from the evaluation harness.
The SGT transform is applied server-side and the harness only needs to send requests.

Prerequisites¶

Start the Stained Glass proxy before running evaluation. The proxy must be reachable at a known base_url (default url in Stained Glass proxy: http://localhost:8600/v1/chat/completions) and must be configured with the desired base model, SGT transform, and PEFT adapter.

Example Usage¶

sglm_eval --model=local-chat-completions \
    --model_args=model=<proxy_model_name>,base_url=http://localhost:<port>/v1/chat/completions \
    --tasks=<tasks_to_evaluate>\
    --batch_size=auto \
    --trust_remote_code \
    --output_path ./eval_results \
    --log_samples \
    --wandb_args=project=<wandb_project_name>,entity=core,resume=allow,name=<custom-run-name>,job_type=eval \
    --gen_kwargs=<generation_arguments>,continue_final_message=True,add_generation_prompt=False

Command Breakdown¶

--model=local-chat-completions: Tells sglm_eval to send requests to an OpenAI-compatible chat completions endpoint rather than loading a model locally.
--model_args:
model=<proxy_model_name>: The model identifier registered in the proxy server (e.g., feasible-capybara-57). Must match the deployment name the proxy expects.
base_url=http://localhost:<port>/v1/chat/completions: Full URL of the proxy's chat completions endpoint.
--tasks: Same task identifiers as the direct sghflm approach (also see Task list above).
--batch_size: Use auto to let the harness determine batch size, or an integer. With a proxy backend, this controls how many requests are in-flight concurrently.
--gen_kwargs: Generation parameters forwarded to the proxy server:
temperature, top_p, top_k, min_p: Sampling parameters.
continue_final_message=True: Whether to continue generating from the last assistant turn. Set this to True if your query ends with an assistant turn and you want to complete this assistant turn.
add_generation_prompt=False: Whether to append an additional generation prompt token. Set to False if your query ends with an assistant turn and you want to complete this assistant turn.
Do not set continue_final_message and add_generation_prompt to both True or False simultaneously since they are contradicting settings.
--output_path: Directory where evaluation results are saved.
--log_samples: Save individual sample predictions alongside aggregate metrics.
--wandb_args: Same W&B tracking options as the direct approach.

Note

The --device flag is not required when using a proxy backend — the model runs on the server, not locally. Similarly, --num_fewshot defaults to auto; override with an integer if the task requires a fixed number of few-shot examples.

Evaluating Stained Glass Transform For LLMs¶

sglm_eval¶

Installation¶

Usage¶

Command Breakdown¶

Parallelization¶

Evaluating via Stained Glass Proxy¶

Prerequisites¶

Example Usage¶

Command Breakdown¶

`sglm_eval`¶