Demo Usage

Install from Source

Installing from source means you can pull to update.

First, clone the repo:

Cloning into 'benchtools'...
remote: Enumerating objects: 908, done.
remote: Counting objects: 100% (277/277), done.
remote: Compressing objects: 100% (165/165), done.
remote: Total 908 (delta 145), reused 170 (delta 83), pack-reused 631 (from 2)
Receiving objects: 100% (908/908), 2.34 MiB | 405.00 KiB/s, done.
Resolving deltas: 100% (513/513), done.

Cloning creates a folder:

benchtools

Then install:

Important

The argument needs to be benchtools/ (with the trailing slash) so that pip treats it as a path; a bare benchtools will make pip try to pull the package from PyPI. Alternatively, cd benchtools and then pip install .

Processing ./benchtools
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: click in /Users/brownsarahm/miniforge3/lib/python3.12/site-packages (from benchtools==0.2.0) (8.3.0)

Successfully built benchtools
Installing collected packages: benchtools
  Attempting uninstall: benchtools
    Found existing installation: benchtools 0.2.0
    Uninstalling benchtools-0.2.0:
      Successfully uninstalled benchtools-0.2.0
Successfully installed benchtools-0.2.0

Note

The output above is truncated, but the last few lines are the most important.
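To confirm the install from Python rather than by reading the pip output, a minimal sketch using the standard library's importlib.metadata is shown here. The helper name installed_version is hypothetical; the distribution name benchtools is taken from the pip output above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# None here means the install did not succeed in this environment;
# otherwise the value should match the version pip reported.
print(installed_version("benchtools"))
```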

Exploring the Demo benchmarks

We have two tiny demo benchmarks in a demos folder:

folderbench	listbench

benchtools supports two formats for storing tasks:

  • a list of tasks in a single .yml file

  • a set of folders, with each task having its own folder.
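As a rough illustration of the two formats, the sketch below guesses which one a benchmark uses from its top-level entries. It assumes the layouts of the two demos (a tasks.yml file for the list format, a tasks folder for the folder format); the function storage_format is hypothetical and is not part of benchtools.

```python
import tempfile
from pathlib import Path


def storage_format(bench_dir: Path) -> str:
    """Guess how a benchmark stores its tasks from its top-level entries."""
    if (bench_dir / "tasks.yml").is_file():
        return "yaml"      # all tasks listed in a single file
    if (bench_dir / "tasks").is_dir():
        return "folder"    # one sub-folder per task
    return "unknown"


# Build two throwaway layouts that mirror the demos and classify them.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "listbench").mkdir()
    (root / "listbench" / "tasks.yml").write_text("")
    (root / "folderbench" / "tasks" / "add").mkdir(parents=True)
    formats = {
        name: storage_format(root / name)
        for name in ("listbench", "folderbench")
    }

print(formats)
```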

Let’s examine the folder-based example first:

README.md	tasks

The tasks folder is the main content:

It has two tasks, with one folder for each:

add	symbols

We can look inside one:

It has two files, a template and a values file:

template.txt	values.csv
what is {a} + {b}?
a,b,reference
2,3,5
4,5,9
8,9,17

Important

The columns in the CSV match the variables in {} in the template, plus a reference column for the answer (this can be empty, but the heading should be there) and, optionally, an id column if you have an alternative naming scheme for the subtasks (each row is a subtask).
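To make the relationship between the template and the CSV concrete, here is a minimal sketch that expands the add task into one prompt per row. It uses Python's str.format for the substitution, which is an assumption for illustration; how benchtools fills the template internally is not shown here.

```python
import csv
import io

# Contents of template.txt and values.csv from the add task above.
template = "what is {a} + {b}?"
values_csv = """a,b,reference
2,3,5
4,5,9
8,9,17
"""

subtasks = []
for row in csv.DictReader(io.StringIO(values_csv)):
    # The reference column is the expected answer, not a template variable.
    reference = row.pop("reference")
    # Remaining columns fill the {a} and {b} placeholders.
    prompt = template.format(**row)
    subtasks.append((prompt, reference))

for prompt, reference in subtasks:
    print(f"{prompt} -> {reference}")
```

Each row yields one subtask, so the three rows above give three prompts.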

We can look at the other task too:

symb, reference
@, at
#, pound
\$, dollar sign

Running a benchmark

Then go back to the top of the benchmark folder.

We can see the help for the command:

Usage: benchtool run [OPTIONS] BENCHMARK_PATH

  Running the benchmark and generating logs , help="The path to the benchmark
  repository where all the task reside."

Options:
  -r, --runner-type [ollama|openai|aws]
                                  The engine that will run your LLM.
  -m, --model TEXT                The LLM to be benchmarked.
  -a, --api-url TEXT              The api call required to access the runner
                                  engine.
  -l, --log-path TEXT             The path to a log directory.
  --help                          Show this message and exit.
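The options in the help text can be combined into a single invocation. The sketch below only assembles the argument list, using the flags listed above. The helper run_command is hypothetical, and the default values (ollama as the runner, gemma3 as the model) are assumptions chosen for illustration, not documented defaults.

```python
import shlex
from typing import List, Optional


def run_command(benchmark_path: str, runner: str = "ollama",
                model: str = "gemma3",
                log_path: Optional[str] = None) -> List[str]:
    """Assemble a `benchtool run` argument list from the documented options."""
    cmd = ["benchtool", "run", "--runner-type", runner, "--model", model]
    if log_path is not None:
        cmd += ["--log-path", log_path]
    # BENCHMARK_PATH is the positional argument.
    return cmd + [benchmark_path]


cmd = run_command("listbench/")
# Print the shell form; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
print(shlex.join(cmd))
```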

Warning

This will be filled in later.

We can run a benchmark by name:

Running list_bench now
info.yml	logs		tasks.yml

It creates a logs folder if one does not already exist.

Exploring a YAML benchmark

gemma3

Inside logs there is one folder per model (here, gemma3).

product	symbol

Then there is one folder per task.

1771533769

Then there is one folder per run, named by the timestamp of the run's start.
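The nesting can be read back out of a log path. This sketch splits the log_path that the run records in run_info.yml and converts the run folder name to a date. Two assumptions are made here: that the layout is logs/<model>/<task>/<run_id>, and that the run id is a Unix timestamp in seconds.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

log_path = PurePosixPath("listbench/logs/gemma3/product/1771533769")

# Assumed layout: <benchmark>/logs/<model>/<task>/<run_id>
model, task, run_id = log_path.parts[-3:]

# Assumption: the run id is a Unix timestamp in seconds.
started = datetime.fromtimestamp(int(run_id), tz=timezone.utc)

print(model, task, started.isoformat())
```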

product_2-3	product_3-4	product_5-5	run_info.yml
bench_name: list_bench
bench_path: listbench/
description: null
id_generator: concatenator_id_generator
log_path: listbench/logs/gemma3/product/1771533769
name: product
reference:
- 6
- 12
- 25
run_id: '1771533769'
scorer: exact_match
template: find the product of {a} and {b}
values:
- a: 2
  b: 3
- a: 3
  b: 4
- a: 5
  b: 5

It stores the overall information for the run in run_info.yml,
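The prompt folder names (product_2-3, product_3-4, product_5-5) appear to combine the task name with the values of each row. The sketch below reproduces that pattern from the run information. The function concat_id is a guess at the naming rule, written for illustration; it is not the actual concatenator_id_generator.

```python
# The relevant fields of run_info.yml, written out as a Python dict.
task = {
    "name": "product",
    "template": "find the product of {a} and {b}",
    "values": [{"a": 2, "b": 3}, {"a": 3, "b": 4}, {"a": 5, "b": 5}],
    "reference": [6, 12, 25],
}


def concat_id(name, row):
    """Prefix the task name and join the row's values with '-' (guessed rule)."""
    return f"{name}_" + "-".join(str(v) for v in row.values())


# One (prompt name, prompt text, reference answer) triple per row of values.
subtasks = [
    (concat_id(task["name"], row), task["template"].format(**row), ref)
    for row, ref in zip(task["values"], task["reference"])
]

for prompt_name, prompt, ref in subtasks:
    print(prompt_name, "|", prompt, "|", ref)
```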

log.json	log.txt

and a log for each prompt, in both text and JSON format:

response ------
The product of 2 and 3 is 2 * 3 = 6.

So the answer is $\boxed{6}$.
{
    "task_name": "product",
    "template": "find the product of {a} and {b}",
    "prompt_name": "product_2-3",
    "error": "None",
    "steps": {
        "0": {
            "prompt": "find the product of 2 and 3",
            "response": "The product of 2 and 3 is 2 * 3 = 6.\n\nSo the answer is $\\boxed{6}$."
        }
    }
}
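Because the JSON log is structured, it is straightforward to read back for analysis. A minimal sketch, using only the fields visible in the log above, that walks the steps in order and pulls out each prompt and response:

```python
import json

# The log.json shown above, embedded as a raw string so the JSON escapes survive.
log_text = r'''
{
    "task_name": "product",
    "template": "find the product of {a} and {b}",
    "prompt_name": "product_2-3",
    "error": "None",
    "steps": {
        "0": {
            "prompt": "find the product of 2 and 3",
            "response": "The product of 2 and 3 is 2 * 3 = 6.\n\nSo the answer is $\\boxed{6}$."
        }
    }
}
'''

log = json.loads(log_text)

# Step keys are strings, so sort them numerically.
for step_id, step in sorted(log["steps"].items(), key=lambda kv: int(kv[0])):
    print(f"[{log['prompt_name']} step {step_id}] {step['prompt']}")
    print(step["response"])

final_response = log["steps"]["0"]["response"]
```

In practice the text would come from reading the log.json file in a prompt's folder rather than from an embedded string.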

Initializing a new benchmark

Usage: benchtool [OPTIONS] COMMAND [ARGS]...

  BenchTools is a tool that helps researchers set up benchmarks.

Options:
  --help  Show this message and exit.

Commands:
  add-task  Set up a new task.
  init      Initializes a new benchmark.
  run       Running the benchmark and generating logs , help="The path to...
  run-task  Running the tasks and generating logs
Usage: benchtool init [OPTIONS] [BENCHMARK_NAME]

  Initializes a new benchmark.

  Benchmark-name is required, if not provided, requested interactively.

  this command creates the folder for the benchmark.

Options:
  -p, --path TEXT   The path where the new benchmark repository will be placed
  -a, --about TEXT  Benchmark describtion. Content will go in the about.md
                    file
  --no-git          Don't make benchmark a git repository. Default is False
  --help            Show this message and exit.

It asks questions interactively.

benchtools	example

about.md	info.yml	tasks
bench_name: example
concept: in class example benchmark
tasks:
- id: animal
  name: animal
  storage_type: yaml
- description: 'give your task a short description '
  id_generator: concatenator_id_generator
  name: animal
  reference: ''
  scorer: exact_match
  template: Your {noun} for the model here with values that should vary              denoted
    in brackets. {verb} matching  keys below
  values:
    noun:
    - text
    - task
    verb:
    - use
    - select
- description: 'animal identifcaiton '
  id_generator: concatenator_id_generator
  name: animal
  reference: ['zebra', 'tiger', "cheetah"]
  scorer: exact_match
  template: an animal has a {pattern}, {feet}, and {skin}. what kind of animal is it?
  values:
    pattern:
    - stripes
    - stripes
    - spots
    skin:
    - hairy
    - hairy
    - hairy
    feet: 
    - hooves
    - paws
    - paws

Get updates

Tip

Watch the repo to get notifications for important updates.

Then update by pulling

and re-installing: