Library#

A tiny example#

We can create a tiny benchmark programmatically:

from benchtools import Bench

tiny_bench = Bench('Tiniest Demo', concept='the simplest test')

We can also create a simple task programmatically:

from benchtools import Task

tt = Task('greeting', 'Hello there', 'hi', 'contains')
response = tt.run()
tt.score(response)
tiny_bench.add_task(tt)

There are multiple ways to create a Task object:

add_task = Task.from_txt_csv('../../demos/folderbench/tasks/add')
tiny_bench.add_task(add_task)

For demo purposes we delete the folder, if it exists, before running.

%%bash
rm -rf tiniest_demo

We create a new folder for the benchmark to store it in the file system:

tiny_bench.initialize_dir()
tiny_bench.run()
We can also load a pre-built benchmark from a YAML definition and check whether it has been written out:

pre_built_yml = Bench.from_yaml('../../demos/listbench')
pre_built_yml.written

We can access individual tasks:

pre_built_yml.tasks['product'].variant_values
pre_built_yml.run()
demo_bench = Bench.from_yaml('../../demos/listbench')

Creating a Benchmark object#

class benchtools.runner.BenchRunner(runner_type='ollama', model='gemma3:1b', api=None)#

A BenchRunner holds information about how a task is going to be run.

class benchtools.runner.BenchRunnerList(runners: list[BenchRunner])#

A set of runners.

classmethod from_file(file_path)#

Load from a YAML file. The file can have a list with values for all 3 fields or a single set of values. The model key can take a list; missing values get the defaults.

Parameters:#

file_path: path or string

Path to a file or a directory containing runner.yml
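As a minimal sketch, runners can also be constructed directly with the documented defaults, or loaded from a runner.yml; the file contents and path below are assumptions for illustration.

from benchtools.runner import BenchRunner, BenchRunnerList

# Defaults per the signature above: runner_type='ollama', model='gemma3:1b', api=None.
default_runner = BenchRunner()
other_runner = BenchRunner(model='gemma3:4b')  # model value is illustrative
runner_list = BenchRunnerList(runners=[default_runner, other_runner])

# Load runners from a YAML file; the model key may be a list and missing
# values fall back to the defaults. The path is hypothetical.
runners_from_file = BenchRunnerList.from_file('runner.yml')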

Benchmark class#

class benchtools.benchmark.Bench(name, base_path='.', bench_path=None, concept=None, tasks=[])#

Benchmark with multiple tasks

bench_name#

Name of the benchmark.

Type:

str

bench_path#

Path to where the benchmark folder and all its content reside

Type:

str

task_folder#

Path to the tasks folder inside the benchmark folder

log_folder#

Path to logs folder inside benchmark folder

tasks#
Type:

list of Task objects

is_built#
Type:

bool

initialize_dir()

Build the benchmark directory.

add_task()

Add new tasks to the benchmark.

run()

Run one task or all tasks of the benchmark.

classmethod from_folders(bench_path)#

Load a benchmark object from a given path. The path should point to the benchmark folder.

Parameters:#

bench_path: str

The path to the benchmark folder. The folder should contain the about.md file, the tasks folder, and the logs folder.

Returns:#

Bench

An instance of the Bench class with the loaded benchmark.
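For illustration, loading an existing benchmark folder might look like the following; the path is hypothetical and is assumed to contain the about.md file, tasks folder, and logs folder described above.

from benchtools import Bench

folder_bench = Bench.from_folders('../../demos/folderbench')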

classmethod from_yaml(bench_path)#

Load tasks from a YAML file, generate Task objects, and add them to the bench.

Parameters:

bench_path (str) – Path to the YAML file containing task templates and values.

Returns:

self – The Bench instance with tasks populated.

Return type:

Bench

init_repo(bench_path)#

Initialize the benchmark folder as a git repo with a .gitignore for Python

Parameters:#

bench_path: str

The path to the benchmark folder

initialize_dir(no_git=False)#

Write out the benchmark folder initially.

Parameters:#

about_text: str

Description of the benchmark to be included in the about.md file.

no_git: bool

If True, do not initialize a git repository in the benchmark folder.

new_tasks: list of tuples (task_name, task_source)

List of tasks to be added to the benchmark. Each task is represented as a tuple containing the task name and the task source.

Returns:#

self.written: bool

True if the benchmark was successfully built, False otherwise

classmethod load(benchmark_path)#

Load a benchmark object from a given path. If the given path has a YAML tasks file, load tasks from it, generate Task objects, and add them to the bench. Otherwise load the bench object from existing task folders.

Parameters:#

bench_path: str

The path to the benchmark folder. The folder should contain the about.md file, and either a tasks.yaml file or a tasks folder.

Returns:#

Bench

An instance of the Bench class with the loaded benchmark.
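A small sketch of the dispatching behaviour described above; both paths are hypothetical.

from benchtools import Bench

# If the folder has a tasks.yaml file, tasks are generated from it;
# otherwise the bench is loaded from its existing task folders.
yaml_bench = Bench.load('../../demos/listbench')
folder_bench = Bench.load('../../demos/folderbench')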

run(runner=<benchtools.runner.BenchRunner object>, log_dir=None, score=False)#

Run the benchmark by running each task in the benchmark and logging the interactions.

Parameters:#

runner: BenchRunner

Define which runner should be used for the task.

log_dir: str

Path to where the logs should be saved.

score: bool

Whether to run scoring now or not.
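As a hedged example, running a benchmark with an explicit runner and immediate scoring might look like this; the log path is illustrative.

from benchtools import Bench
from benchtools.runner import BenchRunner

bench = Bench.from_yaml('../../demos/listbench')
runner = BenchRunner(runner_type='ollama', model='gemma3:1b')

# Run every task, write the logs to a custom folder, and score right away.
bench.run(runner=runner, log_dir='logs/demo_run', score=True)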

run_task(target_task=None, runner=<benchtools.runner.BenchRunner object>, log_dir=None, score=False)#

Run a specific task.

score(model=None, task=None, run='last', collate=False)#

Score the benchmark by scoring the logged responses for each task.

Parameters:#

model: str, list

Model(s) to score.

task: str, list

Task(s) to score.

run: str or list

‘last’, ‘all’, runid or list of run ids

Returns:

score_list – list of dictionaries of scores

Return type:

list[dict]
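A short usage sketch; the model and task names are taken from the examples above and may need adjusting.

from benchtools import Bench

bench = Bench.from_yaml('../../demos/listbench')
bench.run()

# Score the most recent run of one task for one model; returns a list of dicts.
scores = bench.score(model='gemma3:1b', task='product', run='last')
for entry in scores:
    print(entry)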

Task class#

class benchtools.task.Task(task_name, template, reference=None, scoring_function=None, variant_values=None, storage_type='yaml', description=None, prompt_id_generator_fx=<function concatenator_id_generator>, format='StringAnswer', source_path=None)#

defines a basic prompt task with a simple scoring function

classmethod from_dict(task_dict, prompt_id_generator_fx=<function concatenator_id_generator>, source_path=None)#

Load a task from a dictionary. The dictionary should have the following structure:

{
"template": string,
"values": list of dicts (optional),
"reference": string, number, or list of strings or numbers the same shape as variant values (optional),
"scoring_function": string or function handle (optional)
}
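A minimal sketch of the dictionary form; the concrete values and the placeholder syntax in the template are assumptions.

from benchtools import Task

capitals_dict = {
    'template': 'What is the capital of {country}?',  # placeholder syntax assumed
    'values': [{'country': 'France'}, {'country': 'Japan'}],
    'reference': ['Paris', 'Tokyo'],
    'scoring_function': 'contains',
}
capitals_task = Task.from_dict(capitals_dict)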

classmethod from_example(task_name, storage_type)#

make a blank task

classmethod from_hf_dataset(task_name, hf_path, prompt_column='prompt', answer_column='canonical_solution')#

The dataset must have columns ‘prompt’ and ‘canonical_solution’ for now; this can be expanded in the future.
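For example, a HumanEval-style dataset that already has ‘prompt’ and ‘canonical_solution’ columns could be loaded as follows; the dataset path is an assumption.

from benchtools import Task

# Column names default to 'prompt' and 'canonical_solution'.
humaneval_task = Task.from_hf_dataset('humaneval', 'openai/openai_humaneval')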

classmethod from_txt_csv(task_path, task_name=None, scoring_function=None, prompt_id_generator_fx=<function concatenator_id_generator>, source_path=None)#

load a template from txt and create task objects for each row of a csv

The folder must contain a template.txt file with the template, and a values.csv file with the values to fill in the template and the reference answers. It can optionally have an info.yml with additional settings.

Parameters:
  • task_path (string or path) – where the task files are

  • task_name (string) – name

  • scoring_function (callable or string) – how to score the task

  • prompt_id_generator_fx (callable or string) – overruled if an ‘id’ column is present in values.csv

  • source_path (string or file buffer) – path to custom code
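A sketch of the expected folder layout and the call; the folder name and scoring function are illustrative.

from benchtools import Task

# Expected layout, per the description above:
#   my_task/
#     template.txt   (the prompt template)
#     values.csv     (values to fill the template, plus the reference answers)
#     info.yml       (optional additional settings)
my_task = Task.from_txt_csv('my_task', task_name='my_task', scoring_function='contains')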

classmethod from_yaml(source_path, task_name=None, scoring_function=None)#

Load a task from a YAML file. The YAML file should have the following structure:

name: string
template: string
values: list of dicts (optional)
reference: "calculated" or string, number, or list of strings or numbers the same shape as variant values (optional)
scoring_function: string or function handle (optional)

generate_prompts()#

If the task is a template-based task, generate the prompts by filling in the template with the variant values.
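A small sketch of a template-based task; the placeholder syntax and values are assumptions.

from benchtools import Task

squares = Task(
    'squares',                    # task_name
    'What is {x} squared?',       # template with a placeholder
    reference=['4', '9'],
    scoring_function='contains',
    variant_values=[{'x': 2}, {'x': 3}],
)

# Fill the template with each set of variant values to produce the prompts.
squares.generate_prompts()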

get_bench_data()#

get the data for the benchmark info file, which includes the name and storage type.

static parse_scorer(scoring_function, source_path)#

parse a scorer from input into a callable

run(runner=<benchtools.runner.BenchRunner object>, log_dir='logs', benchmark=None, bench_path=None, score=False)#

run the task on the stated model and log the interactions.

Parameters:
  • runner (BenchRunner) – define which runner should be used for the task.

  • log_dir (str) – Path to where the logs should be saved. If empty a log folder will be created in the current working directory

Returns:

response – model response(s)

Return type:

list

score(response, prompt_id=None)#

score the response using the defined function

Parameters:

response (string) – the value to score

write(target_path)#

write the task

write_csv(target_folder)#

write the task to a csv file with a task.txt template file

write_yaml(target_path)#

write the task to a yaml file
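A hedged sketch of writing a task out; the target paths are illustrative.

from benchtools import Task

tt = Task('greeting', 'Hello there', 'hi', 'contains')

# Write the task as a single YAML file, or as a csv file with an accompanying template file.
tt.write_yaml('greeting.yaml')
tt.write_csv('greeting_task')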
