Library#
A tiny example#
We can create a tiny benchmark programmatically:
```python
from benchtools import Bench

tiny_bench = Bench('Tiniest Demo', concept='the simplest test')
```
We can also create a simple task programmatically:
```python
from benchtools import Task

tt = Task('greeting', 'Hello there', 'hi', 'contains')
response = tt.run()
tt.score(response)
tiny_bench.add_task(tt)
```
There are multiple ways to create a Task object:
```python
add_task = Task.from_txt_csv('../../demos/folderbench/tasks/add')
tiny_bench.add_task(add_task)
```
For demo purposes we delete the folder, if it exists, before running.
```bash
%%bash
rm -rf tiniest_demo
```
We create a new folder for the benchmark to store it in the file system:
```python
tiny_bench.initialize_dir()
tiny_bench.run()
```
We can also load a pre-built benchmark from a YAML definition:

```python
pre_built_yml = Bench.from_yaml('../../demos/listbench')
pre_built_yml.written
```
We can access individual tasks:
```python
pre_built_yml.tasks['product'].variant_values
```

```python
pre_built_yml.run()
```
```python
demo_bench = Bench.from_yaml('../../demos/listbench')
```
Creating a Benchmark object#
- class benchtools.runner.BenchRunner(runner_type='ollama', model='gemma3:1b', api=None)#
A BenchRunner holds information about how a task is going to be run.
- class benchtools.runner.BenchRunnerList(runners: list[BenchRunner])#
A set of runners.
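For example, a minimal sketch of constructing runners with the signatures shown above (the second model tag is only illustrative):

```python
from benchtools.runner import BenchRunner, BenchRunnerList

# default runner per the signature above: the ollama backend with the gemma3:1b model
default_runner = BenchRunner()

# a runner pointed at a different local model; this tag is illustrative
other_runner = BenchRunner(runner_type='ollama', model='llama3.2:1b')

# group the runners into a single set
runners = BenchRunnerList(runners=[default_runner, other_runner])
```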
Benchmark class#
- class benchtools.benchmark.Bench(name, base_path='.', bench_path=None, concept=None, tasks=[])#
Benchmark with multiple tasks
- bench_name#
Name of the benchmark.
- Type:
str
- bench_path#
Path to where the benchmark folder and all its content reside
- Type:
str
- task_folder#
Path to tasks folder inside benchmark folder
- log_folder#
Path to logs folder inside benchmark folder
- tasks#
- Type:
list of Task objects
- is_built#
- Type:
bool
- initialize_dir()
Build the benchmark directory.
- add_task()
Add new tasks to the benchmark.
- run()
Run one task or all tasks of the benchmark.
- classmethod from_folders(bench_path)#
Load a benchmark object from a given path. The path should point to the benchmark folder.
Parameters:#
- bench_path: str
- The path to the benchmark folder. The folder should contain the about.md file,
tasks folder and logs folder.
Returns:#
- Bench
An instance of the Bench class with the loaded benchmark.
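For example, a sketch of loading an existing benchmark folder (the path is illustrative):

```python
from benchtools import Bench

# the folder must contain about.md, a tasks folder and a logs folder
bench = Bench.from_folders('my_benchmark')
print(bench.bench_name, len(bench.tasks))
```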
- classmethod from_yaml(bench_path)#
Load tasks from a YAML file, generate Task objects, and add them to the bench.
- Parameters:
bench_path (str) – Path to the YAML file containing task templates and values.
- Returns:
self – The Bench instance with tasks populated.
- Return type:
Bench
- init_repo(bench_path)#
Initialize the benchmark folder as a git repository with a Python .gitignore.
Parameters:#
- bench_path: str
The path to the benchmark folder
- initialize_dir(no_git=False)#
Write out the benchmark folder initially.
Parameters:#
- about_text: str
Description of the benchmark to be included in the about.md file
- no_git: bool
If True, skip initializing a git repository in the benchmark folder
- new_tasks: list of tuples (task_name, task_source)
List of tasks to be added to the benchmark. Each task is represented as a tuple containing the task name and the task source.
Returns:#
- self.written: bool
True if the benchmark was successfully built, False otherwise
- classmethod load(benchmark_path)#
Load a benchmark object from a given path. If the given path has a yaml tasks file, load tasks from it, generate Task objects, and add them to the bench. Otherwise load the bench object from existing task folders.
Parameters:#
- benchmark_path: str
- The path to the benchmark folder. The folder should contain the about.md file, and either a tasks.yaml file or a tasks folder.
Returns:#
- Bench
An instance of the Bench class with the loaded benchmark.
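A short sketch, reusing the demo benchmark path from the example above; load() falls back to the task folders when no yaml tasks file is present:

```python
from benchtools import Bench

demo_bench = Bench.load('../../demos/listbench')
```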
- run(runner=<benchtools.runner.BenchRunner object>, log_dir=None, score=False)#
Run the benchmark by running each task in the benchmark and logging the interactions.
Parameters:#
- runner: BenchRunner
Define which runner should be used for the task.
- log_dir: str
Path to where the logs should be saved
- score: bool
Whether to run scoring now or not
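A sketch of a full run with an explicit runner; the log directory is illustrative:

```python
from benchtools import Bench
from benchtools.runner import BenchRunner

bench = Bench.from_yaml('../../demos/listbench')
runner = BenchRunner(runner_type='ollama', model='gemma3:1b')

# run every task with the chosen runner, log to a custom folder,
# and score the responses immediately
bench.run(runner=runner, log_dir='logs/demo_run', score=True)
```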
- run_task(target_task=None, runner=<benchtools.runner.BenchRunner object>, log_dir=None, score=False)#
Run a specific task.
- score(model=None, task=None, run='last', collate=False)#
Score the logged runs of the benchmark, collecting scores for the selected models, tasks, and runs.
Parameters:#
- model: str or list
Model to score
- task: str or list
Task to score
- run: str or list
‘last’, ‘all’, runid or list of run ids
- Returns:
score_list – list of dictionaries of scores
- Return type:
list[dict]
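A sketch of scoring logged runs; the model tag and task name are illustrative and assume a run has already been logged:

```python
from benchtools import Bench

bench = Bench.from_yaml('../../demos/listbench')
bench.run()  # score() needs at least one logged run

# score the most recent run of one task for one model
scores = bench.score(model='gemma3:1b', task='product', run='last')
for entry in scores:  # a list of dictionaries of scores
    print(entry)
```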
Task class#
- class benchtools.task.Task(task_name, template, reference=None, scoring_function=None, variant_values=None, storage_type='yaml', description=None, prompt_id_generator_fx=<function concatenator_id_generator>, format='StringAnswer', source_path=None)#
Defines a basic prompt task with a simple scoring function.
- classmethod from_dict(task_dict, prompt_id_generator_fx=<function concatenator_id_generator>, source_path=None)#
Load a task from a dictionary. The dictionary should have the following structure:
- “template”: string
- “values”: list of dicts (optional)
- “reference”: string, number, or list of strings or numbers the same shape as variant values (optional)
- “scoring_function”: string or function handle (optional)
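A minimal sketch of that structure; the field values are illustrative and the {language} placeholder syntax is an assumption about how templates are filled:

```python
from benchtools import Task

task_dict = {
    "template": "Say hello in {language}.",                       # illustrative template
    "values": [{"language": "French"}, {"language": "Spanish"}],  # one prompt per dict
    "reference": ["bonjour", "hola"],                             # same shape as values
    "scoring_function": "contains",                               # scorer used in the tiny example above
}
greeting_task = Task.from_dict(task_dict)
```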
- classmethod from_example(task_name, storage_type)#
Make a blank task.
- classmethod from_hf_dataset(task_name, hf_path, prompt_column='prompt', answer_column='canonical_solution')#
The dataset must have columns ‘prompt’ and ‘canonical_solution’ for now; this can be expanded in the future.
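For instance, the openai_humaneval dataset on the Hugging Face hub exposes exactly those two columns, so a sketch of loading it could look like this (the task name is illustrative):

```python
from benchtools import Task

humaneval_task = Task.from_hf_dataset('humaneval', hf_path='openai_humaneval')
```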
- classmethod from_txt_csv(task_path, task_name=None, scoring_function=None, prompt_id_generator_fx=<function concatenator_id_generator>, source_path=None)#
Load a template from a txt file and create task objects for each row of a csv.
The folder must contain a template.txt file with the template, and a values.csv file with the values to fill in the template and the reference answers. It can optionally have an info.yml with additional settings.
- Parameters:
task_path (string or path) – where the task files are
task_name (string) – name
scoring_function (callable or string) – how to score the task
prompt_id_generator_fx (callable or string) – overruled if an ‘id’ column is present in values.csv
source_path (string or file buffer) – path to custom code
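As an illustration of that folder layout, a sketch that writes a minimal template.txt and values.csv and then loads them; the placeholder syntax and the reference column name are assumptions:

```python
import pathlib
from benchtools import Task

task_dir = pathlib.Path('tasks/add')           # illustrative location
task_dir.mkdir(parents=True, exist_ok=True)

# template with placeholders to be filled from the csv columns (assumed syntax)
(task_dir / 'template.txt').write_text('What is {a} plus {b}?')

# one task variant per row; the reference column name is an assumption
(task_dir / 'values.csv').write_text('a,b,reference\n1,2,3\n10,5,15\n')

add_task = Task.from_txt_csv(task_dir, task_name='add')
```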
- classmethod from_yaml(source_path, task_name=None, scoring_function=None)#
Load a task from a yaml file. The yaml file should have the following structure:
- name: string
- template: string
- values: list of dicts (optional)
- reference: “calculated” or string, number, or list of strings or numbers the same shape as variant values (optional)
- scoring_function: string or function handle (optional)
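A sketch of a matching yaml file and loading it; the file name, field values, and {country} placeholder syntax are illustrative:

```python
import pathlib
from benchtools import Task

yaml_text = """\
name: capitals
template: "What is the capital of {country}?"
values:
  - {country: France}
  - {country: Japan}
reference:
  - Paris
  - Tokyo
scoring_function: contains
"""
pathlib.Path('capitals.yaml').write_text(yaml_text)

capitals_task = Task.from_yaml('capitals.yaml')
```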
- generate_prompts()#
If the task is a template-based task, generate the prompts by filling in the template with the variant values.
- get_bench_data()#
Get the data for the benchmark info file, which includes the name and storage type.
- static parse_scorer(scoring_function, source_path)#
Parse a scorer from input into a callable.
- run(runner=<benchtools.runner.BenchRunner object>, log_dir='logs', benchmark=None, bench_path=None, score=False)#
Run the task on the stated model and log the interactions.
- Parameters:
runner (BenchRunner) – define which runner should be used for the task.
log_dir (str) – Path to where the logs should be saved. If empty, a log folder will be created in the current working directory
- Returns:
response – model response(s)
- Return type:
list
- score(response, prompt_id=None)#
Score the response using the defined function.
- Parameters:
response (string) – the value to score
- write(target_path)#
Write the task.
- write_csv(target_folder)#
Write the task to a csv file with a task.txt template file.
- write_yaml(target_path)#
Write the task to a yaml file.
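A short sketch of persisting a task in either storage format; the target paths are illustrative:

```python
from benchtools import Task

tt = Task('greeting', 'Hello there', 'hi', 'contains')

# yaml storage: one self-contained file per task
tt.write_yaml('greeting.yaml')

# csv storage: a folder holding the template file and a values csv
tt.write_csv('greeting_task')
```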
BetterBench#