Library#
A tiny example#
We can create a tiny benchmark programmatically:
```python
from benchtools import Bench

tiny_bench = Bench('Tiniest Demo', concept='the simplest test')
```
We can also create a simple task programmatically:
```python
from benchtools import Task

tt = Task('greeting', 'Hello there', 'hi', 'contains')
response = tt.run()
tt.score(response)
tiny_bench.add_task(tt)
```
There are multiple ways to create a Task object:
```python
add_task = Task.from_txt_csv('../../demos/folderbench/tasks/add')
tiny_bench.add_task(add_task)
```
For demo purposes we delete the folder, if it exists, before running.
```bash
%%bash
rm -rf tiniest_demo
```
We create a new folder for the benchmark to store it in the file system:
```python
tiny_bench.initialize_dir()
tiny_bench.run()
```
We can also load a pre-built benchmark from a YAML definition:

```python
pre_built_yml = Bench.from_yaml('../../demos/listbench')
pre_built_yml.written
```
We can access individual tasks:
```python
pre_built_yml.tasks['product'].variant_values
```

```python
pre_built_yml.run()
```
```python
demo_bench = Bench.from_yaml('../../demos/listbench')
```
Creating a Benchmark object#
- class benchtools.runner.BenchRunner(runner_type='ollama', model='gemma3:1b', api=None)#
A BenchRunner holds information about how a task is going to be run.
- class benchtools.runner.BenchRunnerList(runners: list[BenchRunner])#
A set of runners.
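For example, a minimal sketch of constructing runners with the signatures shown above (the second model tag is only illustrative):

```python
from benchtools.runner import BenchRunner, BenchRunnerList

# default runner per the signature above: the ollama backend with the gemma3:1b model
default_runner = BenchRunner()

# a runner pointed at a different local model; this tag is illustrative
other_runner = BenchRunner(runner_type='ollama', model='llama3.2:1b')

# group the runners into a single set
runners = BenchRunnerList(runners=[default_runner, other_runner])
```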
Benchmark class#
- class benchtools.benchmark.Bench(name, base_path='.', bench_path=None, concept=None, tasks=[])#
Benchmark with multiple tasks
- bench_name#
Name of the benchmark.
- Type:
str
- bench_path#
Path to where the benchmark folder and all its content reside
- Type:
str
- task_folder#
Path to tasks folder inside benchmark folder
- log_folder#
Path to logs folder inside benchmark folder
- tasks#
- Type:
list of Task objects
- is_built#
- Type:
bool
- initialize_dir()
Build the benchmark directory.
- add_task()
Add new tasks to the benchmark.
- run()
Run one task or all tasks of the benchmark.
- classmethod from_folders(bench_path)#
Load a benchmark object from a given path. The path should point to the benchmark folder.
Parameters:#
- bench_path: str
- The path to the benchmark folder. The folder should contain the about.md file,
tasks folder and logs folder.
Returns:#
- Bench
An instance of the Bench class with the loaded benchmark.
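For example, a sketch of loading an existing benchmark folder (the path is illustrative):

```python
from benchtools import Bench

# the folder must contain about.md, a tasks folder and a logs folder
bench = Bench.from_folders('my_benchmark')
print(bench.bench_name, len(bench.tasks))
```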
- classmethod from_yaml(bench_path)#
Load tasks from a YAML file, generate Task objects, and add them to the bench.
- Parameters:
bench_path (str) – Path to the YAML file containing task templates and values.
- Returns:
self – The Bench instance with tasks populated.
- Return type:
Bench
- init_repo(bench_path)#
Initialize the benchmark folder as a git repository with a Python .gitignore.
Parameters:#
- bench_path: str
The path to the benchmark folder
- initialize_dir(no_git=False)#
Write out the benchmark folder initially.
Parameters:#
- about_text: str
Description of the benchmark to be included in the about.md file
- no_git: bool
If True, skip initializing a git repository in the benchmark folder
- new_tasks: list of tuples (task_name, task_source)
List of tasks to be added to the benchmark. Each task is represented as a tuple containing the task name and the task source.
Returns:#
- self.written: bool
True if the benchmark was successfully built, False otherwise
- classmethod load(benchmark_path)#
Load a benchmark object from a given path. If the given path has a yaml tasks file, load tasks from it, generate Task objects, and add them to the bench. Otherwise load the bench object from existing task folders.
Parameters:#
- benchmark_path: str
- The path to the benchmark folder. The folder should contain the about.md file, and either a tasks.yaml file or a tasks folder.
Returns:#
- Bench
An instance of the Bench class with the loaded benchmark.
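A short sketch, reusing the demo benchmark path from the example above; load() falls back to the task folders when no yaml tasks file is present:

```python
from benchtools import Bench

demo_bench = Bench.load('../../demos/listbench')
```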
- run(runner=<benchtools.runner.BenchRunner object>, log_dir=None, score=False)#
Run the benchmark by running each task in the benchmark and logging the interactions.
Parameters:#
- runner: BenchRunner
Define which runner should be used for the task.
- log_dir: str
Path to where the logs should be saved
- score: bool
Whether to run scoring now or not
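A sketch of a full run with an explicit runner; the log directory is illustrative:

```python
from benchtools import Bench
from benchtools.runner import BenchRunner

bench = Bench.from_yaml('../../demos/listbench')
runner = BenchRunner(runner_type='ollama', model='gemma3:1b')

# run every task with the chosen runner, log to a custom folder,
# and score the responses immediately
bench.run(runner=runner, log_dir='logs/demo_run', score=True)
```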
- run_task(target_task=None, runner=<benchtools.runner.BenchRunner object>, log_dir=None, score=False)#
Run a specific task.
- score(model=None, task=None, run='last', collate=False)#
Score the logged runs of the benchmark, collecting scores for the selected models, tasks, and runs.
Parameters:#
- model: str or list
Model to score
- task: str or list
Task to score
- run: str or list
‘last’, ‘all’, runid or list of run ids
- Returns:
score_list – list of dictionaries of scores
- Return type:
list[dict]
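A sketch of scoring logged runs; the model tag and task name are illustrative and assume a run has already been logged:

```python
from benchtools import Bench

bench = Bench.from_yaml('../../demos/listbench')
bench.run()  # score() needs at least one logged run

# score the most recent run of one task for one model
scores = bench.score(model='gemma3:1b', task='product', run='last')
for entry in scores:  # a list of dictionaries of scores
    print(entry)
```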
Task class#
- class benchtools.task.Task(task_name, template, reference=None, scoring_function=None, variant_values=None, storage_type='yaml', description=None, prompt_id_generator_fx=<function concatenator_id_generator>, format='StringAnswer', source_path=None)#
Defines a basic prompt task with a simple scoring function.
- classmethod from_dict(task_dict, prompt_id_generator_fx=<function concatenator_id_generator>, source_path=None)#
Load a task from a dictionary. The dictionary should have the following structure:
- “template”: string
- “values”: list of dicts (optional)
- “reference”: string, number, or list of strings or numbers the same shape as variant values (optional)
- “scoring_function”: string or function handle (optional)
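A minimal sketch of that structure; the field values are illustrative and the {language} placeholder syntax is an assumption about how templates are filled:

```python
from benchtools import Task

task_dict = {
    "template": "Say hello in {language}.",                       # illustrative template
    "values": [{"language": "French"}, {"language": "Spanish"}],  # one prompt per dict
    "reference": ["bonjour", "hola"],                             # same shape as values
    "scoring_function": "contains",                               # scorer used in the tiny example above
}
greeting_task = Task.from_dict(task_dict)
```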
- classmethod from_example(task_name, storage_type)#
Make a blank task.
- classmethod from_hf_dataset(task_name, hf_path, prompt_column='prompt', answer_column='canonical_solution')#
The dataset must have columns ‘prompt’ and ‘canonical_solution’ for now; this can be expanded in the future.
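For instance, the openai_humaneval dataset on the Hugging Face hub exposes exactly those two columns, so a sketch of loading it could look like this (the task name is illustrative):

```python
from benchtools import Task

humaneval_task = Task.from_hf_dataset('humaneval', hf_path='openai_humaneval')
```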
- classmethod from_txt_csv(task_path, task_name=None, scoring_function=None, prompt_id_generator_fx=<function concatenator_id_generator>, source_path=None)#
Load a template from a txt file and create task objects for each row of a csv.
The folder must contain a template.txt file with the template, and a values.csv file with the values to fill in the template and the reference answers. It can optionally have an info.yml with additional settings.
- Parameters:
task_path (string or path) – where the task files are
task_name (string) – name
scoring_function (callable or string) – how to score the task
prompt_id_generator_fx (callable or string) – overruled if an ‘id’ column is present in values.csv
source_path (string or file buffer) – path to custom code
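As an illustration of that folder layout, a sketch that writes a minimal template.txt and values.csv and then loads them; the placeholder syntax and the reference column name are assumptions:

```python
import pathlib
from benchtools import Task

task_dir = pathlib.Path('tasks/add')           # illustrative location
task_dir.mkdir(parents=True, exist_ok=True)

# template with placeholders to be filled from the csv columns (assumed syntax)
(task_dir / 'template.txt').write_text('What is {a} plus {b}?')

# one task variant per row; the reference column name is an assumption
(task_dir / 'values.csv').write_text('a,b,reference\n1,2,3\n10,5,15\n')

add_task = Task.from_txt_csv(task_dir, task_name='add')
```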
- classmethod from_yaml(source_path, task_name=None, scoring_function=None)#
Load a task from a yaml file. The yaml file should have the following structure:
- name: string
- template: string
- values: list of dicts (optional)
- reference: “calculated” or string, number, or list of strings or numbers the same shape as variant values (optional)
- scoring_function: string or function handle (optional)
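A sketch of a matching yaml file and loading it; the file name, field values, and {country} placeholder syntax are illustrative:

```python
import pathlib
from benchtools import Task

yaml_text = """\
name: capitals
template: "What is the capital of {country}?"
values:
  - {country: France}
  - {country: Japan}
reference:
  - Paris
  - Tokyo
scoring_function: contains
"""
pathlib.Path('capitals.yaml').write_text(yaml_text)

capitals_task = Task.from_yaml('capitals.yaml')
```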
- generate_prompts()#
If the task is a template-based task, generate the prompts by filling in the template with the variant values.
- get_bench_data()#
Get the data for the benchmark info file, which includes the name and storage type.
- static parse_scorer(scoring_function, source_path)#
Parse a scorer from input into a callable.
- run(runner=<benchtools.runner.BenchRunner object>, log_dir='logs', benchmark=None, bench_path=None, score=False)#
Run the task on the stated model and log the interactions.
- Parameters:
runner (BenchRunner) – define which runner should be used for the task.
log_dir (str) – Path to where the logs should be saved. If empty, a log folder will be created in the current working directory
- Returns:
response – model response(s)
- Return type:
list
- score(response, prompt_id=None)#
Score the response using the defined function.
- Parameters:
response (string) – the value to score
- write(target_path)#
Write the task.
- write_csv(target_folder)#
Write the task to a csv file with a task.txt template file.
- write_yaml(target_path)#
Write the task to a yaml file.
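A short sketch of persisting a task in either storage format; the target paths are illustrative:

```python
from benchtools import Task

tt = Task('greeting', 'Hello there', 'hi', 'contains')

# yaml storage: one self-contained file per task
tt.write_yaml('greeting.yaml')

# csv storage: a folder holding the template file and a values csv
tt.write_csv('greeting_task')
```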
BetterBench#