Evals best practices

Notes from a talk by Apollo's co-founder

Dec 06, 2024

Manually reading through logs of LLM outputs is one of the best things you can do as a researcher.
Do MVPs in all stages of the project as fast as possible. Iterate fast. Get feedback regularly. Focus on what will give you most information, both about the AI systems you are trying to learn about, but also about how to carry out your project.
Write things down in detail. Many benefits.
Use LLMs to help in all parts, but their helpfulness does vary in different stages. Help the LLM help you by providing context - this is one benefit of writing things down in detail!
Upfront conceptual work matters. It is easy to jump into an evaluation project, but at the end of it you do not actually learn anything useful because you did not measure what you actually cared about.
- But balance with other reasons for doing evals, e.g. upskilling.

Introduction

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty.

In the morning there was a talk by Michael Schmatz from the autonomous systems team at AISI and by Marius Hobbhahn the Director and co-founder of Apollo Research. I already summarized my learnings from the Hackathon and Michael’s talk. This post is my summary of Marius’s talk.

Contents of talk

Tips from Dan
Threat modelling
Specifying the evals
Making the evals
Running the evals
Evaluating the evals
Distributing the results
General tips

Tips from Dan

[In my notes, I did not record who Dan is]

Design evals so you can detect a 1% improvement. Do not want things to have huge discontinuous jumps
Use a good metric
- Where possible, use existing standard metrics
- Want it to be understood by people in under 10 seconds
Make evals comparable over time, i.e. dont change your evals (too much) over time
Make sure the evals are hard enough
- But there are exceptions. E.g. want to make point that LLMs are already capable or dangerous. ‘Hey look, LLMs get 90% on this dangerous thing people do not know about.’
- Often 5 to 10% (or more precisely slightly above chance) is what to aim for in your evals

Comment from Marius about these tips from Dan
- With agentic tasks, not always possible to get all these.
- Cannot run agentic evals many times on many slight variations, so hard to detect 1% improvements.
- Something about if you have private evals (as opposed to open source evals), you can easily do fair comparisons over time

Threat modelling

Know what you fundamentally care about, regardless of your safety stance.
- “I want to measure SWE skills” vs “I care if junior SWEs can be replaced”
Have multiple scenarios in mind for how your outcome can happen.
Create graph (in sense of nodes and edges) of capabilities and paths to reach final outcome you care about.

When doing this threat model, you will get huge graph. Hence you need to refine and think about which parts of the graph you are most uncertain about, and which evals will give you the most information given your threat model.

Specifying the evals

Describe the property you’re trying to measure in detail. Describes examples and non-examples of what would be included in the eval.
Good questions to ask of your evals:
- Would it change your mind if models scored highly.
- Is there clear ‘red line’ for the evals, where the AI system surpassing ‘red line’ threshold gives you or others reason to react.
Red-team your specification
- Do this yourself and with somebody else critiquing the specs.
- Iterate until you have a clear specification and some good examples.
BIG emphasis on writing all this down.
- Aim is to have enough detail so an SWE/MLE can implement the evals without having to ask you any questions. This includes details like which packages you will use for different parts of the pipeline.
- By making all this explicit, you can iterate on design choices without having to do any implementation, pre-empt complications in advance, and produce better estimates for how long and complex the project will be.
- The more you write down, the more LLMs can help you create the evals!
Specifying the eval can take many months
- This is long, but the alternative is to under-specify the eval and prematurely commit months of effort creating and running evals whose results are not actually informative.

Making the eval

Create MVP (minimum viable product)

Do this as fast as possible. Make shittiest thing that runs to end to end. Will help you discover various breaking points. E.g. maybe you assumed a certain package would solve one of your problems, but MVP reveals you need to write code from scratch.
Aim for fast empirical feedback
Sometimes you need to make dramatic pivot
- E.g. if SOTA models can just do the task
- E.g. certain component of evals makes no sense

Build full evals

Includes vetting by others. See Q&A process from METR. [I am not sure what this refers to - closest I could find is METR’s Task Development Guide.]
- Roughly having somebody else questioning you and giving you feedback.
Can include human baselines. [The METR Task Development Guide has some advice on this.]
Should include many minor variations to the evals
- E.g. how much context about the task do you provide in prompts, like whether you include “You are being monitored” or not in the prompt.
- Allows more continuous measures.
- [Worth seeing METR’s elicitation guide for variations to try to maximize the maximum performance you elicit from the AI system.]
Often a challenge is getting past ‘measuring zero’. You can get stuck in situation where AI system is scoring zero on task and hard to know how far you are away from getting non-zero score.

There was question from audience about whether you should vary all the parameters available (e.g. temperature). Marius said that you are limited by time/compute so pick those variables that you consider most important to vary. Also repeated idea of iterating quickly - if you are not sure just pick an arbitrary choice and go with it, and update with new information.

Run the evals

Run eval multiple times. Check stats so you know how many runs you need for various outputs or conclusions
Look at both qualitative and quantitative results
Manually read through logs. This is best thing you can do as researcher. Only way to learn how to change things.
- Example of Sonnet not being deceptive despite other models being deceptive, according to the quantitative metrics. Only when looking in detail, found that Sonnet just did the misaligned actions openly, without any deception, but revealing that it considered hiding its motives. Such nuance often not captured in metrics
Automatically process all the logs using LLMs. Ask LLM to find things that you are interested in, based on your own observations or what you would consider surprising.
Question about how much to fine tune scaffolding/prompts
- Depends on whether it is propensity vs ability.
- E.g. your question could be ‘Given a month of dedicated effort, how much can LLM help develop bio weapons or new virus or…’.

Evaluating the evals

What do the results of the evals tell you? Are you measuring what you think
What are alternative hypotheses or confounders?
- Suppose you are measuring power-seeking. There is high chance that power-seeking is correlated with other goals, given your evaluation setup, like wanting to take risky actions, or propensity for illegal actions, or being sycophantic, or …
- Think of new experiments as appropriate.
Run ablations, to check which parts of the pipelines are causally responsible for the results.
- For example, do not want your result to depend on precise wording of prompt
- Story of how in one experiment, their first prompt got the behaviour they measuring, counter to their expectations. But every other wording didn’t get result.
Run experiments that reduce uncertainty. This will change at various phases of the project.

Distribute the results

When interpreting and describing your results:

Be as accurate as possible. If the results are negative, report that.
Avoid biases. E.g. you will have incentives to tell a particular story.
- Story of an org presenting an inaccurate story about bio evals they did with open source models. The org lost reputation as a result, and are now having to work extra hard and diligently to re-build the trust.
Provide the context. What are the relevant threat models.

When publicizing:

Consider infohazards. In particular, open sourcing your evals will lead to future data contamination, so your evals become less useful over time.
Who is target audience?
Make it as easy as possible for non-technical people to understand

There was question about how people can trust your conclusions if the evals are not open-sourced. Marius made two points:

There is trade-off you have to make between how easy it is to believe your results and how useful your evals will be in the future.
In Apollo’s case, there are select stakeholders that review their evals in detail, to help build trust in their results.

General tips

Use LLMs to help with every step

Write a lot and dump in context. Help the LLM help you.
Currently not great with threat modelling. Good with specification, good with implementation with enough setup.
In short, always try and then stop if it does not look good.

Conceptual work matters a lot

Marius underestimated this. It is easy to just do an eval and then after a month of effort, you realise it does not help with anything.
Maybe fine for instrumental reasons (e.g. upskilling) to do just some evals without solid conceptual detail.

Write the eval out in a lot of specificity before starting

Iterate in theory before iterate in practice. It is faster.

Don’t use temp=0

Higher temperature and run more often
‘Weird stuff happens at temperature 0’

Write automated tests

For both individual parts, and also end-to-end

Sprint to MVP at almost every step of the process

Average across reasonable phrasings

Do not worry about having one single prompt or wording. Just do all of them and average

Its hard to make good pre-cursor evals

As much as possible, directly measure the thing you care about.

Very difficult to make good causal graphs

It's Not Obvious

Discussion about this post