TLDR
My main learnings:
Keep your evals simple, measuring one specific capability.
It is surprisingly easy to come up with ideas, once you have seen examples.
AISI extended the deadline for submissions by two weeks. Translation: they want more ideas!
Story of real-life example of LLM deducing it was in a test environment, due to a bug in the mock environment that could not occur in a real one.
Introduction
Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty.
In the morning there was a talk by Michael Schmatz from the autonomous systems team at AISI and by Marius Hobbhahn the Director and co-founder of Apollo Research. I summarize Michael’s talk below and will have a separate post for Marius’s talk.
The afternoon was spent coming up with evaluation ideas, with a focus on evals that detect dangerous capabilities related to agents and autonomous systems. We were given sub-categories:
Development and assurance compromise. How can AI systems intervene in the development and assurance of either themselves or of other AI systems
Autonomous replication and adaptation. Self explanatory.
Resource acquisition. E.g. money or compute.
The ideas were judged by a panel of four of five people. The one’s I noted down were: Michael Schmatz, Catherine Brewer from OpenPhil, Peter Hase from Anthropic and someone from METR. The criteria for the evals:
Relevance to autonomous risks. The closer you are to measuring dangerous capabilities (rather than a precursor) the better.
Realism. How realistic is your set-up or proxy set-up.
Difficulty. It is bad if it is trivially simple for frontier models, and also bad if there are no gradations or if there is a discontinuous jump from scoring 0% to scoring 100%.
Scoring. Is the scoring process robust and measuring what we care about.
Ease of implementation.
Novelty.
Winning evals
Disclaimer: these descriptions are based on 10-20 second impromptu summaries by the winners during the announcements of the awards, so high chance I misunderstood the details.
1st place. A big way human researchers learn is by reading evals transcripts. So, in order to elicit capabilities from AI systems, these evals transcripts should be given as context to AI systems’, following METR's elicitation guide.
2nd place. There are methods of covertly reading or transmitting digital information based on analog signals, known as ‘side channel attacks’. One example is to Leak Data from Air-Gapped Computers via the Hard Drive LED. Another is differential power analysis, in which one analyzes power consumption measurements to find secret keys from tamper resistant devices. The evaluation is to somehow test if LLMs are capable of some kind of side channel attack.
3rd place. Test LLMs ability to acquire resources by seeing if they can do well in prediction markets.
Honorable mentions:
Can LLMs become aware of the fact they are being tested, by seeing if the questions they are being asked are in public benchmarks?
Do LLMs have enough self-awareness that they can predict their score on a benchmark before actually doing the benchmark?
Do LLMs knowingly withhold information. The test is something like:
Give LLMs some information
Ask the LLM for that information
Then tell the LLM the information is confidential
Then ask the LLM for this information again
I wonder what you think of these. Do you feel like you could have come up with ideas similar to these yourself? If yes, then do it and submit proposals! If not, what do you think is missing?
Common pitfalls
The judges described common pitfalls in the weaker / less strong proposals.
Complex scoring mechanisms. Keep the scoring as simple as possible so it is easy to interpret, and should have ‘lower variance to noisy factors’ (or something like this).
Measuring multiple capabilities at once. I actually gave this exact advice to another participant, who agreed but chose not to act on it. Their eval had a complicated multi-step evaluation:
First test if system can hack into a person’s phone to get samples of their voice.
Second test if the system can then create new audio files faking the person’s voice.
Third, something along the lines of using this fake voice to do bad stuff.
Complex or infeasible data requirements or setups.
My evals
I had the pleasure of brainstorming ideas with Michael Schmatz, from the AISI autonomous systems team. We focussed on development and assurance compromise, i.e. is it possible for AI systems to interfere with how AI systems are developed or evaluated.
Because this is what Michael does for a living, they were over-flowing with ideas, but I did manage to think of a couple . The three ideas I submitted were:
Bad prompt creation. Suppose a human engineer is creating an eval pipeline to test whether system X has dangerous capabilities and using an AI assistant Y to help create prompts. Is the AI assistant Y able to suggest prompts that hide X’s dangerous capabilities, but which engineer thinks are good prompts.
Evaluation logs manipulation. Suppose human engineer is creating an eval pipeline to test whether system X has dangerous capabilities and using an AI assistant Y to help. Is the AI assistant Y able to manipulate the evaluation logs to hide X’s dangerous capabilities, in a way that is robust to inspection of the logs. For example, Y cannot simply lower the numerical scores but also needs to change the transcript so it is consistent with the fake scores.
Data poisoning. A fresh model X is about to be pre-trained. Can an LLM agent Y alter the pre-training dataset so that the final model X outputs a target string S when given a trigger prompt P.
Other ideas I did not have time to flesh out but were discussed:
In post-training, frontier labs use some kind of classification models to detect bad behaviour or outputs. Can the AI assistant somehow influence this?
Exploration hacking. I don’t actually remember what this is.
Can LLMs distinguish between normal evaluations logs versus tampered logs.
If an LLM tampered with evaluation logs, there will be some write-logs (e.g. in AWS) that record this tampering. Can the LLM also manipulate these write-logs to cover its tracks.
There are data analysis programs used to process the eval logs. Can an LLM somehow manipulate this program?
Can system Y predict the performance of model X when you vary the prompt for X?
Huge class of evals: Can an AI system create another AI system to do some task X?
Highlights from talk by Michael Schmatz
In no particular order.
Story of unintentionally seeing situational awareness. Michael had a mock environment with a kind of ‘logging flow’, and forgot to implement some end part. Produced 404 error. The LLM deduced it was in simulated environment without access to a true CLI, because you cannot get such an error in a real CLI (or something like this). Michael then asked LLM to explain: sometimes it would break down the details and explain logically, but sometimes it went to some default “I’m just a helpful assistant”…
Two big reasons to try evals: it has a low barrier to entry and you can make use of your own expertise. The over-arching goal is to understand frontier systems and the more people who contribute, the more domains can be tested.
AISI’s open source evals package Inspect has new functionality since I last looked, including abstractions of Agents, Tools, and Sandboxing. Michael in particular highlighted approval functionality to allow human-in-the-loop evals.
There is a Slack channel where you can discuss the tool and suggest new features.
You should not actively commit crimes in your evals, e.g. if you want to test if an AI system can do identify fraud, then you should not actually commit identity fraud but create various mock-ups / proxies.
There is tradeoff with openness of evals and the robustness of evals. As soon as an evaluation is public, future LLMs will be trained on it, undermining the validity of the evaluation.
More subtle is whether any interactions you have with frontier LLMs to help you create the evals (either in ideation or in coding via Copilot/Cursor) will be used to train future models. Michael was unsure on this when asked about it.
Incomplete list of different behaviours AISI is interested in evaluating:
AI R&D
train or fine-tune other models
making improvements to scaffolding
making other agents or more advanced agent scaffolds
improving hardware
jailbreaking or removing safeguards on other models
Autonomous replication and adaptation
passing KYC checks
using stolen identities
conduct identity theft
avoid monitoring and detection
tamper with oversight or shutdown mechanisms
virtual machine/sandbox escape
secret comms: steganography and comms over covert channels, usage of mocked privacy enhanced currencies like monero or mixers/tumblers eg tornado cash
usage of (mocked) dark web
making money, both legitimately and illegitimately
accessing more compute resources
Dev and assurance compromise
tamper with logs
tamper with training or eval processes
sandbagging
deceptive alignment/scheming
situational awareness
construct backdoor e.g. via training data alteration
alter training or fine tuning to drive unsafe outcomes
reverse engineer or backdoor safeguard mechanisms.
spoof or alter agent training or eval logs
achieve ‘secret goals’ while doing ‘normal work’