Recovering the underlying SAE vectors from Goodfire's API

My Apart Hackathon project

Nov 26, 2024

Last weekend I participated in Apart’s hackathon on what is possible with Goodfire’s API/SDK. For those with no prior knowledge of mechanistic interpretability or sparse autoencoders (SAEs), I can recommend this blog post by Scott Alexander. Goodfire is a mech interp start-up that has created its own SAEs and built various tools on top of them to help people steer and understand LLM behaviour. I recommend skimming their documentation to see what the API can do!

My project was to see whether one can reconstruct the underlying SAE vectors from the information the API exposes. You can find the report and codebase for my project on the Apart website. Quoting the abstract:

In this project, we carry out an early trial to see whether Goodfire’s SAE feature vectors can be recovered using the information available from their API.
The strategy is: pick a feature of interest, construct a contrastive dataset using Goodfire’s API, then use TransformerLens to get a steering vector for the contrastive dataset, by simply calculating the average difference in the activations in each pair.
The early trial produced underwhelming results. However, I am confident one could reconstruct the underlying SAE feature vectors with further experimentation and/or better techniques for obtaining the steering vector from a dataset.
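
To make the "average difference in the activations in each pair" step concrete, here is a minimal sketch of that calculation. It is not the code from my report: the activations are synthetic stand-ins generated from a made-up `true_direction`, and the helper names `mean_difference` and `cosine_similarity` are hypothetical. In the real pipeline the two activation lists would come from running each half of a contrastive pair through the model with TransformerLens and reading the residual stream at a chosen layer.

```python
import math
import random


def mean_difference(pos_acts, neg_acts):
    """Average of (positive - negative) activations over all contrastive pairs."""
    if not pos_acts or len(pos_acts) != len(neg_acts):
        raise ValueError("need the same non-zero number of positive and negative activations")
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [
        sum(p[i] - q[i] for p, q in zip(pos_acts, neg_acts)) / n
        for i in range(dim)
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Synthetic stand-in for model activations (assumption: in practice these
# would be residual-stream activations for each prompt in a contrastive pair).
rng = random.Random(0)
dim, n_pairs = 16, 200

# Hypothetical "ground truth" feature direction we pretend not to know.
true_direction = [rng.gauss(0, 1) for _ in range(dim)]

# Negative examples: feature absent. Positive examples: same activation
# plus the feature direction and a little noise.
neg_acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]
pos_acts = [
    [x + 2.0 * d + rng.gauss(0, 0.1) for x, d in zip(neg, true_direction)]
    for neg in neg_acts
]

steering_vector = mean_difference(pos_acts, neg_acts)
similarity = cosine_similarity(steering_vector, true_direction)
print(f"cosine similarity with the true direction: {similarity:.3f}")
```

On this toy data the recovered vector lines up closely with the planted direction, because each pair differs only by that direction plus noise. Real contrastive pairs differ in many other ways too, which is one plausible reason the simple averaging approach can underperform in practice.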

It's Not Obvious
