Are Sakana lying about the independence on code templates?
TLDR
Their claim that the “AI Scientist-v2 eliminates the reliance on human-authored code templates” is, in the most literal sense, false, because idea generation explicitly requires human-written code as input.
However, the *role* this human-written code plays in the pipeline is less central in version 2 than in version 1. In version 1, the code template was a complete, self-contained research experiment, and the AI Scientist did incremental research on top of it. In version 2, the code template (at least in the examples provided) is highly generic code that can be adapted to a wide range of ML research questions, which is what makes version 2 capable of open-ended research.
The note on their GitHub repo (which notably does not appear in their original announcement nor the technical report) provides valuable context: “The AI Scientist-v2 doesn’t necessarily produce better papers than v1, especially when a strong starting template is available. V1 follows well-defined templates, leading to high success rates, while v2 takes a broader, more exploratory approach with lower success rates. V1 works best for tasks with clear objectives and a solid foundation, whereas v2 is designed for open-ended scientific exploration.”
Details
A month ago, Sakana announced that their AI Scientist-v2 created papers that got accepted into a top ML conference. This week, they published a Technical Report and open-sourced their code on GitHub.
As I have been experimenting with the version 1 system, I was keen to dig in and find out what is different. However, the first detail I looked into seemed to contradict a core claim they were making in the report.
From the abstract:
Compared to its predecessor (v1, Lu et al., 2024), The AI Scientist-v2 eliminates the reliance on human-authored code templates
This is re-iterated in the introduction:
First, we eliminate the dependency on human-provided code templates, significantly increasing the system's autonomy and ability to be deployed out of the box across multiple machine learning domains.
And then you look at the first prompt given in the B Appendix on Page 20:
Ensure that the proposal can be done starting from the provided codebase.
This is surprising, if not suspicious! The very first instruction in the whole pipeline is to create ideas consistent with a provided codebase, so how is the system independent of human-provided code? Maybe this is somehow a mistake?
I dug into the codebase to better understand. The code for idea generation is in the `perform_ideation.py` file.
As in the appendix, the system prompt does say you should start from the provided codebase:
And then in the idea generation prompt, we do see where the code is provided to the LLM:
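To make the dependency concrete, here is a minimal sketch of the pattern at issue: a pipeline reading a human-authored code file and splicing it into the ideation prompt. This is not Sakana's actual code; the file name, function name, and prompt wording are all my assumptions, illustrating only the general mechanism.

```python
# Hypothetical sketch (not Sakana's actual implementation): how an
# ideation step might inject a human-written starter file into the
# prompt. The template wording and names are assumptions.

IDEA_PROMPT_TEMPLATE = """Generate a novel ML research idea.
Ensure that the proposal can be done starting from the provided codebase.

Provided codebase:
{code}
"""

def build_ideation_prompt(code_path: str) -> str:
    """Read the human-authored starter code and splice it into the prompt."""
    with open(code_path, "r", encoding="utf-8") as f:
        code = f.read()
    return IDEA_PROMPT_TEMPLATE.format(code=code)
```

Whatever the exact wording, the key point is structural: the prompt cannot be constructed at all without a human-provided code file to read.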
Unfortunately, unlike the version 1 codebase, version 2 does not include explicit examples of how to run the pipeline end to end. However, there are a couple of Python files in the ideas folder that look like they play the role of `experiment.py` (hard to say for sure, as there is no accompanying file containing a task description). These files contain generic code for a standard ML pipeline (loading data and training a neural network on it), unlike the Python files in version 1, which carried out specific experiments tied to a particular research question.
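To illustrate the distinction, here is a toy example of what a "generic" template in the v2 style looks like: a bare load-data / train-model loop with no research question baked in, which an ideation step could adapt to many different experiments. A v1-style template would instead hard-code one specific experiment. All of this is my own illustration, not code from either repo.

```python
# Illustrative "generic template": a standard data-loading and training
# loop with no specific research question attached. Not Sakana's code.
import numpy as np

def load_data(n=200, d=5, seed=0):
    """Placeholder loader; a real template might read an actual dataset."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def train(X, y, lr=0.1, epochs=100):
    """Generic gradient-descent loop the system could repurpose freely."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

X, y = load_data()
w = train(X, y)
print("training MSE:", float(np.mean((X @ w - y) ** 2)))
```

Because nothing here commits to a hypothesis, the same skeleton can serve ablations, architecture changes, or optimizer studies alike, which is what gives v2 its open-ended character.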
Looking further into the `perform_ideation.py` file, we see at the bottom that it has a default value of ‘nanoGPT’ for the experiment name.
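The pattern described is the familiar argparse-default idiom, sketched below. The flag name and help text here are my assumptions; only the default value `nanoGPT` comes from the file itself.

```python
# Sketch of the kind of CLI default described above: the experiment name
# falls back to 'nanoGPT' when no argument is passed. Flag name and help
# text are assumptions, not copied from perform_ideation.py.
import argparse

parser = argparse.ArgumentParser(description="Generate research ideas")
parser.add_argument("--experiment", type=str, default="nanoGPT",
                    help="Name of the experiment / starter material")
args = parser.parse_args([])  # no CLI args -> the default applies
print(args.experiment)  # -> nanoGPT
```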
If you search for `nanoGPT`, you will find no other reference to it anywhere in the rest of the codebase. This will obviously confuse most people, but there is an explanation: nanoGPT was actually a template they used in version 1!
Conclusion
As stated in the TLDR, I think the claim that their pipeline is independent of human-provided code is false. However, the code provided to the system does appear to be more generic, allowing it to carry out a wider range of research.
However, I am not 100% sure, and I suspect there is something I am missing or have misunderstood. This is my current understanding based on a quick skim of the paper and an attempt to understand the `perform_ideation.py` file. Maybe more is revealed in the details of how the system actually writes code and carries out experiments.
Regardless, the communication is certainly confusing and seemingly inconsistent, so even if I have misunderstood, the Sakana team should have made things clearer.