

Context
For three weeks in 2015, researchers asked commuters in London not to walk up the escalator, in order to find out the optimal strategy at rush hour. This seems like a reasonably easy thing to model and simulate without annoying anyone for three weeks. But is it?
In this project, we ask different AI models to do the work of researchers... in less than 5 minutes. The objective is to challenge the claims that large language models can now fully automate scientific research. There are several dimensions to the research:
- a modelling challenge (passenger flow)
- a simulation challenge (high-level game engine)
- a rendering challenge (UI)
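To give a sense of the modelling question, here is a minimal back-of-the-envelope sketch of escalator throughput. It is not the model used by any of the implementations in this project. The function names, the parameters (`escalatorSpeed`, `standOccupancy`, `walkOccupancy`, `walkSpeed`, `walkerDemand`) and all numeric values are hypothetical placeholders, and the only assumption is that the flow of a lane equals people per step times steps per second.

```javascript
// Back-of-the-envelope escalator throughput, in people per second.
// Every parameter value used below is a hypothetical placeholder, not a measurement.

// Flow of one lane = people per step * steps reaching the top per second.
function laneFlow(occupancy, stepsPerSecond) {
  return occupancy * stepsPerSecond;
}

// Strategy A: both lanes are used by standing passengers.
function standOnlyFlow({ escalatorSpeed, standOccupancy }) {
  return 2 * laneFlow(standOccupancy, escalatorSpeed);
}

// Strategy B: one standing lane, one walking lane.
// Walkers move faster but are assumed to leave more empty steps between them,
// and the walking lane cannot carry more people than actually want to walk.
function standAndWalkFlow({
  escalatorSpeed,
  standOccupancy,
  walkOccupancy,
  walkSpeed,
  walkerDemand = Infinity, // people per second willing to walk (assumed unlimited by default)
}) {
  const standing = laneFlow(standOccupancy, escalatorSpeed);
  const walkingCapacity = laneFlow(walkOccupancy, escalatorSpeed + walkSpeed);
  return standing + Math.min(walkingCapacity, walkerDemand);
}

// Hypothetical numbers: 2 steps/s escalator, 1 person every 2 steps when standing,
// 1 person every 4 steps when walking, walkers add 1 step/s.
const params = { escalatorSpeed: 2, standOccupancy: 0.5, walkOccupancy: 0.25, walkSpeed: 1 };
const standOnly = standOnlyFlow(params);
const standAndWalk = standAndWalkFlow(params);
console.log(standOnly, standAndWalk); // prints: 2 1.75
```

With these made-up numbers, standing on both sides moves more people, but the result flips if walkers are assumed to pack more tightly or walk faster. Choosing and justifying those parameters is the actual modelling work.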
Methodology
LLMs, and reasoners in particular, can be extremely helpful to the extent that you provide sufficient context. The prompt below has been used for all the model outputs presented in this project.
The responses are screened for security and pasted into the app. If the code in the first response does not work, we ask the LLM to fix the issue until it renders. The number of queries is clearly indicated in the relevant tab.
The project also includes my own implementation (using AI assistants). It’s a work in progress, which I iterate on when I find time. As of Jan 31, 2025, I have spent around 40 hours working on the PITTI implementation. I believe that it is good enough to be shared, but there are critical flaws on the modelling side (the graph approach is neither appropriate nor well implemented). I will consider better options.
The app with all proposals (AI and human ones) can be run locally. See information in the README file.
Preliminary conclusions
While the models are definitely useful to lay the foundations, the claims that AI models can already fully automate research seem largely overblown. Here, the math is trivial and any undergrad with a math background would find a way to incorporate it in the modelling. It is basically about making the right choices. And on the UI side, it is also clear that you still need human involvement to piece everything together and give models a little nudge when they start going off-track.
To illustrate the takeaways of this project, I mapped each model output (very subjectively and in the most unscientific way possible):
- X-axis: for the UI
- Y-axis: for the reasoning
In each case, the projects are assessed on a scale from 0 to 10. Anything below 5 means that the model did not really understand the objectives; anything above 8 means that the objectives are met, although there is still room for improvement.
For transparency, I later added a Z-axis to represent time spent. This dimension should not be ignored.

Next steps
Add more models and refine the human approach.
To contribute to the project (both human and AI suggestions are welcome):
- Create a new component in src/models
- Import it in App.jsx (should be self-explanatory)
- Submit a PR
Feel free to propose alternative prompts for this project.
