a16z’s @stuffyokodraws has developed TetrisBench, a benchmark that evaluates large language models (LLMs) through the game of Tetris and reveals distinct playing styles among models. The project reframes Tetris as a coding task: instead of reasoning turn by turn, each LLM writes a deterministic scoring function that ranks candidate moves. The results show that while LLMs generally outperform humans in structured scenarios, they struggle in off-distribution situations where human players rely on intuition and “controlled chaos.” Notably, the models exhibit distinct personalities, with some opting for aggressive early plays and others maintaining conservative board tactics, further highlighting the differences in long-term strategic decision-making between humans and AI.
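To make the “deterministic scoring function” idea concrete, here is a minimal sketch of the kind of function an LLM might emit to rank candidate placements. This is an illustration only, not TetrisBench’s actual code: the features (aggregate height, holes, bumpiness, lines cleared) and their weights are assumptions drawn from classic Tetris heuristics.

```python
# Illustrative sketch, NOT TetrisBench's actual implementation.
# A board is a list of rows (top row first); 1 = filled cell, 0 = empty.

def column_heights(board):
    """Height of each column, measured from the bottom of the board."""
    rows, cols = len(board), len(board[0])
    heights = []
    for c in range(cols):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r  # first filled cell from the top sets the height
                break
        heights.append(h)
    return heights

def count_holes(board):
    """Empty cells with at least one filled cell above them in the same column."""
    rows, cols = len(board), len(board[0])
    holes = 0
    for c in range(cols):
        seen_block = False
        for r in range(rows):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes

def score_board(board, lines_cleared=0):
    """Deterministic score for a resulting board state; higher is better.

    Weights are illustrative, not tuned: reward cleared lines, penalize
    tall stacks, buried holes, and uneven surfaces.
    """
    heights = column_heights(board)
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return (10.0 * lines_cleared
            - 0.5 * sum(heights)
            - 2.0 * count_holes(board)
            - 0.3 * bumpiness)
```

Because the function is deterministic, every candidate placement can be simulated and scored, and the highest-scoring move is played; the model’s “personality” then lives entirely in the code it wrote, not in per-turn sampling.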
TetrisBench reveals distinct personalities in LLM gameplay
