The paper introduces AMemGym, an on-policy interactive benchmark for evaluating assistant memory over long-horizon conversations. The benchmark exposes a "Reuse Bias" in static evaluations and shows that current large language models handle immediate facts well but falter when long-term memory is required. Traditional memory benchmarks have been off-policy, replaying fixed chat logs that fail to capture the dynamic nature of real interactions, which can produce misleading rankings and poor configuration choices. By shifting to on-policy testing, AMemGym has assistants converse with a simulated user in real time, where each response shapes the subsequent exchanges. This setup surfaces memory lapses as they compound and measures how well an assistant adapts to a user's changing preferences over time.
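To make the off-policy/on-policy distinction concrete, here is a minimal sketch of the two evaluation loops. All names (`Assistant`, `SimulatedUser`, the two `evaluate_*` functions) are hypothetical illustrations, not the paper's actual API.

```python
"""Sketch: off-policy (fixed-log replay) vs. on-policy (interactive) memory
evaluation. Hypothetical interfaces, not AMemGym's real API."""

from typing import Optional, Protocol


class Assistant(Protocol):
    def reply(self, utterance: str) -> str: ...


class SimulatedUser(Protocol):
    # Returns the next user turn, conditioned on the assistant's last reply;
    # None signals the end of the conversation.
    def next_utterance(self, assistant_reply: Optional[str]) -> Optional[str]: ...


def evaluate_off_policy(assistant: Assistant, fixed_log: list[str], probe: str) -> str:
    """Off-policy: replay a fixed chat log. The assistant's replies never
    influence what it sees next, so memory errors cannot compound."""
    for utterance in fixed_log:
        assistant.reply(utterance)
    # Probe for a fact the assistant should have remembered.
    return assistant.reply(probe)


def evaluate_on_policy(assistant: Assistant, user: SimulatedUser, probe: str) -> str:
    """On-policy: the simulated user conditions each turn on the assistant's
    previous reply, so early memory lapses shape later exchanges."""
    reply: Optional[str] = None
    while (utterance := user.next_utterance(reply)) is not None:
        reply = assistant.reply(utterance)
    return assistant.reply(probe)
```

In the on-policy loop, a forgotten preference early on propagates into every later turn the simulated user generates, which is precisely the failure mode a fixed-log replay cannot reveal.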
