The paper introduces AMemGym, an on-policy interactive benchmark for evaluating assistant memory over long-horizon conversations. It identifies a "Reuse Bias" in static evaluations and shows that current large language models handle immediate facts well but falter when long-term memory is required. Traditional memory benchmarks are off-policy: they replay fixed chat logs, which fail to capture the dynamic nature of real interactions and can yield misleading model rankings and poor configuration choices. AMemGym instead tests on-policy, letting the assistant converse with a simulated user in real time so that its responses shape subsequent exchanges. This setup surfaces memory lapses as they occur and measures how well an assistant adapts to a user's changing preferences over time.
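To make the on-policy idea concrete, the loop below is a minimal sketch of such an evaluation: a scripted simulated user states a preference, later revises it, and then probes the assistant's memory. All class and function names here are illustrative assumptions, not AMemGym's actual API.

```python
class SimulatedUser:
    """Scripted user; a real on-policy simulator would condition each
    turn on the assistant's previous reply rather than follow a fixed
    script."""
    def __init__(self):
        self.turns = [
            "I prefer tea.",                    # initial preference
            "Actually, I prefer coffee now.",   # preference update
            "What drink should you bring me?",  # long-term memory probe
        ]
        self.i = 0

    def next_message(self, assistant_reply):
        if self.i >= len(self.turns):
            return None  # episode over
        msg = self.turns[self.i]
        self.i += 1
        return msg


class ToyAssistant:
    """Assistant with a single memory slot for the user's drink preference."""
    def __init__(self):
        self.preference = None

    def respond(self, message):
        words = message.rstrip(".?").split()
        if "prefer" in words:
            # Overwrite memory with the most recently stated preference.
            self.preference = words[words.index("prefer") + 1]
            return "Noted."
        if message.startswith("What drink"):
            return self.preference or "I don't know."
        return "OK."


def run_episode():
    """Run one interactive episode; the assistant's replies feed back
    into the user simulator, making the evaluation on-policy."""
    user, assistant = SimulatedUser(), ToyAssistant()
    reply, transcript = None, []
    while (msg := user.next_message(reply)) is not None:
        reply = assistant.respond(msg)
        transcript.append((msg, reply))
    return transcript


if __name__ == "__main__":
    for msg, reply in run_episode():
        print(f"user: {msg}\nassistant: {reply}")
```

The probe at the end scores whether the assistant tracks the *updated* preference ("coffee") rather than the stale one, which is exactly the kind of lapse an off-policy replay of fixed logs cannot elicit.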
