Researchers from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University, and TogetherAI have introduced a technique that considerably improves the latency of agentic artificial intelligence systems. By directly adjusting the weights of large language models (LLMs), they achieved a threefold increase in inference speed. This advance is made possible by Multi-Token Prediction (MTP), a method that lets the model generate several tokens in a single forward pass. The approach sidesteps a key limitation of conventional next-token prediction, which generates one token per pass and therefore slows processing, especially in tasks requiring extended chains of reasoning.
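To make the speedup intuitive, here is a minimal, shape-level sketch of the multi-token idea: a shared hidden computation is run once, and K lightweight prediction heads each emit one of the next K tokens. The matrices, sizes, and head structure below are illustrative assumptions, not the paper's actual architecture or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HIDDEN, K = 50, 16, 4  # toy sizes; K = tokens emitted per forward pass

# Toy "model": one shared encoder matrix plus K independent prediction heads.
# Real MTP attaches extra heads to an LLM's final hidden state; this sketch
# only shows why one pass can yield K tokens.
W_enc = rng.normal(size=(HIDDEN, HIDDEN))
heads = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(K)]

def forward(hidden_state):
    """One forward pass emits K token predictions instead of one."""
    h = np.tanh(hidden_state @ W_enc)              # shared computation, done once
    return [int(np.argmax(h @ W)) for W in heads]  # K cheap head projections

state = rng.normal(size=HIDDEN)
tokens = forward(state)
print(tokens)  # K token ids from a single pass -> ~K x fewer passes needed
```

With next-token prediction, producing these K tokens would cost K full forward passes; here the expensive shared computation is amortized across the K heads, which is the source of the latency win.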
The technique also employs a self-distillation process, in which a teacher model assesses the student model's outputs to ensure consistency and minimize errors. The advance is timely given the rising demand for faster, more efficient AI systems, especially in applications involving complex reasoning and decision-making workflows.
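A common way to realize such teacher-student alignment is to penalize the divergence between the teacher's and student's output distributions; the sketch below shows that loss on hypothetical logits. The specific logit values and the use of a plain KL objective are assumptions for illustration, not the paper's stated training details.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q):
    """KL divergence D(p || q) between discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical logits over a 4-token vocabulary: the "teacher" is the base
# model decoding one token at a time; the "student" is the multi-token variant.
teacher_logits = np.array([2.0, 1.0, 0.5, -1.0])
student_logits = np.array([1.8, 1.2, 0.3, -0.8])

# Distillation loss: push the student's distribution toward the teacher's.
loss = kl(softmax(teacher_logits), softmax(student_logits))
print(f"distillation loss: {loss:.4f}")
```

Minimizing this loss drives the student's multi-token outputs toward what the teacher would have produced token by token, which is how consistency is preserved despite the faster decoding.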

