
Why LLMs Struggle in Multi-Turn Conversations: Insights from a 2025 Study

May 16, 2025
4 min read

On May 14, 2025, Elvis (@omarsar0) shared a study titled “LLMs Get Lost In Multi-Turn Conversation” by Microsoft and Salesforce Research, published on arXiv on May 12, 2025. The study highlights a critical limitation of Large Language Models (LLMs): they struggle significantly in multi-turn conversations, showing an average performance drop of 39% compared to single-turn interactions. Let’s dive into the findings, their implications for developers, and actionable takeaways.

The Study: A Deep Dive into LLM Performance

The research evaluated 15 top LLMs, including GPT-4o, Gemini 2.5 Pro, Claude 3.7 Sonnet, and DeepSeek-R1, across six tasks: code generation, math, SQL, API calls, data-to-text, and document summarization. The results were eye-opening:

  • In single-turn, fully-specified scenarios, LLMs like GPT-4o achieved over 90% accuracy.
  • In multi-turn, underspecified conversations, accuracy dropped to around 60% on average, which the authors report as a 39% performance decline.

Why Do LLMs Get “Lost”?

The study identifies several reasons for this degradation, primarily tied to unreliability rather than a lack of capability:

  • Premature Assumptions: LLMs often make incorrect assumptions early in the conversation, leading to off-target responses.
  • Over-Reliance on Previous Answers: They compound errors by relying on their own (possibly incorrect) prior responses.
  • “Lost-in-the-Middle” Effect: LLMs disproportionately focus on the first and last turns, neglecting crucial details revealed in the middle.
  • Verbose Outputs: Overly long responses muddle context, confusing subsequent turns.

Implications for Developers

For developers building with LLMs, these findings highlight critical challenges:

  1. Prompt Engineering is Crucial: As Elvis notes, “I keep telling devs to spend time preparing those initial instructions.” Well-crafted prompts can mitigate some issues by consolidating requirements upfront.
  2. Multi-Turn Testing is Essential: High performance on single-turn benchmarks doesn’t translate to real-world conversations. Developers must test LLMs in dynamic, multi-turn settings (a minimal test harness is sketched after this list).
  3. Agentic Systems Are at Risk: Complex systems relying on multi-turn interactions (e.g., chatbots, virtual assistants) are particularly vulnerable to these issues.
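
To make point 2 concrete, here is a minimal Python sketch of such a test, assuming a generic chat-completion client. `call_llm`, `grade`, and the shards themselves are placeholders introduced for illustration, not from the study or any particular SDK. The idea mirrors the paper's setup: grade the same task once fully specified in a single turn, and once with the requirements revealed one shard per turn.

```python
from typing import Callable

def call_llm(messages: list[dict[str, str]]) -> str:
    # Assumed stand-in for your provider's chat API (e.g. an OpenAI or
    # Anthropic client). Wire in your actual LLM call here.
    raise NotImplementedError("connect your LLM client")

def run_single_turn(full_spec: str) -> str:
    """Baseline: the task fully specified in one turn."""
    return call_llm([{"role": "user", "content": full_spec}])

def run_multi_turn(shards: list[str]) -> str:
    """Reveal the same requirements one shard per turn, mimicking an
    underspecified conversation; only the final answer is graded."""
    messages: list[dict[str, str]] = []
    reply = ""
    for shard in shards:
        messages.append({"role": "user", "content": shard})
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply})
    return reply

def compare(full_spec: str, shards: list[str],
            grade: Callable[[str], bool]) -> tuple[bool, bool]:
    """Returns (single_turn_ok, multi_turn_ok) for one task instance."""
    return grade(run_single_turn(full_spec)), grade(run_multi_turn(shards))
```

Running many task instances, with repeated runs per instance, is what surfaces the unreliability the study emphasizes; a single pass can look deceptively fine.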

Practical Recommendations

The study and thread offer actionable advice for working with LLMs:

  • Consolidate Instructions: Users should provide all requirements in a single prompt rather than clarifying over multiple turns. If a conversation goes off-track, start a new session with a consolidated summary.
  • System-Level Fixes: Strategies like “recap” or “snowball” (repeating all previous instructions each turn) help but don’t fully restore reliability; see the sketch after this list. Lowering temperature (randomness) also has limited impact.
  • Focus on Reliability: Developers should prioritize improving LLM consistency in multi-turn contexts, not just raw capability.
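
As a rough illustration of the “snowball” idea, here is a Python sketch. Again, `call_llm` is an assumed stand-in for a chat API and the prompt wording is illustrative, not taken from the paper; the point is that each turn re-sends every instruction gathered so far as one consolidated spec.

```python
def call_llm(messages: list[dict[str, str]], temperature: float) -> str:
    # Assumed stand-in for a chat-completion API call.
    raise NotImplementedError("connect your LLM client")

class SnowballChat:
    """Each turn, restate all prior user instructions as a single
    consolidated prompt instead of trusting the model to track them."""

    def __init__(self, temperature: float = 0.2):
        # Per the study, low temperature alone has limited impact; the
        # recap does most of the work, and even that does not fully
        # restore single-turn reliability.
        self.temperature = temperature
        self.instructions: list[str] = []

    def send(self, new_instruction: str) -> str:
        self.instructions.append(new_instruction)
        recap = "\n".join(f"- {inst}" for inst in self.instructions)
        prompt = ("Here are all requirements so far; "
                  "treat them as one complete spec:\n" + recap)
        return call_llm([{"role": "user", "content": prompt}],
                        self.temperature)
```

A side benefit: the accumulated instruction list doubles as the “consolidated summary” to paste into a fresh session when a conversation goes off-track.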

Community Reactions

The X thread sparked a broader discussion in the community:

  • @Yuxi on the Wired argued that base LLMs could perform better in multi-turn settings if fine-tuned correctly, suggesting the issue lies in training data focused on single-turn interactions.
  • @DataRepublican expressed mixed feelings, noting that LLMs are powerful tools but often misapplied due to a lack of understanding of their limits.

What’s Next for LLMs?

This study raises important questions about LLMs’ path toward artificial general intelligence (AGI). Their conversational shortcomings could lead to misinterpretation in critical fields like academia, as commenters in the thread noted. Developers and researchers must address these reliability issues to unlock LLMs’ full potential in dynamic, human-like dialogues.

For more details, check out the full paper on arXiv and Elvis’s detailed notes in the original X thread.


Last updated: May 16, 2025