Gap Between LLMs and Applying Them To Real-World Scenarios
There is a significant gap between what people expect from an LLM and what LLMs can reliably deliver when applied to real-world problems.
Current LLMs function primarily as sophisticated auto-complete engines with relatively weak reasoning capabilities. This limitation manifests in several ways:
- Sensitivity to Prompt Wording: When solving open-ended problems, LLMs can generate a range of divergent outcomes depending on subtle variations in the prompt.
- Pathways-Based UX: The most effective user experience for LLM-driven or agentic applications often revolves around a pathways approach, where intent classification serves as the initial step.
- Stochastic Reasoning: Due to their probabilistic nature, LLMs exhibit randomness in complex reasoning tasks, leading to inconsistent outcomes.
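The pathways approach above can be sketched minimally in Python. All names here (`classify_intent`, `handle`, the intent labels) are hypothetical illustrations, and the classifier is stubbed with keyword matching standing in for an LLM call constrained to a fixed label set:

```python
# Minimal sketch of a pathways-based UX (all names hypothetical).
# An intent classifier routes each request to a narrow, scoped
# handler instead of asking one open-ended prompt to do everything.

def classify_intent(message: str) -> str:
    """Stand-in for an LLM intent-classification call, constrained
    to a fixed label set so the output is easy to validate."""
    text = message.lower()
    if "refund" in text:
        return "refund"
    if "order" in text:
        return "order_status"
    return "other"

def handle(message: str) -> str:
    intent = classify_intent(message)
    # Each pathway is deliberately narrow, which limits how far a
    # misclassification or hallucination can propagate.
    handlers = {
        "refund": lambda m: "Routing to refund workflow",
        "order_status": lambda m: "Looking up order status",
        "other": lambda m: "Escalating to a human agent",
    }
    return handlers[intent](message)

print(handle("Where is my order?"))  # → Looking up order status
```

The key design choice is that the model's open-ended output is funneled into a small, validatable set of branches as the very first step.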
As a result, the most popular implementations today are co-pilot applications that rely heavily on human oversight to mitigate errors.
Why Agentic Use Cases Struggle with Probabilistic Systems
Anthropic recently released an insightful blog post on applying probability theory to LLM evaluation.
To summarize, one useful way to conceptualize LLM performance is:
- On any given run, an LLM is correct only with probability \( p \).

Suppose your agentic application requires the LLM to navigate a pathways-based UX flawlessly over a sequence of \( n \) steps. Assuming each step is an independent Bernoulli trial, the overall success probability is:

\( P(\text{success}) = p^n \)
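The compounding effect is easy to underestimate. A short sketch makes it concrete, assuming (as above) independent steps with identical per-step accuracy:

```python
# Compounded success probability for an n-step agentic flow,
# assuming each step succeeds independently with probability p.

def chain_success(p: float, n: int) -> float:
    return p ** n

# Even a strong 95% per-step accuracy erodes quickly:
for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
# 5 steps at 95% already drop the flow below 78% overall.
```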
Given this high compounded failure risk, developers have explored techniques such as majority voting and hedged execution, leveraging the "succeed-fast, fail-slow" principle. However, these guardrails often result in complex implementations and added overhead.
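One of these guardrails, majority voting, can be sketched as follows. `call_llm` is a hypothetical stand-in, simulated here as a Bernoulli trial with per-call accuracy `p` rather than a real model call:

```python
# Majority-voting sketch: run the same query k times and take the
# most common answer. `call_llm` is a simulated stand-in for a real
# LLM call, correct with probability p per call.
import random
from collections import Counter

def call_llm(rng: random.Random, p: float) -> str:
    """Simulated LLM call: returns the right answer with probability p."""
    return "correct" if rng.random() < p else "wrong"

def majority_vote(k: int, p: float, seed: int = 0) -> str:
    rng = random.Random(seed)
    votes = Counter(call_llm(rng, p) for _ in range(k))
    return votes.most_common(1)[0][0]

# With p = 0.8 and k = 5 votes, the majority is right far more often
# than a single call is -- at 5x the cost and latency.
print(majority_vote(k=5, p=0.8))
```

This illustrates the overhead trade-off the guardrails introduce: reliability improves, but every query now costs `k` model calls.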
So what can we do as practitioners?
We need to think carefully about the use case. Can it tolerate being correct only 80% of the time? And when it fails, how easily can a user detect and fix the error?
The best LLM use cases are those that involve low-entropy edits, allow many trials to be run cheaply, and make correctness easy to observe. Code editing is one example (Cursor, duh).
If you are interested in thoughtful applications of multi-modal LLMs and applying them in real-world scenarios, reach out on LinkedIn.