How Yelp Scaled LLM-Powered Search to Millions of Daily Queries: A 3-Step Production Framework
Discover how Yelp scaled LLM-powered search to handle millions of daily queries using a proven 3-step production framework. Learn the key infrastructure and optimization choices that made large-scale deployment reliable and efficient.

When you're processing millions of search queries daily, every improvement in understanding user intent translates directly to business impact. The Yelp engineering team recently shared fascinating insights about their journey to production-scale LLM implementation for search query understanding—and the results speak volumes about what's possible when you get the approach right.
Their systematic framework transformed fragmented, legacy search systems into intelligent, LLM-powered solutions that now handle millions of daily searches. More importantly, they achieved up to 100x cost savings compared to direct GPT-4 implementation while dramatically improving search relevance and user engagement metrics.
What makes their approach particularly valuable isn't just the impressive scale; it's the methodical three-step framework they developed, which any organization can adapt for its own LLM production challenges.
The Search Intelligence Challenge at Scale
Before diving into their solution, it's worth understanding the complexity of what Yelp's search system needs to accomplish. Every time someone types a query like "pet-friendly sf restaurants open now", the system must instantly decode multiple layers of intent: location preference, business attributes, timing constraints, and the core search topic.
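As a rough illustration of what "decoding multiple layers of intent" means in practice, the target output for such a query might look like the following structured record. The field names here are hypothetical, not Yelp's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QueryIntent:
    """Layers of intent the search system must extract from one raw query.
    Field names are illustrative, not Yelp's production schema."""
    topic: str                       # core search subject
    attributes: list = field(default_factory=list)  # e.g. "pet-friendly"
    location: Optional[str] = None   # explicit or implied geography
    time_constraint: Optional[str] = None  # e.g. "open now"

# Target decomposition for "pet-friendly sf restaurants open now":
intent = QueryIntent(
    topic="restaurants",
    attributes=["pet-friendly"],
    location="sf",
    time_constraint="open now",
)
print(intent.topic)  # restaurants
```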
According to the Yelp team, their legacy systems handled this through fragmented approaches: multiple systems stitched together, each addressing a piece of the puzzle but lacking the intelligence to truly understand nuanced user intent. The challenge wasn't just technical; it was directly impacting user experience and business metrics.
Consider the subtlety required for a search like "dinner before a broadway show". Traditional keyword matching might find restaurants near theaters, but it takes genuine language understanding to surface review snippets mentioning "pre-show dinner" or highlight businesses that specifically cater to theater-goers with appropriate timing and service styles.
The stakes were high: with millions of daily searches, even small improvements in query understanding could translate to significant increases in user engagement, business discovery, and ultimately revenue for both Yelp and the businesses on their platform.
The Strategic Decision Point
Rather than attempting a wholesale replacement of their search infrastructure, the Yelp team made a crucial strategic decision: they would focus on specific query understanding tasks where LLMs offered clear advantages over traditional approaches, and where the unique characteristics of search queries could work in their favor.
They identified several key advantages that made query understanding an ideal testing ground for LLM implementation:
Query-level caching potential: Unlike real-time conversation systems, search queries can be pre-processed and cached, dramatically reducing the need for real-time LLM calls.
Power law distribution: A relatively small number of popular queries account for a large percentage of total search volume, meaning you can achieve significant coverage by pre-computing responses for head queries.
Manageable text volumes: Query understanding tasks involve short inputs and outputs, making them cost-effective for LLM processing.
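The power-law advantage is easy to quantify. The sketch below estimates how much traffic a cache of head queries covers under a Zipf frequency distribution; the exponent is an assumption for illustration, not Yelp's measured data:

```python
def zipf_coverage(num_queries: int, top_k: int, s: float = 1.0) -> float:
    """Fraction of total search volume covered by pre-computing the top_k
    most frequent of num_queries distinct queries, assuming frequencies
    follow a Zipf law with exponent s (an illustrative assumption)."""
    weights = [1.0 / (rank ** s) for rank in range(1, num_queries + 1)]
    return sum(weights[:top_k]) / sum(weights)

# Caching just 1% of distinct queries covers a large share of traffic:
print(round(zipf_coverage(num_queries=1_000_000, top_k=10_000), 2))  # 0.68
```

The exact numbers depend on the real query distribution, but the shape of the curve is why pre-computing only head queries delivers outsized coverage.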
The team zeroed in on two specific applications that would serve as their proving ground: query segmentation (breaking queries into semantic components like location, topic, and timing) and review highlights (generating expanded phrase lists to match relevant review snippets).
The Three-Step Production Framework
Step 1: Formulation and Rapid Prototyping
The first phase focused on determining whether LLMs were actually the right tool for each specific problem. The Yelp team started with the most powerful model available, GPT-4, to quickly prototype and iterate on prompt design without worrying about cost or latency constraints.
During this phase, they made several critical discoveries that shaped their entire approach. For query segmentation, they realized that traditional Named Entity Recognition approaches were too rigid, while LLMs offered the flexibility to handle the nuanced ways people actually search. They settled on six semantic classes: topic, name, location, time, question, and none.
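A segmentation over those six classes might look like the snippet below. The labeling is hand-written for illustration (in production an LLM produces it), and the helper is only a schema check:

```python
# The six semantic classes the article names for query segmentation.
SEMANTIC_CLASSES = {"topic", "name", "location", "time", "question", "none"}

def validate_segmentation(segments: list[tuple[str, str]]) -> bool:
    """Check that every (span, label) pair uses one of the six classes."""
    return all(label in SEMANTIC_CLASSES for _, label in segments)

# Hand-labeled example for "dinner before a broadway show":
segments = [
    ("dinner", "topic"),
    ("before a broadway show", "time"),
]
print(validate_segmentation(segments))  # True
```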
More importantly, they discovered that related tasks could be combined effectively. Spell correction and query segmentation, which had been handled by separate systems, could be accomplished together by a sufficiently powerful model. This insight would prove crucial for both technical efficiency and cost optimization.
The team also experimented with Retrieval Augmented Generation (RAG), enhancing queries with additional context like business names that had been viewed for similar searches. This helped the model distinguish between business names and common topics, a critical capability for accurate segmentation.
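A minimal sketch of that RAG step, assuming the retrieved context is simply a list of business names viewed for similar searches (the prompt wording is illustrative, not Yelp's actual template):

```python
def build_rag_prompt(query: str, viewed_businesses: list[str]) -> str:
    """Assemble a segmentation prompt enriched with retrieved context:
    business names viewed for similar searches. Hypothetical template."""
    context = "\n".join(f"- {name}" for name in viewed_businesses)
    return (
        "Segment the search query into semantic parts "
        "(topic, name, location, time, question, none).\n"
        f"Businesses viewed for similar searches:\n{context}\n"
        f"Query: {query}\n"
    )

prompt = build_rag_prompt("mission chinese food", ["Mission Chinese Food"])
```

Seeing "Mission Chinese Food" in the context is what lets the model tag the whole query as a business name rather than splitting it into a topic plus a location.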
Step 2: Proof of Concept with Strategic Caching
Here's where the Yelp approach gets particularly clever. Rather than trying to immediately scale to all traffic, they leveraged the power law distribution of search queries to create an effective proof of concept with minimal infrastructure investment.
By caching LLM responses for only the most frequent queries above a certain threshold, they could cover a substantial portion of their traffic without the cost and latency of real-time LLM calls. This allowed them to integrate the cached responses into their existing systems and run meaningful A/B tests to validate the business impact.
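The proof-of-concept serving path can be sketched in a few lines, with a stand-in for the expensive GPT-4 call and a frequency threshold deciding what gets pre-computed (function names are hypothetical):

```python
from collections import Counter

def build_head_query_cache(query_log, min_count, llm_fn):
    """Pre-compute LLM responses only for queries seen at least min_count
    times; llm_fn stands in for the (expensive) GPT-4 call."""
    counts = Counter(query_log)
    return {q: llm_fn(q) for q, c in counts.items() if c >= min_count}

def understand(query, cache, legacy_fn):
    """Serve cached LLM output when available, else the legacy system."""
    return cache.get(query) or legacy_fn(query)

log = ["pizza sf"] * 5 + ["rare query"]
cache = build_head_query_cache(log, min_count=3, llm_fn=lambda q: f"llm:{q}")
print(understand("pizza sf", cache, legacy_fn=lambda q: f"legacy:{q}"))
# llm:pizza sf
print(understand("rare query", cache, legacy_fn=lambda q: f"legacy:{q}"))
# legacy:rare query
```

Because the cache sits in front of the existing pipeline, A/B tests can compare the two paths directly without any real-time LLM infrastructure.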
The results were compelling. For query segmentation, they achieved measurable improvements in downstream applications like location intent detection and business name matching. The implicit location rewrite feature alone, which uses segmentation to refine search geographic boundaries, showed clear user experience improvements.
For review highlights, the impact was even more dramatic. By expanding search queries into semantically related phrases for matching review snippets, they achieved significant increases in Session/Search Click-Through Rates across their platforms. The impact was particularly strong for less common queries in the long tail, where traditional keyword matching struggled most.
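The matching step can be sketched as scoring review sentences against the expanded phrase list. Simple substring matching stands in here for Yelp's production matcher, which the article does not detail, and the expansion itself is hypothetical:

```python
def match_snippets(expanded_phrases, review_sentences):
    """Rank review sentences by how many expanded query phrases they
    contain; substring matching is a stand-in for the real matcher."""
    scored = []
    for sentence in review_sentences:
        hits = sum(p in sentence.lower() for p in expanded_phrases)
        if hits:
            scored.append((hits, sentence))
    return [s for _, s in sorted(scored, reverse=True)]

# Hypothetical expansion of "dinner before a broadway show":
phrases = ["pre-show dinner", "near the theater", "quick service"]
reviews = [
    "Great pre-show dinner spot, quick service too.",
    "The pasta was average.",
]
print(match_snippets(phrases, reviews)[0])
# Great pre-show dinner spot, quick service too.
```

The expansion is what rescues long-tail queries: the raw query shares no keywords with the snippet, but its expanded phrases do.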
Step 3: Intelligent Scaling with Cost Optimization
The final phase addresses the challenge that kills many LLM projects: how to scale from a promising prototype to a production system that can handle millions of queries cost-effectively.
The Yelp team developed a multi-step process that achieves up to 100x cost savings compared to direct GPT-4 implementation:
Golden Dataset Creation: They used their refined GPT-4 prompts to generate high-quality training data on a representative sample of queries, focusing on diversity and quality over sheer volume.
Fine-tuning Cascade: Instead of trying to use the most powerful model for everything, they created a cascade of increasingly efficient models. GPT-4o-mini handles the bulk of pre-computed queries (tens of millions), while even smaller models like BERT and T5 serve real-time requests for long-tail queries.
Strategic Pre-computation: For review highlights, they scaled to 95% traffic coverage through pre-computed snippet expansions using OpenAI's batch processing, storing results in optimized key-value databases for fast retrieval.
Intelligent Fallbacks: For the remaining 5% of traffic not covered by pre-computation, they use the expanded phrases averaged over business categories as heuristics, ensuring no query goes completely unenhanced.
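The serving tiers above can be sketched as a single routing function. The function names and store shape are illustrative assumptions, not Yelp's actual interfaces:

```python
def route_query(query, precomputed_store, small_model_fn, category_fallback_fn):
    """Tiered serving path sketched from the article: pre-computed results
    cover most traffic; a small fine-tuned model (BERT/T5-sized) handles
    real-time long-tail requests; category-level averages are the
    last-resort heuristic. Names are illustrative."""
    if query in precomputed_store:       # batch-processed head queries
        return precomputed_store[query]
    result = small_model_fn(query)       # cheap real-time model
    if result is not None:
        return result
    return category_fallback_fn(query)   # category-averaged phrases

store = {"pizza sf": ["wood-fired pizza", "thin crust"]}
print(route_query("pizza sf", store, lambda q: None, lambda q: ["food"]))
# ['wood-fired pizza', 'thin crust']
print(route_query("obscure query", store, lambda q: None, lambda q: ["food"]))
# ['food']
```

Each tier trades a little quality for a large cost reduction, which is where the cumulative savings over direct GPT-4 calls come from.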

Measurable Business Impact
The results of this systematic approach speak to both the technical excellence and business value of the implementation. The review highlights system alone drove measurable increases in click-through rates across Yelp's platforms, with particularly strong performance for long-tail queries where users previously struggled to find relevant businesses.
The query segmentation improvements enabled more sophisticated features like implicit location rewriting. When someone searches for "restaurants near Chase Center" in San Francisco, the system now intelligently refines the search area to "1 Warriors Way, San Francisco, CA 94158", the actual venue location rather than the general city area.
Perhaps most impressive is the cost efficiency achieved through their scaling approach. By combining strategic caching, model fine-tuning, and intelligent fallbacks, they reduced per-query costs by up to 100x compared to direct GPT-4 usage while maintaining quality improvements across millions of daily searches.
The business impact extends beyond immediate metrics. Enhanced query understanding enables better business discovery, more relevant search results, and improved user satisfaction, all factors that drive long-term platform value and business growth.
Key Lessons for LLM Production Implementation
Start with High-Impact, Cacheable Use Cases
The most important insight from Yelp's experience is the value of choosing the right initial use cases. Query understanding worked exceptionally well because queries can be cached, text volumes are manageable, and the power law distribution makes comprehensive coverage achievable with reasonable compute investment.
Leverage Task Combination Opportunities
Don't assume that existing system boundaries should constrain your LLM implementation. The Yelp team's discovery that spell correction and query segmentation could be effectively combined not only improved accuracy but also reduced infrastructure complexity and costs.
Build a Scaling Strategy from Day One
The biggest mistake organizations make with LLM projects is treating scaling as an afterthought. Yelp's cascade approach, moving from expensive prototyping models to cost-optimized production models, should be planned from the beginning, not retrofitted after the concept is proven.
Use Power Law Distributions to Your Advantage
Many business applications have similar power law characteristics to search queries. Identifying these patterns in your domain can enable similar caching strategies that dramatically reduce the cost and complexity of LLM deployment.
Future-Proofing Production AI Systems
What's particularly forward-thinking about Yelp's approach is how they've built adaptability into their system architecture. As new LLM capabilities emerge, such as the advanced reasoning in models like GPT-4o and o1, their framework can incorporate these improvements without requiring complete system redesigns.
They're already experimenting with reasoning models for more complex query understanding tasks, while maintaining their core principle of systematic validation and gradual scaling. This balance between innovation and operational stability offers a template for organizations looking to stay at the forefront of AI capabilities without sacrificing system reliability.
The framework also demonstrates how to extract maximum value from AI investments by building systems that improve over time. The expanded phrases generated for review highlights now feed back into ranking model improvements, creating a virtuous cycle where better understanding leads to better results, which generates better training data for future improvements.
Conclusion: A Blueprint for Production AI Success
Yelp's journey from LLM experimentation to production-scale implementation offers a practical blueprint for organizations serious about deploying AI at scale. Their three-step framework (formulation, proof of concept, and intelligent scaling) provides a systematic approach that balances innovation with operational realities.
The key insight isn't just about the technical implementation, but about the strategic thinking that made it successful: choosing the right problems, designing for scale from the beginning, and building systems that improve over time rather than simply replacing existing functionality.
For organizations looking to move beyond AI pilot projects to production impact, the Yelp experience demonstrates that success comes not from having the most advanced AI, but from having the most thoughtful implementation strategy.