
How Yelp Cut Infrastructure Costs by 25% While Boosting Kubernetes Performance

Published on September 28, 2025

Introduction

The challenge of scaling cloud infrastructure efficiently while controlling costs is one that keeps engineering leaders awake at night. When you're running massive Kubernetes clusters that need to respond to rapidly changing workloads, every decision about resource allocation directly impacts both performance and your bottom line. The Yelp engineering team recently shared their fascinating journey of replacing their custom autoscaling solution with AWS Karpenter, achieving a remarkable 25% improvement in spending efficiency while dramatically enhancing system responsiveness.

What makes this transformation particularly compelling isn't just the impressive cost savings, but how they solved the fundamental tension between maintaining adequate capacity for critical workloads and avoiding expensive over-provisioning. Their experience offers valuable insights for any organization wrestling with Kubernetes scaling challenges and the hidden costs of maintaining custom infrastructure solutions.

The Hidden Costs of Custom Infrastructure Solutions

Yelp had been running their internally developed autoscaler, Clusterman, for years; it was initially designed for Mesos clusters and later adapted for Kubernetes. On paper, Clusterman seemed like a solid solution: it managed pools of nodes backed by AWS Auto Scaling Groups and maintained desired resource ratios through a configuration value called the "setpoint". The tool even included sophisticated features for node recycling and scaling simulation.

However, the Yelp team discovered that maintaining this custom solution was becoming increasingly expensive in ways that weren't immediately obvious. The challenge wasn't just the engineering resources required to maintain custom code, but the operational inefficiencies that emerged as their workloads became more diverse and demanding.

The most problematic issue was what they described as an "endless cycle of scaling up and down". When Clusterman attempted to maintain the configured resource ratio, it would sometimes delete nodes to increase efficiency, causing pods to become unschedulable. This would trigger the launch of new instances, which could then disrupt the ratio again, creating a costly cycle of constant adjustment.

According to the Yelp team, the tool's interval-based approach, checking every few minutes rather than responding to real-time events, meant it often struggled to keep up with rapidly changing workload demands. For organizations running dynamic, business-critical applications, these delays translate directly into poor user experience and potentially lost revenue.

The Breaking Point: When Workload Diversity Demands Better Solutions

The tipping point came when Yelp's engineering teams began running increasingly diverse workloads with specific requirements that Clusterman couldn't efficiently handle. Machine learning teams needed different GPU configurations for various models. Stateful applications required instances in specific availability zones where their persistent volumes were located. Other workloads had strict topology constraints and affinity rules.
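Constraints like these are expressed through standard Kubernetes scheduling primitives rather than anything Clusterman-specific. As an illustrative sketch (the pod name, zone, and image are hypothetical, not taken from Yelp's setup), a stateful workload can pin itself to the availability zone that holds its persistent volume using node affinity:

```yaml
# Hypothetical example: pin a stateful pod to the AZ holding its volume
apiVersion: v1
kind: Pod
metadata:
  name: stateful-worker
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["us-west-2a"]   # AZ where the persistent volume lives
  containers:
    - name: app
      image: example/app:latest          # placeholder image
```

With a node-group-based autoscaler, satisfying this kind of constraint typically means carving out a dedicated pool per zone or per hardware profile, which is exactly the pool sprawl described below.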

Each of these requirements meant creating new pools and managing additional Clusterman instances, multiplying operational complexity with every new workload type. The Yelp team found themselves spending more time managing their autoscaling infrastructure than optimizing their actual applications.

The most telling requirement that emerged from their internal discussions was simple but profound: "Find the right instances for my dynamic workload requirements." This wasn't just about scaling up and down; it was about intelligent resource matching that could adapt to diverse, changing needs without constant manual intervention.

The business impact was becoming clear: engineering teams were being slowed down by infrastructure constraints, and the company was paying for both over-provisioned resources and the engineering time required to manage an increasingly complex custom solution.

Evaluating the Path Forward: Build vs. Buy Revisited

When evaluating alternatives, the Yelp team considered the standard Kubernetes Cluster Autoscaler but quickly realized it would present similar challenges to Clusterman. The fundamental limitation was the same: organizing nodes into groups where all nodes must be identical, which couldn't accommodate their diverse workload requirements.

AWS Karpenter emerged as the clear winner because it took a fundamentally different approach. Instead of managing predefined node groups, Karpenter evaluates pending pods and provisions exactly the right instances based on actual workload requirements. This pod-first approach meant workloads could specify their needs through standard Kubernetes mechanisms like node selectors and affinity rules, without requiring new infrastructure configurations.
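The article doesn't share Yelp's actual configuration, but a minimal Karpenter NodePool illustrates the pod-first model: rather than defining a fixed node group, you declare constraints and let Karpenter choose instance types for pending pods. This is a sketch using the Karpenter v1 API (`karpenter.sh/v1`); field names differ in older beta versions, and the names and limits here are hypothetical:

```yaml
# Hypothetical sketch of a Karpenter NodePool (v1 API)
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Allow spot and on-demand; Karpenter picks instance types per pod needs
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "1000"   # hard cap on total provisioned CPU for this pool
```

Pods then express their needs through ordinary node selectors, affinity rules, and resource requests, and Karpenter provisions instances that satisfy them without any pool-per-requirement bookkeeping.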

The decision criteria weren't purely technical; they were strongly business-focused. The team needed an autoscaler that could react to pending pods in seconds rather than minutes, maintain cost efficiency, and reduce the operational burden of managing scaling infrastructure. Karpenter's event-driven architecture promised to deliver on all these requirements while providing several additional business benefits.

Implementation: Strategic Migration Without Disruption

The migration strategy reveals important lessons about managing infrastructure transitions in production environments. Rather than attempting a big-bang replacement, the Yelp team implemented a gradual transition that minimized risk while providing immediate feedback on performance improvements.

Their approach involved gradually scaling down Auto Scaling Group capacity while allowing Karpenter to detect and respond to the resulting unschedulable pods. This created a natural handoff mechanism where Clusterman would remove nodes at a controlled pace, and Karpenter would provision new instances based on actual workload requirements.

The success of this approach hinged on a crucial preparation step they had implemented earlier: Pod Disruption Budgets (PDBs) for all workloads. These PDBs protected applications from voluntary disruptions during the migration, ensuring business continuity throughout the transition.
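A PDB is a standard Kubernetes object (`policy/v1`). A minimal example, with hypothetical names, that keeps at least two replicas of a service running through voluntary disruptions such as the node drains a migration like this triggers:

```yaml
# Hypothetical PodDisruptionBudget: never drain below 2 ready replicas
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app
```

Because both Clusterman's node removals and Karpenter's consolidation respect PDBs, having them in place for every workload before the migration meant node churn could proceed without taking services below their availability floor.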

To maintain visibility and control during the migration, they built a comprehensive monitoring dashboard tracking key metrics including resource costs, spot interruption rates, autoscaler efficiency, and workload error rates. This real-time visibility allowed them to course-correct quickly and demonstrate the business impact of the migration to stakeholders.

Transformative Results: Performance and Cost Improvements

The results of the migration exceeded expectations across multiple dimensions. The most striking improvement was the 25% increase in spending efficiency: in effect, Yelp now gets 25% more computational value for every dollar spent on infrastructure.

Performance improvements were equally impressive. Karpenter's event-driven architecture reduced response times from minutes to seconds when scaling resources. This dramatic improvement in responsiveness directly translated to better user experience during traffic spikes and more efficient resource utilization during normal operations.

The scalability improvements addressed a growing concern about Clusterman's memory-intensive approach. As clusters grew larger, Clusterman's practice of storing all pod and node information in memory created potential bottlenecks and out-of-memory risks. Karpenter's streamlined approach eliminated these concerns while providing better performance.

Perhaps most importantly for the engineering teams, Karpenter eliminated the need to create and manage separate pools for different workload requirements. Teams could now specify their compute needs directly through Kubernetes mechanisms, dramatically reducing the operational overhead of running diverse workloads.

Strategic Lessons for Infrastructure Decision-Making

Several key insights emerged from Yelp's experience that apply broadly to infrastructure management decisions:

The True Cost of Custom Solutions: While custom infrastructure solutions can seem cost-effective initially, the total cost of ownership includes ongoing maintenance, operational complexity, and opportunity costs. As workloads become more diverse, these hidden costs can quickly outweigh the benefits of custom solutions.

Event-Driven Architecture Matters: The difference between interval-based and event-driven scaling isn't just technical; it has direct business impact. Faster response times translate to better user experience and more efficient resource utilization.

Configuration Alignment is Critical: One of their most important lessons involved ensuring that external configurations (like kubelet settings and storage requirements) were properly communicated to the autoscaling system. Misalignments in these areas can completely block migrations or create inefficient resource allocation.
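In Karpenter's v1 APIs, kubelet settings live on the EC2NodeClass precisely so the autoscaler can account for them when computing a node's allocatable capacity. A hedged sketch of what this alignment looks like (the selector tags, IAM role, and values below are hypothetical, and field placement varies across Karpenter versions):

```yaml
# Hypothetical EC2NodeClass showing kubelet settings the autoscaler must know about
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole               # hypothetical IAM role name
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster   # hypothetical discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  kubelet:
    maxPods: 110                        # must match what nodes actually run with
    systemReserved:
      cpu: 500m
      memory: 1Gi
```

If values like `maxPods` or system reservations diverge between what the autoscaler assumes and what the kubelet actually enforces, the autoscaler either over-packs nodes (causing evictions) or under-packs them (wasting capacity), which is the class of misalignment the Yelp team flagged.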

Monitoring During Transitions: Comprehensive monitoring during infrastructure transitions isn't just about catching problems; it's about demonstrating business value and building confidence in the new solution among stakeholders.

The Broader Implications for Cloud Infrastructure Strategy

Yelp's experience reflects a broader trend in cloud infrastructure management: the shift from managing infrastructure directly to leveraging intelligent, cloud-native solutions that can adapt to workload requirements automatically. This transition represents a maturation of cloud infrastructure tools to the point where they can often outperform custom solutions while reducing operational burden.

For organizations currently running custom infrastructure solutions, the key question isn't whether these tools work, but whether they're the best use of engineering resources. The opportunity cost of maintaining custom infrastructure, the features and improvements that could be built instead, often exceeds the perceived benefits of maintaining control over every aspect of the infrastructure stack.

The success of Karpenter in Yelp's environment also demonstrates the value of cloud-native solutions that are designed specifically for modern, dynamic workloads. These tools can leverage cloud provider APIs and Kubernetes primitives in ways that would be difficult and expensive to replicate in custom solutions.

Looking Forward: Infrastructure as a Competitive Advantage

Yelp's transformation illustrates how infrastructure decisions can directly impact competitive advantage. By reducing the operational overhead of managing scaling infrastructure, their engineering teams can focus more time on building features and improving user experience. The cost savings and performance improvements provide additional resources for innovation and growth.

The migration also positions Yelp to take advantage of future improvements in cloud-native infrastructure tools without additional engineering investment. As Karpenter and similar tools continue to evolve, organizations using these solutions benefit automatically from ongoing improvements.

For engineering leaders considering similar transitions, Yelp's experience suggests that the business case for moving from custom infrastructure solutions to cloud-native alternatives is becoming increasingly compelling. The combination of cost savings, performance improvements, and reduced operational complexity creates a powerful argument for embracing these newer approaches.

VegaStack Blog

VegaStack Blog publishes articles about CI/CD, DevSecOps, Cloud, Docker, Developer Hacks, DevOps News and more.
