
How Databricks Cut Container Registry Costs by 80% While Handling 100x Traffic Spikes

Discover how Databricks reduced container registry costs by 80% while managing 100x traffic spikes. Learn their proven optimization strategies, cost-cutting techniques, and scaling approaches that you can apply to your own infrastructure. Get practical insights from a real success story.

Published on September 12, 2025

The Challenge: When Success Becomes a Problem

When your serverless platform needs to download tens of millions of container images daily, traditional solutions quickly hit their breaking point. That's exactly what happened to the team at Databricks as their serverless offerings exploded in popularity.

According to the Databricks engineering team, their transition to serverless products like DBSQL and ModelServing created an unprecedented challenge. While their internal services hummed along nicely with existing open-source container registries, customer-driven serverless workloads generated traffic spikes over 100x higher than anything they'd seen before. Peak traffic could surge unpredictably as customers launched new data warehouses or ML serving endpoints, creating a perfect storm of scalability issues.

The numbers were staggering: millions of VMs provisioned daily, each pulling 10+ container images from their registry. What worked fine for controlled internal deployments was about to buckle under the weight of customer success.

The Breaking Point: Why Open Source Wasn't Enough

The existing open-source container registry setup revealed three critical business risks that no growing company can afford to ignore:

Reliability became a liability. Complex architectures with multiple dependencies like relational databases created numerous failure points. When things went wrong, the blast radius was enormous, potentially affecting thousands of customers simultaneously.

Scaling was painfully slow and expensive. The vertically scaling relational databases and remote cache instances took 10+ minutes to scale up, an eternity in the serverless world. Organizations faced an impossible choice: under-provision and risk overload, or over-provision and burn money on unused capacity.

Operational costs spiraled out of control. The CPU-intensive nature of open-source registries made them prohibitively expensive at scale. Running them for Databricks' traffic volumes would have required massive infrastructure investments with questionable ROI.

Cloud-managed alternatives presented their own challenges. Different providers offered varying quotas, limitations, and performance characteristics. For a multi-cloud operation, this heterogeneity created operational complexity and still couldn't meet the demanding requirements of true serverless scale.

The Strategic Decision: Build vs. Buy Revisited

Faced with these constraints, the Databricks team reached a critical decision point. They could continue patching existing solutions, accept the limitations of cloud-managed services, or build something purpose-designed for their specific needs.

The evaluation was clear: no existing solution could handle the unique demands of their serverless workloads while maintaining the reliability and cost-effectiveness their business required. This wasn't just a technical decision; it was a strategic investment in their platform's future scalability.

The team decided to build the Artifact Registry, a homegrown, multi-cloud container registry optimized specifically for serverless workloads.

The Solution: Artifact Registry Architecture

Design Principles That Drive Results

The Artifact Registry was built on three core principles that directly address the business challenges:

Everything scales horizontally. Instead of relying on vertically scaling relational databases, they moved metadata to cloud object storage, an existing dependency that's infinitely more scalable. Remote cache instances were eliminated in favor of effective in-memory caching.

Scaling happens in seconds, not minutes. Extensive caching for image manifests and blob requests meant that only a few new instances needed to be provisioned instead of hundreds. Auto-scaling could respond to demand in seconds, rather than the 10+ minutes required by traditional setups.

Simplicity drives reliability. The minimalist design features just one component behind the load balancer and one cloud dependency (object storage). This dramatically reduces potential failure modes compared to complex multi-component architectures.
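The first principle can be sketched in a few lines: each registry instance keeps its own in-process manifest cache and falls back to object storage on a miss, so there is no shared cache tier or database to scale. This is an illustrative toy, not Databricks' actual code; `storage_get` stands in for whatever cloud blob-read call an instance would use.

```python
import threading

class ManifestCache:
    """In-process cache for image manifests: metadata lives in object
    storage (the single stateful dependency), hot entries stay in memory."""

    def __init__(self, storage_get):
        self._storage_get = storage_get  # hypothetical blob-read callable
        self._cache = {}
        self._lock = threading.Lock()

    def get_manifest(self, image_ref):
        with self._lock:
            if image_ref in self._cache:
                return self._cache[image_ref]  # cache hit: no remote call
        manifest = self._storage_get(f"manifests/{image_ref}")
        with self._lock:
            self._cache[image_ref] = manifest
        return manifest
```

Because every instance is self-contained, adding replicas behind the load balancer scales read capacity roughly linearly, which is what makes seconds-level auto-scaling feasible.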

Technical Innovation Meets Business Impact

The Artifact Registry essentially functions as a simple, stateless, horizontally scalable web service. This architectural choice eliminated the complexity that plagued their previous setup, while delivering the performance and reliability their serverless platform demanded.

Rather than managing multiple databases, cache layers, and coordination services, the team created a solution that could handle production peak traffic without requiring scale-out in most cases. When auto-scaling was needed, it happened in seconds rather than minutes.
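To make "simple, stateless web service" concrete, here is a rough illustration assuming the standard OCI Distribution pull route for manifests; the real service's routes and internals are not public, and `fetch_from_object_storage` is a hypothetical stand-in for the storage read.

```python
import json
import re
from http.server import BaseHTTPRequestHandler

# Standard OCI Distribution pull endpoint: GET /v2/<name>/manifests/<reference>
MANIFEST_ROUTE = re.compile(r"^/v2/(?P<name>.+)/manifests/(?P<ref>[^/]+)$")

def fetch_from_object_storage(name, ref):
    # Stub: in the real system this would read a manifest blob from cloud
    # object storage, the service's one cloud dependency.
    return {"schemaVersion": 2, "name": name, "ref": ref}

class RegistryHandler(BaseHTTPRequestHandler):
    # No per-instance state is required to answer a request, so any
    # replica can serve any pull and the load balancer can add or
    # remove replicas freely.
    def do_GET(self):
        m = MANIFEST_ROUTE.match(self.path)
        if not m:
            self.send_error(404)
            return
        body = json.dumps(fetch_from_object_storage(m["name"], m["ref"])).encode()
        self.send_response(200)
        self.send_header("Content-Type",
                         "application/vnd.oci.image.manifest.v1+json")
        self.end_headers()
        self.wfile.write(body)
```

Statelessness is the property doing the work here: with no session affinity or local writes, "scaling out" is just starting more copies of the same process.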

Implementation: Overcoming Real-World Challenges

Challenge 1: Multi-Cloud Complexity

Operating across multiple cloud providers presented unique challenges. Each cloud service has different APIs, performance characteristics, and reliability patterns. The team had to build abstractions that worked consistently across AWS, Azure, and GCP while optimizing for each platform's strengths.
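One common shape for such an abstraction is a deliberately narrow storage interface with one implementation per cloud; the interface and method names below are illustrative, not Databricks' actual design.

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Narrow interface the registry codes against. Each cloud gets its
    own implementation behind the same two methods, keeping
    provider-specific quirks out of the request path."""

    @abstractmethod
    def get(self, key: str) -> bytes: ...

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

class InMemoryBlobStore(BlobStore):
    # Stand-in used for tests; real implementations would wrap the
    # S3, Azure Blob Storage, and GCS SDKs respectively.
    def __init__(self):
        self._blobs = {}

    def get(self, key):
        return self._blobs[key]

    def put(self, key, data):
        self._blobs[key] = data
```

Keeping the interface small is what lets the team optimize per platform (retries, parallelism, consistency handling) inside each implementation without the rest of the service caring which cloud it runs on.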

Challenge 2: Cache Optimization

Designing effective in-memory caching required a deep understanding of access patterns. The team analyzed millions of requests to identify hot paths and optimize cache hit rates. This wasn't just about technical performance: every cache miss translated directly into increased latency and cost.
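The kind of access-pattern analysis described above can be approximated by replaying a request log through a simulated cache and measuring the hit rate at different sizes. This is a toy model of the technique, not Databricks' tooling:

```python
from collections import OrderedDict

def replay_hit_rate(requests, capacity):
    """Replay a request log through a simulated LRU cache of the given
    capacity and return the fraction of requests served from cache."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests) if requests else 0.0
```

Sweeping `capacity` across a day of logs shows where additional cache memory stops paying for itself, which is exactly the latency-versus-cost tradeoff the paragraph describes.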

Challenge 3: Disaster Recovery Planning

Perhaps the most critical implementation challenge was designing for cloud provider outages. When regional object storage becomes unavailable (sometimes for hours), traditional solutions simply fail. The team implemented geo-based failover that allows seamless recovery with acceptable tradeoffs in latency and egress costs.
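Geo-based failover of this kind can be sketched as an ordered fallback over regional storage endpoints. The region names and reader function below are hypothetical; only the fallback pattern reflects the approach described above.

```python
def read_with_failover(key, regions, read_from_region):
    """Try the local region first, then fail over to remote regions in
    order. Remote reads cost extra latency and egress, the tradeoff the
    team accepted in exchange for surviving a regional outage."""
    last_error = None
    for region in regions:  # e.g. ["us-east-1", "us-west-2", "eu-west-1"]
        try:
            return read_from_region(region, key)
        except IOError as err:
            last_error = err  # region unavailable: try the next one
    raise RuntimeError(f"all regions failed for {key!r}") from last_error
```

Because container image data in object storage is immutable and content-addressed, reading the same blob from a sibling region is safe; the failover only changes where bytes come from, not what they are.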

Results: Transforming Performance and Economics

The business impact of the Artifact Registry exceeded expectations across every metric that matters:

Performance Improvements

  • 90% reduction in P99 latency: dramatically improving user experience during peak loads
  • Instant scaling: from 10+ minute scale-up times to seconds
  • Consistent performance: even during 100x traffic spikes

Cost Optimization

  • 80% reduction in CPU resource usage: translating to massive infrastructure cost savings
  • Dramatic instance reduction: from thousands of instances to just a few for the same workload
  • Eliminated over-provisioning waste: pay only for what you actually use

Reliability Enhancement

  • Minimal failure modes: simplified architecture reduces potential points of failure
  • Regional outage survival: automatic failover keeps services running even during cloud provider issues
  • Zero-downtime scaling: handle traffic spikes without service interruption

These improvements meant that Databricks could continue scaling their serverless offerings without worrying about infrastructure bottlenecks or spiraling operational costs.

Key Lessons for Growing Organizations

1. Question Assumptions About “Standard” Solutions

Just because a technology works for typical use cases doesn't mean it's suitable for your scale or requirements. Open-source solutions often optimize for flexibility over performance, which may not align with your business needs.

2. Simplicity Is a Competitive Advantage

Complex architectures create complex failure modes. The most reliable systems typically have the fewest components and dependencies. Every additional layer should provide clear, measurable business value.

3. Design for Your Specific Requirements

Cloud-managed services offer convenience, but may not optimize for your specific use patterns. Understanding your unique requirements can reveal opportunities for significant performance and cost improvements.

4. Plan for Provider Failures

Even the most reliable cloud services occasionally fail. Building resilience across regions and providers isn't just about technical reliability; it's about maintaining customer trust and business continuity.

5. Performance Improvements Compound

Small optimizations in core infrastructure can have massive downstream effects. A 90% latency reduction doesn't just improve user experience; it enables entirely new use cases and business models.

Looking Forward: The Foundation for Future Growth

The Artifact Registry has become more than just a container registry; it's a strategic foundation that enables Databricks to continue scaling their serverless platform without infrastructure constraints. The system now handles all container registry use cases across the organization, proving that purpose-built solutions can deliver both immediate results and long-term value.

As serverless computing continues to evolve, having infrastructure that can seamlessly handle unpredictable, customer-driven workloads becomes increasingly critical. The investment in building the Artifact Registry has positioned Databricks to capitalize on future growth opportunities without worrying about scalability bottlenecks.

The success of this project demonstrates that sometimes the best solution isn't to adapt existing tools, but to build exactly what your business needs to thrive.
