
How Stripe Cut ML Development Time by 75% While Blocking Tens of Millions in Additional Fraud

Discover how Stripe reduced ML development time by 75% while blocking tens of millions in additional fraud. Learn their proven machine learning strategies, model optimization techniques, and fraud detection approaches. Get real-world insights on building efficient ML systems that scale and protect.

Published on December 5, 2025

The Challenge That Changed Everything

When fraud patterns shift overnight and every millisecond of payment processing delay costs real money, your machine learning infrastructure had better be bulletproof. That's exactly what Stripe discovered as their global payments platform processed billions of transactions while fraudsters constantly evolved their attack methods.

According to the Stripe team, their ML applications add hundreds of millions of dollars to the internet economy each year. But behind this success was a growing infrastructure challenge: how do you rapidly develop and deploy ML features across hundreds of terabytes of data while maintaining sub-second latency requirements? The answer came through an innovative partnership with Airbnb and the adaptation of their Chronon platform into what became Shepherd, Stripe's next-generation ML feature engineering platform.

The results speak volumes: their first Shepherd-powered fraud detection model now blocks tens of millions of dollars in additional fraud annually while achieving 150ms feature freshness, a game-changing improvement in ML development speed and effectiveness.

The Scale of Stripe's ML Feature Challenge

Why Traditional Feature Engineering Wasn't Cutting It

At Stripe's scale, machine learning feature development isn't just complex; it's a business-critical bottleneck. The company's fraud detection systems, powering Stripe Radar, need to process features across massive datasets while meeting two seemingly contradictory requirements:

Ultra-low latency: Features must be retrieved instantly during payment processing, as any delay directly impacts the customer payment experience and overall API performance.

Maximum freshness: Feature values must update within milliseconds to catch rapidly evolving fraud patterns, especially when unusual transaction spikes occur that could signal coordinated attacks.

Traditional feature engineering platforms force teams to choose between these priorities. You can precompute everything for speed but sacrifice the ability to react quickly to new patterns. Or you can compute features on-demand for freshness but introduce unacceptable latency into payment flows.

The Stripe team realized they needed something entirely different: a platform that could deliver both speed and freshness while handling their unique scale of thousands of features across billions of training data rows.

The Strategic Decision Point

Evaluating the Build vs. Adapt vs. Buy Decision

When Stripe's ML Features team evaluated their options, they faced the classic infrastructure decision: build from scratch, revamp existing systems, or implement an external solution. Each path had significant trade-offs in time, resources, and risk.

The breakthrough came through an unexpected opportunity. Airbnb approached Stripe to become an early external adopter of Chronon, the internal ML feature platform it planned to open source. This presented a unique middle path: adapting a proven, production-tested platform rather than starting from zero.

Chronon offered compelling technical advantages: an intuitive Python and SQL-based API, efficient windowed aggregations, support for both online and offline feature computation, and built-in consistency monitoring. But it had never been tested at Stripe's scale or adapted for their specific latency and freshness requirements.
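Chronon's Python API expresses features as declarative windowed aggregations over event sources. As a rough, stdlib-only sketch of that declarative shape (the class names and fields below are illustrative stand-ins, not Chronon's actual API):

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for Chronon-style primitives; the real
# Chronon API differs. This only sketches the declarative shape.
@dataclass
class EventSource:
    table: str            # e.g. a Hive/Iceberg table of raw events
    time_column: str      # event timestamp used for windowing

@dataclass
class Aggregation:
    column: str           # input column to aggregate
    operation: str        # e.g. "SUM", "COUNT", "AVG"
    windows: list         # window lengths, e.g. ["1h", "1d", "7d"]

@dataclass
class GroupBy:
    source: EventSource
    keys: list            # entity keys, e.g. ["card_id"]
    aggregations: list = field(default_factory=list)
    online: bool = False  # serve online as well as compute offline

# A hypothetical fraud feature: charge volume per card over 3 windows.
charge_volume = GroupBy(
    source=EventSource(table="payments.charges", time_column="created_at"),
    keys=["card_id"],
    aggregations=[Aggregation(column="amount", operation="SUM",
                              windows=["1h", "1d", "7d"])],
    online=True,
)

print(charge_volume.keys, charge_volume.aggregations[0].windows)
```

The appeal of this style is that one definition drives both the offline backfill and the online serving path, which is what makes the consistency monitoring mentioned above possible.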

The decision ultimately came down to confidence in the foundational architecture. While significant engineering work would be required to adapt Chronon, the core platform had already proven itself in Airbnb's production environment.

Building Shepherd: The Technical Innovation

Solving the Dual KV Store Challenge

The first major adaptation involved reimagining how feature data gets stored and served. Chronon's default key-value store implementation couldn't meet Stripe's cost and performance requirements simultaneously.

The Stripe team's solution was elegantly simple: split storage by usage pattern. They implemented a dual KV store architecture with a lower-cost store optimized for bulk uploads (write-once, read-many) and a higher-cost distributed memcache-based store for frequent updates (write-many, read-many).

This approach delivered the best of both worlds: cost-efficient storage for historical data with high-performance serving for real-time features. The architecture allowed them to meet strict latency requirements without breaking the budget on storage costs.
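A minimal sketch of the dual-store idea, with plain dicts standing in for the two stores (the class, the routing rule, and the key names are illustrative, not Stripe's implementation):

```python
# Sketch of the dual-KV idea: route each write to a cheap bulk store
# or a fast hot store by write pattern. Dicts stand in for the real
# backing stores; routing and precedence rules are assumptions.
class DualKVStore:
    def __init__(self):
        self.bulk = {}   # lower-cost store: write-once, read-many (batch uploads)
        self.hot = {}    # memcache-like store: write-many, read-many (streaming)

    def put(self, key, value, streaming=False):
        target = self.hot if streaming else self.bulk
        target[key] = value

    def get(self, key):
        # Streaming updates take precedence over the batch snapshot.
        return self.hot.get(key, self.bulk.get(key))

store = DualKVStore()
store.put(("card_123", "charge_sum_7d"), 420.0)                 # nightly batch upload
store.put(("card_123", "charge_sum_1h"), 35.0, streaming=True)  # live streaming update
print(store.get(("card_123", "charge_sum_1h")))  # 35.0
```

The design choice here is that the read path stays uniform: callers never need to know which store a feature lives in.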

The Streaming Revolution: From Events to Tiles

Perhaps the most innovative adaptation involved completely rethinking how streaming feature updates work. Chronon's default implementation stores individual events in the KV store with no pre-aggregation. At Stripe's transaction volumes, this approach would have made it impossible to meet their latency targets.

The team chose Apache Flink as their streaming platform for its low-latency stateful processing capabilities. But they needed Flink to understand Spark SQL expressions to maintain consistency between offline and online computation, a significant technical challenge.

Their breakthrough was recognizing that Chronon's feature definitions only require narrow transformations (maps and filters) with no data shuffling. This allowed them to implement Spark SQL expression support for individual Flink rows.
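Because narrow transformations touch one row at a time, the same map/filter logic can run inside a single Flink operator with no shuffle. A stdlib sketch of per-row projection and filtering, with the expressions and column names invented for illustration:

```python
# Narrow transformations operate on one row at a time, so no
# cross-row state or data shuffling is needed. The expressions below
# are illustrative stand-ins for Spark SQL expressions.
def project(row, exprs):
    # map step: compute derived columns from a single row
    return {name: fn(row) for name, fn in exprs.items()}

def keep(row, pred):
    # filter step: a per-row decision, again shuffle-free
    return pred(row)

exprs = {"amount_eur": lambda r: r["amount_cents"] / 100.0,
         "is_card": lambda r: r["method"] == "card"}

row = {"amount_cents": 1250, "method": "card"}
if keep(row, lambda r: r["amount_cents"] > 0):
    print(project(row, exprs))  # {'amount_eur': 12.5, 'is_card': True}
```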

The real innovation came with "tiling": instead of storing individual events, they maintain pre-aggregated feature values in the Flink application and periodically flush these "tiles" to the KV store. Computing features now requires retrieving and aggregating tiles rather than thousands of individual events, dramatically reducing latency while maintaining freshness.
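A toy sketch of tiling for a SUM aggregation, assuming 5-minute tiles (the tile size, keys, and the single in-memory dict standing in for both Flink state and the KV store are all illustrative; the periodic flush is elided):

```python
# Tiling sketch: instead of writing every event to the KV store, keep
# a running pre-aggregate per (key, tile) and serve windows by summing
# a handful of tiles instead of thousands of raw events.
from collections import defaultdict

TILE_SECONDS = 300  # 5-minute tiles (an assumption for this sketch)

tiles = defaultdict(float)  # (entity_key, tile_start) -> partial SUM

def on_event(key, amount, ts):
    tile_start = ts - ts % TILE_SECONDS
    tiles[(key, tile_start)] += amount  # pre-aggregate in place

def window_sum(key, window_start, window_end):
    # Aggregate only the tiles covering the window, not raw events.
    return sum(v for (k, start), v in tiles.items()
               if k == key and window_start <= start < window_end)

for ts, amt in [(10, 5.0), (20, 7.5), (301, 2.5)]:
    on_event("card_123", amt, ts)

print(window_sum("card_123", 0, 600))  # 15.0
```

Three events collapse into two tiles here; at billions of events the read-time savings is what makes the latency targets reachable.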

Scaling Offline Processing

Stripe's offline requirements pushed Chronon beyond its previous use cases. Training datasets with thousands of features and billions of rows required careful optimization and integration work.

The team built custom integrations with Stripe's highly customized Airflow setup, allowing users to simply mark GroupBys as online or set offline schedules; the system automatically handles job scheduling and orchestration.

They also expanded data source flexibility beyond Chronon's standard partitioned Hive tables to support Stripe's diverse data warehouse architecture, including unpartitioned snapshot tables and custom Iceberg writers.

Real-World Results: The SEPA Fraud Model Success

From Concept to Production Impact

The true test of Shepherd came with building a new fraud detection model for SEPA (Single Euro Payments Area) transactions. The model was initially planned as a hybrid combining Shepherd features with legacy platform features, but the development experience was so streamlined that the team built an entirely Shepherd-based model.

The new SEPA fraud model demonstrates Shepherd's capabilities at scale:

  • Over 200 features combining batch-only and streaming data
  • Comprehensive monitoring including Chronon's online-offline consistency checks
  • Advanced training-serving skew prevention through modeling delays in offline training data
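One way to model serving delays offline, sketched here with the article's 150ms freshness figure and otherwise invented data: only join feature values that were at least that old at decision time, so training sees what the online store could actually have returned.

```python
# Sketch of modeling serving delay in offline training data. When
# joining a feature to a labeled event, use only snapshots that were
# at least SERVING_DELAY_MS old at decision time, mirroring what the
# online path could have served. Data shapes are illustrative.
SERVING_DELAY_MS = 150  # the article's p99 feature freshness

# (timestamp_ms, value) snapshots of one feature, oldest first
snapshots = [(1_000, 3.0), (2_000, 4.0), (2_950, 9.0)]

def feature_as_served(ts_ms):
    cutoff = ts_ms - SERVING_DELAY_MS
    usable = [v for t, v in snapshots if t <= cutoff]
    return usable[-1] if usable else None

# A charge scored at t=3,000 must not see the t=2,950 update:
print(feature_as_served(3_000))  # 4.0
```

Without this lag, training data would contain feature values the online store could never have returned in time, inflating offline metrics relative to production.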

Most importantly, the model delivers measurable business impact: tens of millions of dollars in additional fraud blocked annually compared to the previous model.

Performance Metrics That Matter

The technical achievements translate directly to business value:

  • 150ms p99 feature freshness: Nearly real-time response to changing fraud patterns
  • Sub-second feature retrieval: No impact on payment processing latency
  • 75% reduction in feature development time: From concept to production deployment
  • Seamless scaling: Handling billions of transactions without performance degradation

Key Lessons for ML Infrastructure Teams

The Power of Strategic Partnerships

Stripe's experience demonstrates the value of strategic technology partnerships over pure build-or-buy decisions. By partnering with Airbnb on Chronon, both organizations accelerated their capabilities while contributing to the broader ML infrastructure community.

Adaptation Over Adoption

No platform works perfectly out of the box at enterprise scale. Stripe's success came from thoughtful adaptation: understanding which components needed modification and which innovations could be contributed back to the open source community.

Business Impact First, Technical Elegance Second

Every technical decision was evaluated through the lens of business impact. The dual KV store architecture wasn't the most elegant solution, but it delivered the right balance of cost and performance for Stripe's needs.

Community Contribution Creates Competitive Advantage

By contributing their Flink integration and tiling innovations back to Chronon, Stripe strengthened the platform for everyone while positioning themselves as thought leaders in ML infrastructure.

The Future of ML Feature Engineering

Stripe's work with Shepherd represents more than just solving their immediate challenges; it demonstrates a new model for ML infrastructure development that balances innovation with collaboration.

As machine learning becomes increasingly critical to business operations, the lessons from Stripe's Shepherd implementation offer a roadmap for other organizations facing similar scale and performance challenges. The combination of strategic adaptation, community contribution, and relentless focus on business outcomes provides a template for successful ML infrastructure transformation.

The tens of millions in additional fraud blocked annually isn't just a metric; it's proof that the right technical architecture can deliver immediate, measurable business value while positioning organizations for future growth.

VegaStack Blog

VegaStack Blog publishes articles about CI/CD, DevSecOps, Cloud, Docker, Developer Hacks, DevOps News and more.
