Shepherd is Stripe's fork of the open-source Chronon repo. It is our feature engineering platform for computing point-in-time correct data across both offline and online ML use cases, from batch training pipelines to low-latency model serving.
A lot of the work around it lives at the intersection of Scala, Java, Python, and distributed systems. The goal is to let ML teams define features once and then use those definitions consistently across training, backfills, and online lookup.
Core Abstractions
- Source: where raw data enters the system, like event streams, snapshots, or warehouse tables.
- GroupBy: the aggregation primitive for turning raw inputs into feature values over windows or buckets.
- Join: how multiple features are combined into a wide training or serving view with point-in-time correctness.
- StagingQuery: arbitrary Spark SQL for the pre- or post-processing that does not fit neatly elsewhere.
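To make the point-in-time correctness that Join provides concrete, here is a minimal plain-Python sketch (not the actual Shepherd/Chronon API; the function and variable names are illustrative assumptions): for each label row at timestamp t, only feature events with timestamp <= t may contribute, which prevents training-time leakage of future data.

```python
from bisect import bisect_right

def point_in_time_join(label_rows, feature_events):
    """For each (key, ts) label row, return the latest feature value
    whose event timestamp is <= ts. Hypothetical sketch, not the real API."""
    # Index feature events per key, with timestamps kept in sorted order.
    by_key = {}
    for key, ts, value in sorted(feature_events, key=lambda e: e[1]):
        timestamps, values = by_key.setdefault(key, ([], []))
        timestamps.append(ts)
        values.append(value)

    joined = []
    for key, ts in label_rows:
        timestamps, values = by_key.get(key, ([], []))
        # Events strictly after ts are excluded: no future leakage.
        idx = bisect_right(timestamps, ts)
        joined.append((key, ts, values[idx - 1] if idx > 0 else None))
    return joined

events = [("u1", 10, 1.0), ("u1", 20, 2.0), ("u2", 5, 9.0)]
labels = [("u1", 15), ("u1", 25), ("u2", 1)]
print(point_in_time_join(labels, events))
# [('u1', 15, 1.0), ('u1', 25, 2.0), ('u2', 1, None)]
```

Note that the third row gets `None`: at ts=1, user u2's only feature event (ts=5) has not happened yet, so a correct backfill must not see it.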
How It Works
The real win is that users never have to reason about historical backfills, fresh streaming updates, and low-latency feature lookup as three separate systems. Shepherd provides a DSL that abstracts most of that complexity away and orchestrates the right combination of underlying systems.
- Spark handles large batch processing for training and historical computation.
- Flink supports real-time, low-latency streaming updates.
- Airflow orchestrates the workflows.
- Kafka carries raw event streams.
- Key-value storage serves online feature lookup.
- Hive or Iceberg stores batch output in the data lake.
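To give a rough feel for the define-once model, the sketch below uses hypothetical dataclasses in place of the real DSL (every name here is an illustrative assumption, not Shepherd's actual API). One definition object carries enough information for a batch backfill, a streaming update, and an online lookup to agree on what the feature means.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the DSL's core abstractions;
# the real Shepherd/Chronon API differs.

@dataclass
class Source:
    table: str            # e.g. a Kafka topic or a warehouse table
    timestamp_col: str    # event-time column used for point-in-time joins

@dataclass
class GroupBy:
    source: Source
    keys: list            # entity keys to aggregate by
    aggregation: str      # e.g. "sum", "count", "last"
    window_days: int      # size of the aggregation window

@dataclass
class Join:
    left: Source                                     # driver table (e.g. labels)
    right_parts: list = field(default_factory=list)  # GroupBys to attach

# One definition drives batch backfills, streaming updates, and serving.
purchases = Source(table="events.purchases", timestamp_col="ts")
spend_7d = GroupBy(source=purchases, keys=["user_id"],
                   aggregation="sum", window_days=7)
training_set = Join(left=Source("labels.fraud", "ts"),
                    right_parts=[spend_7d])

print(training_set.right_parts[0].aggregation)  # sum
```

The point of the shape, regardless of exact syntax: because the aggregation, keys, and window live in one declarative object, Spark, Flink, and the serving layer can each compile the same definition into their own execution plan.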
Architecturally, the system ends up looking a lot like a Lambda architecture: a batch path optimized for correctness over historical data and a streaming path optimized for freshness, both feeding a shared view of the world that models can rely on.
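The batch/streaming split can be sketched as a tiny lambda-style merge at serving time, assuming an additive aggregation like a rolling sum (the helper name and watermark convention are hypothetical, not Shepherd's actual serving logic):

```python
def serve_feature(batch_value, streaming_deltas, batch_watermark):
    """Merge a precomputed batch aggregate with streaming events that
    arrived after the batch job's watermark. Additive features only."""
    # Only fold in deltas the batch path has not already counted,
    # so an event is never double-counted across the two paths.
    fresh = sum(v for ts, v in streaming_deltas if ts > batch_watermark)
    return batch_value + fresh

# Batch computed a spend of 100.0 up to ts=100; the delta at ts=90 is
# already inside the batch value, while the later two are not.
deltas = [(90, 5.0), (110, 3.0), (120, 2.0)]
print(serve_feature(100.0, deltas, batch_watermark=100))  # 105.0
```

Non-additive aggregations (e.g. "last value") need a different merge rule, which is exactly the kind of complexity the DSL is meant to hide from feature authors.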