data engineering · 2026 · ● live

KitchenSync Food Forecasting System

Live kitchen production system. ML cuts stockouts 40% and lifts service level +1.6pp, but adds +3.3pp waste. The pipeline quantifies the trade-off nightly.

view on github ↗

live streamlit dashboard · open dashboard ↗

This is what kitchen staff cook from all shift — live Kitchen and Chicken production queues plus running waste and sales metrics, refreshed every 5 minutes.

live power bi dashboard · open dashboard ↗

Stores ranked by waste rate, with drill-through to individual store stats (sales and waste trends, highest-stockout items), plus an item performance page sliced by category or store.

live results · day 36 of a/b test

2026-06-17 through 2026-07-26

[charts: service level, waste rate]

cumulative stats · day 36

            avg waste    service level
ml model    9.7%         97.5%
baseline    6.7%         96.7%

latest run · 2026-07-26

            waste (vs avg)     service level (vs avg)
ml model    8.8% (-0.8pp)      96.8% (-0.7pp)
baseline    8.1% (+1.4pp)      96.4% (-0.3pp)

ML lifts service level +0.8pp but adds +2.9pp waste — the trade-off is the point.

how it's built

problem

Retail kitchens waste food when production outpaces demand and miss revenue when they run short. Forecasting the right quantity per item per store at a 15-minute grain, with metrics refreshed continuously, requires a real pipeline, not a spreadsheet. The harder question is honest evaluation: does ML actually earn its complexity cost, and if so, at what trade-off? Modeled after the Kitchen Production System (KPS) at Kwik Trip.

what i built

End-to-end simulation of a Kwik Trip-style Kitchen Production System running live on AWS EC2. A simulator generates POS events for 12 stores using Poisson arrivals, FIFO batch inventory, and slot-boundary production logic, and sends them to an async FastAPI ingest API. A nightly cron at 2am UTC extracts events to Snowflake, runs a three-layer dbt pipeline, and generates per-slot demand predictions across 12 stores × 45 active items × 672 weekly slots (7 days × 96 slots per day) at 15-minute grain. A Streamlit dashboard surfaces split Kitchen and Chicken production queues with 5-minute auto-refresh. A parallel A/B script pits ML against a naive hourly-average baseline under identical seeded demand. Results write to ab_results.json, commit to GitHub, and trigger this portfolio site to rebuild nightly at 3:30am UTC.
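The FIFO batch inventory mentioned above can be illustrated with a minimal sketch. This is not the project's code: the names `Batch`, `FifoInventory`, and `hold_slots` are hypothetical, and it assumes a batch is discarded as waste once it has been held for `hold_slots` 15-minute slots.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Batch:
    cooked_slot: int  # 15-minute slot index when the batch was cooked
    units: int        # units still available from this batch


@dataclass
class FifoInventory:
    hold_slots: int  # assumed: slots a batch may be held before discard
    batches: deque = field(default_factory=deque)
    sold: int = 0
    wasted: int = 0
    stockouts: int = 0

    def cook(self, slot: int, units: int) -> None:
        # New batches go to the right, so the oldest batch is always leftmost.
        self.batches.append(Batch(slot, units))

    def expire(self, slot: int) -> None:
        # Discard every batch that has reached its hold time; count it as waste.
        while self.batches and slot - self.batches[0].cooked_slot >= self.hold_slots:
            self.wasted += self.batches.popleft().units

    def sell(self, slot: int, demand: int) -> None:
        # Serve demand from the oldest batch first; unmet demand is a stockout.
        self.expire(slot)
        remaining = demand
        while remaining > 0 and self.batches:
            oldest = self.batches[0]
            take = min(oldest.units, remaining)
            oldest.units -= take
            remaining -= take
            self.sold += take
            if oldest.units == 0:
                self.batches.popleft()
        self.stockouts += remaining


inv = FifoInventory(hold_slots=4)
inv.cook(0, 6)
inv.cook(2, 6)
inv.sell(1, 4)  # 4 units from the slot-0 batch
inv.sell(5, 3)  # slot-0 batch expires (2 wasted), 3 units from the slot-2 batch
inv.sell(6, 5)  # slot-2 batch expires (3 wasted), 5 units unmet
print(inv.sold, inv.wasted, inv.stockouts)
```

Selling oldest-first keeps waste attributable to a specific cook decision, which is what lets waste and service level be compared between the two arms of an A/B run.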

highlights

→ Discovered and fixed a conditional mean bias bug: the demand profile was computing E[X|X>0] by averaging only days with sales, inflating predictions 3–4x at low-traffic stores — fixed by dividing sum(quantity) by total days including zero-sale days
→ Tracked down a silent training failure: a day_of_week convention mismatch (Snowflake EXTRACT returns Sunday=0 by default; the Monday-first side of the join used Monday=0, as in Python's date.weekday(), not ISO 8601's Monday=1) caused 2.37M training rows to have slot_quantity=0. The model learned a constant near zero until the join condition was corrected
→ Honest A/B finding: ML cuts stockouts ~40% and lifts service level +1.6pp, but adds +3.3pp waste — 4× more production checks create 4× more minimum-batch cook opportunities; a production system would need a cost function to tune the trade-off
→ Per-store schema isolation in Neon for transactional writes; single consolidated Snowflake table for cross-store analytics and model training — same data, two structures, two different jobs
→ Predictions span 12 stores × 45 active items × 672 weekly slots at 15-minute grain, filtered to each item's time-of-day availability window; cold-start fallback for new items with fewer than 4 data points; simulator resumes from Snowflake watermark on restart
→ Slot-boundary production logic: cook decisions fire once per 15-minute slot, look-ahead = hold_time × 4 slots; batch sizes scale with RUSH_CURVE to prevent over-production at 3am and under-production at noon
→ API and simulator deployed as systemd services on EC2 with auto-restart; nightly cron chains extract → dbt → predict → A/B → git push; GitHub Actions rebuilds this site each morning
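The conditional mean bias from the first highlight is easy to reproduce on toy data. The sketch below is illustrative only: the function names and the eight-day sample are made up, not taken from the pipeline.

```python
def mean_over_selling_days(daily_qty):
    """E[X | X > 0]: averages only days that recorded a sale (the biased version)."""
    selling = [q for q in daily_qty if q > 0]
    return sum(selling) / len(selling) if selling else 0.0


def mean_over_all_days(daily_qty):
    """E[X]: total quantity divided by every observed day, zero-sale days included."""
    return sum(daily_qty) / len(daily_qty) if daily_qty else 0.0


# Hypothetical low-traffic item: 2 units sold on 2 of 8 days, nothing otherwise.
daily_qty = [0, 0, 2, 0, 0, 0, 2, 0]

biased = mean_over_selling_days(daily_qty)  # 4 units / 2 selling days
unbiased = mean_over_all_days(daily_qty)    # 4 units / 8 days
print(biased, unbiased, biased / unbiased)
```

In a sales table, zero-sale days usually have no rows at all, so the denominator has to come from a calendar or date spine rather than from a count of rows. That is likely why this kind of bug hits low-traffic stores hardest: they have the most missing days.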

what i learned the hard way

The hardest bugs were silent ones. The model trained for weeks on 2.37M rows where slot_quantity was 0 for every row. A day_of_week convention mismatch between Snowflake's EXTRACT(DAYOFWEEK) (Sunday=0 by default) and a Monday=0 weekday (the convention of Python's date.weekday(); strict ISO 8601 numbers Monday=1) meant the training join never matched. The model learned a constant near zero, and I had no idea until I queried the training data directly. A separate bug inflated predictions 3–4x at low-traffic stores: the demand profile was computing a conditional mean E[X|X>0] by averaging only days with sales, rather than the true expected demand E[X] across all days. Both fixes required understanding the data at a level that unit tests would never catch.
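One way to guard against the weekday mismatch is to convert explicitly at the join boundary and derive the weekly slot from a single convention. The sketch below assumes the Python side uses `date.weekday()` (Monday=0) and that Snowflake is on its default Sunday=0 numbering. The helper names are hypothetical.

```python
from datetime import datetime

SLOTS_PER_HOUR = 4                  # 15-minute grain
SLOTS_PER_DAY = 24 * SLOTS_PER_HOUR  # 96
SLOTS_PER_WEEK = 7 * SLOTS_PER_DAY   # 672


def sunday0_to_monday0(dow: int) -> int:
    """Convert a Sunday=0..Saturday=6 weekday to Monday=0..Sunday=6."""
    return (dow + 6) % 7


def monday0_to_sunday0(dow: int) -> int:
    """Convert a Monday=0..Sunday=6 weekday to Sunday=0..Saturday=6."""
    return (dow + 1) % 7


def weekly_slot(ts: datetime) -> int:
    """Index of the 15-minute slot within the week, using Monday=0."""
    return ts.weekday() * SLOTS_PER_DAY + ts.hour * SLOTS_PER_HOUR + ts.minute // 15


# 2026-07-26 is a Sunday: weekday() gives 6, Sunday=0 numbering gives 0.
ts = datetime(2026, 7, 26, 23, 45)
assert sunday0_to_monday0(0) == ts.weekday()
print(weekly_slot(ts), SLOTS_PER_WEEK)
```

A cheap extra check is to assert after the join that the share of rows with a non-zero target is above some floor, since an all-zero target column is exactly what this mismatch produced.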

stack

Python · FastAPI · PostgreSQL (Neon) · Snowflake · dbt Core · LightGBM · Streamlit · asyncio / httpx · Docker · AWS EC2 · systemd · GitHub Actions · uv