AI Stock Ordering Management System

A retail operator needed a stock-ordering MVP that learned from purchase patterns. We shipped it in four weeks — a deep ML pipeline paired with a Claude interpretation layer — then stayed on to build Phase 2.

Timeline: 4 weeks
Status: Live, in Phase 2
Engagement: Sprint → ongoing
Stack: scikit-learn, TensorFlow, PyTorch, Claude

The problem

A pre-seed retail operator came to us with a thesis and a deadline. They believed their stock-ordering decisions — which lines to reorder, in what quantity, on what cadence — were leaving margin on the table. Their best operators ordered well. Their newest operators ordered badly. The gap between the two was costing them money every week.

The thesis was that the gap was learnable. That a system trained on the patterns in their historical sales and ordering data could close the distance — not by replacing the operators, but by surfacing the decisions a good operator would have made and explaining why. The deadline was the investor meeting six weeks away, where they needed to show that the system worked. Not in a slide deck. As a working product running on real data.

What they didn't have was time to build it the long way. Eighteen months of engineering, a hired ML team, an off-the-shelf inventory platform stitched together with custom logic — none of it fit the runway. They needed a Sprint.

The cut

The scoping call took an hour. Most of it was about what to leave out.

The temptation in a build like this is to ship the cathedral — multi-warehouse logic, complex SKU hierarchies, integrations with three different ERPs, an ML system sophisticated enough to model promotional effects, seasonality, and supply-chain disruption all at once. All of that would have been correct over time. None of it would have shipped in a month.

The cut we agreed on: one warehouse, flat SKU model, a single ERP read connection (no write — recommendations went out as flagged suggestions, not auto-submitted purchases), and a deep ML pipeline scoped to two jobs only — forecasting near-term demand per SKU and classifying SKUs into behavioural cohorts. The Claude interpretation layer would take both signals and produce a ranked recommendation list with operator-readable rationales.
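One way to picture that hand-off is as a small data contract between the two layers. The sketch below is illustrative only: the `SkuSignal` and `Suggestion` names, their fields, and the ranking rule (largest relative gap between forecast and trailing average first) are assumptions made for the example, not the client's schema or the ranking the system actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkuSignal:
    """Hypothetical output of the ML pipeline for one SKU."""
    sku: str
    cohort: str                 # label from the classification model
    forecast_units: float       # near-term demand forecast
    trailing_avg_units: float   # recent average demand, used as a baseline


@dataclass(frozen=True)
class Suggestion:
    """A flagged suggestion for an operator; never an auto-submitted order."""
    sku: str
    cohort: str
    uplift: float               # relative gap between forecast and baseline
    status: str = "flagged"


def rank_suggestions(signals: list[SkuSignal]) -> list[Suggestion]:
    """Turn model signals into suggestions, largest absolute uplift first."""
    suggestions = []
    for s in signals:
        if s.trailing_avg_units <= 0:
            continue  # no baseline to compare against; skipped in this sketch
        uplift = (s.forecast_units - s.trailing_avg_units) / s.trailing_avg_units
        suggestions.append(Suggestion(s.sku, s.cohort, round(uplift, 3)))
    return sorted(suggestions, key=lambda r: abs(r.uplift), reverse=True)


# Made-up sample values, for illustration only.
signals = [
    SkuSignal("SKU-001", "steady", 105.0, 100.0),
    SkuSignal("SKU-002", "spiky", 56.0, 50.0),
    SkuSignal("SKU-003", "steady", 80.0, 100.0),
]
ranked = rank_suggestions(signals)
```

Keeping `status` fixed at "flagged" mirrors the read-only ERP connection: nothing in this shape can submit a purchase on its own.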

What got cut: multi-warehouse, full seasonal modelling (we used a six-month rolling window only), ERP write-back, the CEO's dashboard, three of the four integrations they'd already paid for. All of it was in the Phase 2 scope from day one. None of it belonged in the Sprint.

The hardest part of the scoping call was holding the line on the dashboard. The CEO wanted a dashboard. Dashboards take time. The honest answer was that the dashboard wasn't part of the proof — the proof was whether the recommendations were any good, and that could be tested without a dashboard at all. We shipped a CSV export instead. The CEO didn't love it. Four weeks later, the recommendations were good enough that nobody mentioned the dashboard again.

A second honest point worth naming: this Sprint ran a week longer than the standard two-to-three-week window. The ML pipeline was the reason. Training and evaluating deep models against messy ERP data is real engineering work, not a prompt, and the right call was to give it the time it needed rather than ship something that hadn't been tested. The client knew this from the scoping call. We agreed on four weeks before any work started.

The build

The architecture had two layers: a deep ML pipeline doing the prediction work, and a Claude interpretation layer turning predictions into language an operator could act on. Each layer was scoped to do the thing it was actually good at — and nothing else.

Weeks one and two went to the data pipeline and the deep ML layer. The client's existing ERP export was a CSV mess: denormalised, inconsistent date formats, half the SKUs without categories. The first three days went to scikit-learn pipelines for cleaning, normalisation, feature engineering, and the cohort labelling used downstream. From day four through the end of week one, we used TensorFlow for the pattern classification, grouping SKUs into behavioural cohorts so the forecasting model could specialise rather than average across the whole catalogue. Week two went to PyTorch for the time-series forecasting: a sequence model trained per cohort against the six-month rolling window of the client's sales history. Both models were trained, evaluated against a hold-out set, tuned, and serialised by the end of week two. On data the models hadn't seen, the forecasts were directionally correct on roughly 78% of SKUs. That was good enough to be useful, with the override loop catching the rest.
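"Directionally correct" can be defined in more than one way. The sketch below shows one plausible reading, assuming a forecast counts as a hit when it moves the same way against a trailing baseline as actual demand did. The function name, the tuple layout, and the sample numbers are illustrative; this is not the project's evaluation code.

```python
def directional_accuracy(rows):
    """Share of SKUs whose forecast and actual moved the same way vs. baseline.

    rows: iterable of (baseline, forecast, actual) tuples, one per hold-out SKU.
    """
    def direction(value, baseline):
        # +1 for up, -1 for down, 0 for flat
        return (value > baseline) - (value < baseline)

    rows = list(rows)
    if not rows:
        raise ValueError("need at least one hold-out row")
    hits = sum(direction(f, b) == direction(a, b) for b, f, a in rows)
    return hits / len(rows)


# Made-up hold-out rows, for illustration only.
holdout = [
    (100, 112, 120),  # forecast and actual both up: hit
    (50, 45, 40),     # both down: hit
    (80, 90, 70),     # forecast up, actual down: miss
    (30, 30, 30),     # both flat: hit
]
score = directional_accuracy(holdout)  # 3 hits out of 4 rows
```

A metric like this says nothing about the size of the error, which is one reason to pair it with an operator override loop instead of treating it as a standalone quality bar.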

Week three went to the Claude interpretation layer and the operator-facing UI. This was where the LLM earned its place. Claude received the model outputs — forecast, confidence interval, cohort, recent override history — and produced a single recommendation per SKU with a short, plain-English rationale. The rationale was the actual product. Operators didn't want a number; they wanted to know why. "Forecast is 12% above the trailing average for this cohort, and the last two reorders of this line undershot demand" is the kind of sentence that closes the gap between a model and a decision. The UI was simple: a list of recommendations, each with a confidence score, the rationale, and two buttons — accept or override. Every override fed back into the prompt context for the next decision. By the end of the week, the override rate had dropped from 41% to 19%.
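The feedback loop can be sketched without any model call at all: the model outputs and recent overrides are rendered as text, and that text becomes the context for the next recommendation. Everything named here (`SkuContext`, `build_prompt`, `override_rate`, the field names, the prompt wording) is hypothetical, and the call to the LLM itself is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class SkuContext:
    """Hypothetical per-SKU payload handed to the interpretation layer."""
    sku: str
    cohort: str
    forecast_units: float
    interval: tuple                                 # (low, high) confidence interval
    overrides: list = field(default_factory=list)   # recent operator notes


def build_prompt(ctx: SkuContext, max_overrides: int = 3) -> str:
    """Render model outputs and recent overrides as plain text for the LLM."""
    low, high = ctx.interval
    lines = [
        f"SKU: {ctx.sku}",
        f"Cohort: {ctx.cohort}",
        f"Forecast: {ctx.forecast_units:.0f} units (interval {low:.0f}-{high:.0f})",
        "Recent overrides:",
    ]
    # Keep only the most recent notes so the context stays short.
    recent = ctx.overrides[-max_overrides:] if max_overrides > 0 else []
    lines += [f"- {note}" for note in recent] or ["- none"]
    lines.append("Give one recommendation and a one-sentence rationale.")
    return "\n".join(lines)


def override_rate(decisions: list) -> float:
    """Fraction of decisions where the operator chose 'override'."""
    if not decisions:
        return 0.0
    return decisions.count("override") / len(decisions)


# Made-up sample values, for illustration only.
ctx = SkuContext(
    "SKU-002", "spiky", 56.0, (48.0, 64.0),
    overrides=["ordered 60, forecast undershot demand"],
)
prompt = build_prompt(ctx)
rate = override_rate(["accept", "override", "accept", "accept", "override"])
```

The rendered `prompt` string would be passed as the message content to whichever LLM client is in use. Tracking `override_rate` per week is one simple way to produce a figure like the 41% to 19% drop described above.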

Week four was hardening. Auth, rate limiting, logging, the handover docs, the retraining script (so the deep ML models could be re-fit monthly without engineering involvement), and a single round of polish on the rationale strings — the first-pass outputs were too long and too hedged; we tightened them.

Ship day: Thursday of week four, one day ahead of the call sheet. The system went live with one warehouse and one operator team.

What happened next

The investor meeting went the way investor meetings go when you've shown up with a working product instead of a deck — short, productive, and on the founder's terms.

What we didn't see coming was the speed at which the client wanted to keep building. The Phase 2 scope — multi-warehouse, the dashboard, the public-facing platform layer, additional model heads for promotion and seasonality — moved from "later" to "now" within the first two weeks of the MVP being live. The engagement extended into ongoing backend management, and the build that started as a four-week Sprint is still shipping eighteen months later under the same engineering lead.

The original system is still in production. The forecasting and classification models have been retrained against larger windows and improved twice; the architecture is unchanged. The Claude interpretation layer has been swapped to newer Claude models as they shipped, but the schema and the feedback loop are unchanged. The UI from week three is still what the operator team uses today.

The lesson worth naming: a Sprint isn't a deadline you force a project into. It's a discipline you apply to the project. This one needed four weeks instead of three because the ML pipeline was the hard part and shortcutting it would have shipped a system that hadn't been properly tested. The honest version of the Sprint is the one that says so on the scoping call, not the one that promises three weeks and slips into eight. The proof at the end of week four was real because nothing in the build had been rushed past the point where it would hold up.

“The MVP wasn't the goal. The MVP was the proof that the goal was worth building.”

Have a build that looks like this?

Book a scoping call and we'll work out whether a Sprint is the right cut.

30 minutes. No pitch.