Beyond Guesswork—How Time Series Forecasting Sharpens Book Sales Predictions

Jan 11, 2025

Have you ever wondered how publishers decide how many copies of a new book to print or reorder? It may look like guesswork, but behind the scenes there’s a whole lot of number-crunching magic called time series forecasting. Join me on a quick journey through how I turned raw weekly sales data for two fan-favorite books—The Alchemist and The Very Hungry Caterpillar—into data-driven insights any publisher can use for smarter stocking and marketing.


1. Cracking Open the Data

Before I could forecast sales, I needed a clean and complete sales record. My starting point was two Excel files:

Metadata (ISBN, author, publication date, etc.)
Weekly Sales (units sold each week)

GitHub Notebook

Step 1: Merge and Resample

Merge: I combined the weekly sales file with the metadata, matching each sales record to the correct ISBN.
Resample: Weeks with zero sales were simply absent from the raw file, so I resampled to a weekly frequency and filled the missing weeks with zero units. Each book now had a neat, continuous timeline from 2001 to mid-2024, with no gaps to guess around.
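Here is a minimal sketch of the merge-and-resample step using pandas. The column names (`ISBN`, `Author`, `End Date`, `Volume`), the toy values, and the Saturday week-ending convention are assumptions for illustration, not the actual layout of the Excel files.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel files; real column names may differ.
metadata = pd.DataFrame({
    "ISBN": ["111", "222"],
    "Author": ["Author A", "Author B"],
})
sales = pd.DataFrame({
    "ISBN": ["111", "111", "222"],
    "End Date": pd.to_datetime(["2024-01-06", "2024-01-20", "2024-01-13"]),
    "Volume": [10, 7, 3],
})

# Merge: attach the metadata to every weekly sales row via the ISBN.
merged = sales.merge(metadata, on="ISBN", how="left")

# Resample: put each title on a continuous weekly grid (weeks ending Saturday).
# Summing an empty weekly bin yields 0, which fills the unrecorded weeks.
weekly_sales = (
    merged.set_index("End Date")
    .groupby(["ISBN", "Author"])["Volume"]
    .resample("W-SAT")
    .sum()
    .reset_index()
)
print(weekly_sales)
```

In this toy data, title 111 has no row for the week ending 2024-01-13, so the resample creates that week with a volume of zero. Note that each title is only filled between its own first and last recorded week.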

Step 2: Clean Up and Impute

Negative Volumes? Clipped them to zero.
Missing Retail Prices? Filled them with the median retail price.
Categorical Nulls? Replaced them with the most frequent category.
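The three cleaning rules above map onto one pandas call each. This is a sketch under assumptions: the column names `RRP` and `Binding` are hypothetical placeholders for the price and categorical columns, and the median here is taken over the whole column.

```python
import numpy as np
import pandas as pd

# Toy frame with one negative volume, one missing price, one missing category.
df = pd.DataFrame({
    "Volume": [12, -3, 5, 0],
    "RRP": [7.99, np.nan, 9.99, 8.99],                        # hypothetical price column
    "Binding": ["Paperback", None, "Paperback", "Hardback"],  # hypothetical categorical
})

# Negative volumes (for example, returns) are clipped to zero.
df["Volume"] = df["Volume"].clip(lower=0)

# Missing prices are filled with the column median.
df["RRP"] = df["RRP"].fillna(df["RRP"].median())

# Missing categories are filled with the most frequent value (the mode).
df["Binding"] = df["Binding"].fillna(df["Binding"].mode().iloc[0])

print(df)
```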

At this point, I had a final, tidy DataFrame that effectively showed how many copies sold each week, plus all relevant metadata for each title.

2. Diving into Time Series Analysis

Armed with a squeaky-clean dataset, I got curious: do these books show any interesting trends or seasonal booms? Answering that meant exploring patterns in the data:

Weekly Sales Trends:
- Were certain years unexpectedly high?
- Did some books spike in December each year?
Stationarity Checks:
- I ran the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is that the series has a unit root, to see whether differencing was required. The test indicated that both The Alchemist and The Very Hungry Caterpillar were stationary enough to model without it.
Seasonal Clues (ACF/PACF):
- Autocorrelation plots showed strong correlation at seasonal lags, particularly around lag 52, which matches a one-year cycle in weekly data.
- As you might expect, The Very Hungry Caterpillar soared at certain times each year (back-to-school, perhaps?), while The Alchemist had modest but still noticeable seasonal peaks.

3. Focus on Two Beloved Titles

I zeroed in on two specific books from 2012 onward for deeper forecasting:

The Alchemist (a philosophical mainstay)
The Very Hungry Caterpillar (a children’s classic with a history of big seasonal spikes)

Why 2012? By then, the data was more stable—massive peaks in 2001–2002 overshadowed everything else, so looking from 2012 gave a clearer view.

Models in Action

1) ARIMA and SARIMA

These are “classical” time series models. ARIMA captures trend (long-term direction) and short-term autocorrelation; SARIMA adds seasonal terms to handle repeating yearly cycles.
Auto ARIMA automated the search for the best-fitting model orders for each title.

2) XGBoost

From machine learning land, this gradient-boosted tree algorithm loves engineered features like Volume_lag1, rolling means, and date-based features (e.g., month).
It can handle non-linear quirks (like sudden holiday surges) as long as the features hint at them.
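The feature engineering described above might look like the sketch below. Only `Volume_lag1` is named in the analysis; the other feature names, the 4-week window, and the 52-week lag are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def build_features(weekly: pd.Series) -> pd.DataFrame:
    """Turn a weekly sales series (DatetimeIndex) into a supervised-learning table."""
    frame = pd.DataFrame({"Volume": weekly})
    frame["Volume_lag1"] = frame["Volume"].shift(1)    # last week's sales
    frame["Volume_lag52"] = frame["Volume"].shift(52)  # same week, one year earlier
    # Shift before rolling so the window only sees past weeks (no target leakage).
    frame["rolling_mean_4"] = frame["Volume"].shift(1).rolling(4).mean()
    frame["month"] = frame.index.month                 # date-based feature
    # Rows without a full history contain NaN and are dropped.
    return frame.dropna()

# Toy series: 60 weeks of steadily increasing sales.
weeks = pd.date_range("2012-01-07", periods=60, freq="W-SAT")
series = pd.Series(np.arange(60, dtype=float), index=weeks)
features = build_features(series)
print(features.head())
```

The resulting table (target column `Volume`, everything else as inputs) could then be passed to a gradient-boosted regressor such as `xgboost.XGBRegressor`.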

3) LSTM

A type of recurrent neural network that “remembers” patterns over sequences.
Potentially great at complicated patterns, but can also be prone to over-smoothing if not tuned well.
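Before an LSTM can "remember" anything, the series has to be cut into fixed-length input windows, each paired with the value that follows it. The sketch below shows that windowing step in NumPy; the function name and the lookback length are hypothetical, and the network itself is left out.

```python
import numpy as np

def make_windows(values, lookback):
    """Slice a 1-D series into (samples, timesteps, features) inputs and next-step targets."""
    values = np.asarray(values, dtype=float)
    inputs = np.stack([values[i:i + lookback] for i in range(len(values) - lookback)])
    targets = values[lookback:]
    # Add a trailing feature axis: recurrent layers typically expect 3-D input.
    return inputs[..., np.newaxis], targets

# Toy example: 10 weeks of sales, using the previous 3 weeks to predict the next.
X, y = make_windows(np.arange(10), lookback=3)
print(X.shape, y.shape)
```

Each row of `X` holds three consecutive weeks and the matching entry of `y` is the week right after them. In practice the values are usually scaled (for example to the 0 to 1 range) before training.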

4) Hybrid Approaches

Combining SARIMA with LSTM or XGBoost can sometimes yield incremental boosts by letting SARIMA handle the big seasonal shape and ML handle more random fluctuations.
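The residual-based idea can be sketched without any heavy libraries. To keep it self-contained, the code below swaps in deliberately simple stand-ins: a seasonal-naive forecast plays the role of SARIMA, and a one-parameter autoregression on the residuals plays the role of the LSTM or XGBoost model. It shows the structure of the hybrid, not the models actually used.

```python
import numpy as np

SEASON = 52  # weeks per seasonal cycle

def seasonal_naive(history, horizon):
    """Stand-in for SARIMA: repeat the value observed one season earlier."""
    history = np.asarray(history, dtype=float)
    start = len(history) - SEASON
    return np.array([history[start + (h % SEASON)] for h in range(horizon)])

def hybrid_forecast(history, horizon):
    """Seasonal baseline plus a forecast of what the baseline got wrong."""
    history = np.asarray(history, dtype=float)
    # In-sample residuals of the seasonal baseline: actual minus value a year ago.
    residuals = history[SEASON:] - history[:-SEASON]
    # Stand-in for the ML model: least-squares AR(1) fit on the residuals.
    denom = float(residuals[:-1] @ residuals[:-1])
    phi = float(residuals[1:] @ residuals[:-1]) / denom if denom > 0 else 0.0
    residual_forecast = residuals[-1] * phi ** np.arange(1, horizon + 1)
    return seasonal_naive(history, horizon) + residual_forecast

# A steadily growing series: the baseline alone lags a year behind,
# and the residual model corrects for the growth.
history = np.arange(156, dtype=float)
print(seasonal_naive(history, 3), hybrid_forecast(history, 3))
```

The design point is the split of responsibilities: the first model captures the repeating seasonal shape, and the second model only has to learn the leftover error.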


4. Key Findings & Insights

A) Seasonal Differences

The Alchemist
Maintained a moderate, somewhat fluctuating path with smaller seasonal peaks. Sales dipped around 2020 (hello, pandemic disruptions) but recovered somewhat.
The Very Hungry Caterpillar
Showed much bigger cyclical surges, presumably tied to holiday gifts, academic calendars, or marketing pushes. Also dipped in 2020 but sprang back strong soon after.

B) Weekly vs. Monthly Granularity

Weekly:
- Grabs short-term sales jolts (like a surprise classroom adoption).
- More detail, can be noisier.
Monthly:
- Smooths random spikes, offering a simpler, big-picture view.
- Great for longer-term printing/budget decisions, but might miss sudden sales bursts.
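Switching between the two granularities is a one-line resample in pandas. In this sketch the weekly values are made up, and each week is assigned to the month in which it ends, which is an assumption that matters for weeks straddling a month boundary.

```python
import pandas as pd

# Toy weekly sales: 8 weeks ending on Saturdays across January and February 2024.
weeks = pd.date_range("2024-01-06", periods=8, freq="W-SAT")
weekly = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=weeks)

# Monthly totals, labelled by the first day of each month ("MS" = month start).
monthly = weekly.resample("MS").sum()
print(monthly)
```

The same models can then be refit on `monthly` with a seasonal period of 12 instead of 52.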

C) Which Model Reigned Supreme?

SARIMA often outperformed plain ARIMA, especially once I specified the yearly seasonal cycle.
XGBoost sometimes handled quick changes better, but it needed a robust set of features.
LSTM was powerful in theory but demanded extensive tuning to keep from missing big peaks.
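Comparing models like this needs a time-ordered holdout and an error metric. The sketch below uses mean absolute error (MAE) and root mean squared error (RMSE); both metrics, the helper names, and the toy forecasts are assumptions, since the analysis does not say which metrics were used.

```python
import numpy as np

def split_by_time(series, horizon):
    """Hold out the last `horizon` points; never shuffle a time series."""
    series = np.asarray(series, dtype=float)
    return series[:-horizon], series[-horizon:]

def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in units sold."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))

def rmse(actual, predicted):
    """Root mean squared error: penalises large misses (such as missed peaks) more."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy comparison of two hypothetical forecasts against the same holdout weeks.
train, test = split_by_time([100, 120, 90, 110, 300, 105], horizon=3)
smooth_forecast = [110, 110, 110]  # misses the peak entirely
peaky_forecast = [115, 260, 100]   # roughly catches the peak
for name, forecast in [("smooth", smooth_forecast), ("peaky", peaky_forecast)]:
    print(name, round(mae(test, forecast), 1), round(rmse(test, forecast), 1))
```

A model that smooths over big peaks tends to look worse on RMSE than on MAE, which is one way to spot the peak-missing behaviour described for the LSTM.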

D) Hybrid Gains

In some runs, mixing SARIMA’s seasonal skill with a residual-based LSTM (or XGBoost) shaved off a bit more error—especially in Caterpillar’s strongly seasonal case.
Hybrid results weren’t massive leaps but pointed to more potential if I integrated external data (like marketing campaign dates, holiday events).

5. Final Thoughts & Future Steps

1) Seasonality is King
Both titles revolve around that roughly 52-week cycle—publishers can use this knowledge to plan reprints right before demand spikes.

2) Granularity Choice
If you’re managing short-term promotions or you see abrupt weekly surges, weekly forecasting is your friend. If it’s big-picture planning you’re after, monthly data might be enough.

3) Hybrid Approaches
Even modest improvements might be worth the extra modeling complexity. If you can add marketing or holiday flags, you’ll likely see bigger gains.

4) Keep Experimenting
Try advanced neural networks (e.g., Transformers) or deeper ML pipelines with more exogenous features. The best solutions often come from layering domain knowledge—like a holiday code that triggers an extra 20% to 40% sales bump at year-end.
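A holiday code can start as nothing more than a 0/1 column aligned with the weekly index. The sketch below flags weeks ending in December; the column name and the choice of December as the "holiday" window are assumptions for illustration.

```python
import pandas as pd

# Ten consecutive weeks (ending on Saturdays) spanning the end of 2023.
weeks = pd.date_range("2023-11-04", periods=10, freq="W-SAT")

# Hypothetical exogenous feature: 1 for weeks ending in December, else 0.
flags = pd.DataFrame(index=weeks)
flags["is_year_end"] = (weeks.month == 12).astype(int)
print(flags)
```

A column like this can be added to the feature table for a tree-based model or supplied through the `exog` argument of statsmodels' `SARIMAX`, letting the model estimate the size of the year-end bump from the data.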

Ready to Revolutionize Your Warehouse?

Time series forecasting won’t let you literally time-travel, but it can help publishers cut down on guesswork and harness data for real decisions—like how many copies to print, when to promote, and what to expect in a future where The Very Hungry Caterpillar might keep on munching sales each spring.

So, whether you’re a data enthusiast or a curious publisher, let these results inspire you to explore your own sales data, test different granularities, and see how advanced modeling can transform your supply chain. After all, being able to predict demand is almost like seeing tomorrow’s bestsellers—today.

Sheldon’s Substack
