Using Data Science to Make Better Predictions post by John MacAdam

June 21, 2022

Recently I attempted to answer a fairly common question: Given this historical data, when should we expect to reach the next milestone?

One of the companies I support has a product impacting a lot of people. Each day this product reaches roughly 100,000 new users. I was given the past two years of data and asked "when will we reach 300 million users?" My instinct was to answer the question using a spreadsheet. So I started with Excel:

Cleaned up the two years' worth of data.
Figured out the average daily number of new users added (97,956) over the entire data set.
Figured out how many more days it would take, at the current pace, to reach our goal of 300 million users.
The latest number of users reported was 219,291,500 on May 31, 2022.
So, we were going to reach our goal of 300 million users on Aug 31, 2024.
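
The arithmetic behind that date can be reproduced in a few lines of base R. This is only a sketch using the figures quoted above; the variable names are my own, and it assumes the average pace stays constant.

```r
# Figures quoted in the post.
latest_users <- 219291500
latest_date  <- as.Date("2022-05-31")
goal         <- 300000000
avg_per_day  <- 97956

# Days still needed if the average daily pace never changes.
days_needed <- (goal - latest_users) / avg_per_day   # about 823.9

# Rounding down gives the last full day before the goal is crossed;
# rounding up with ceiling() would give the following day instead.
estimate <- latest_date + floor(days_needed)

print(days_needed)
print(estimate)   # 2024-08-31
```

Rounding down lands on Aug 31, 2024, matching the spreadsheet; rounding up would give Sep 1, 2024. Either way, the estimate rests entirely on one constant average.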

I could have stopped here. This guess was likely sufficient. However, I wanted to try answering the same question with a machine learning model, for a couple of reasons:

There is a lot of variability in the data: daily trends, seasonal trends, etc.
I wanted an excuse to build my first machine learning model. This seemed like a good fit.

My first experience building a machine learning model

I spent some time looking around for an approach that would fit this type of data and question. I landed on Prophet, a tool built for forecasting time series data where seasonal effects (yearly, weekly, daily, holidays, etc.) are factored into the trends and predictions. Perfect.

Prophet models can be built in either Python or R. I opted to give R a shot (I have always wanted a reason to try R). So I followed the Prophet R API Documentation:

Installed the R Mac application
Installed the prophet package
Formatted the data with the two required columns: "ds" and "y" (the daily timestamp and the daily number of users to forecast)
Imported a CSV of the data
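
As a rough illustration of the formatting step, here is how a raw export might be reshaped into the two columns Prophet expects. The raw column names (`date`, `total_users`) and all the numbers are invented for the example; only the target names `ds` and `y` come from Prophet.

```r
# Hypothetical raw export: one row per day. Column names and values
# are made up for illustration and are not the real data set.
raw <- data.frame(
  date        = c("2030-01-01", "2030-01-02", "2030-01-03"),
  total_users = c(1000, 1100, 1250)
)

# Prophet looks for exactly these two column names:
#   ds = the date (or timestamp), y = the numeric value to forecast.
df <- data.frame(
  ds = as.Date(raw$date),
  y  = raw$total_users
)

str(df)
```

With a real file, `raw` would instead come from something like `read.csv("daily_users.csv")` (a hypothetical file name), and `df` is what then gets passed to `prophet()` once the package is loaded.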

From this point on I told Prophet what to forecast:

Called the prophet() function to fit the data (I left the default seasonality settings turned on)
Then I asked Prophet to make a future prediction

future <- make_future_dataframe(m, periods = 1461)
forecast <- predict(m, future)

I decided to predict 1461 days, or 4 years, into the future. We should definitely have hit 300 million users by then.

After a few seconds, Prophet apparently simulated over a thousand potential outcomes and settled on its best forecast (fascinating!). Then I asked R for a plot of the forecast.

plot(m, forecast)

The dark blue line is Prophet's best guess at the total number of users impacted over time. The light blue band is the uncertainty interval around that guess (an 80% interval by default), not every possible outcome. The chart is somewhat difficult to read, but it appears to predict 300 million users near the end of 2024. I ran this command to see the raw forecast:

(forecast[c('ds', 'yhat', 'yhat_lower', 'yhat_upper')])
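
Rather than scanning that table by eye, the milestone date can be looked up programmatically. The helper below is a sketch of my own, not part of Prophet; `first_crossing` is a hypothetical name and the toy numbers are invented so the block runs on its own.

```r
# Return the first forecast date whose predicted value reaches the target.
# `forecast` is any data frame with a `ds` date column and a numeric column.
first_crossing <- function(forecast, target, column = "yhat") {
  hits <- which(forecast[[column]] >= target)
  if (length(hits) == 0) {
    return(as.Date(NA))   # target is never reached within the horizon
  }
  as.Date(forecast$ds[hits[1]])
}

# Toy stand-in for Prophet's output; these numbers are invented.
toy <- data.frame(
  ds         = as.Date("2030-01-01") + 0:3,
  yhat       = c(90, 95, 100, 105),
  yhat_lower = c(80, 85, 90, 95),
  yhat_upper = c(100, 105, 110, 115)
)

print(first_crossing(toy, 100))                 # 2030-01-03
print(first_crossing(toy, 100, "yhat_upper"))   # 2030-01-01
print(first_crossing(toy, 100, "yhat_lower"))   # NA
```

On the real forecast the call would be `first_crossing(forecast, 300e6)`. Passing `"yhat_upper"` or `"yhat_lower"` gives an earlier and a later bound on the date, which is a useful reminder that the single date is only the middle of a range.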

So, according to this forecast model, we reach 300 million users on December 24, 2024. Interesting! This tells me the model must have noticed a slight downward trend in daily growth over the past two years. I assume it is projecting this trend to continue into the future. Prophet can show us the seasonal components it picked up in the data with this command:

prophet_plot_components(m, forecast)

The model is definitely picking up differences across the months and days of the week, in addition to the overall trend. At this point, a data scientist could fine-tune the prediction. However, for my purposes, this "out of the box" answer was sufficient.

What is the Answer?

So, back to the original question:

The "brute force" approach predicted 300 million users by September 2024.
The prediction model's best guess was 300 million users by Christmas 2024.
The forecast model estimates it will take almost four months longer than my original estimate. Seeing this difference was eye-opening for me: exactly the type of insight I was hoping for.
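
To put numbers on that gap, here is a small base R sketch using only the dates and totals quoted in this post. The "implied pace" is my own back-of-the-envelope figure (remaining users divided by days until the model's date), not something Prophet reports.

```r
# Dates and totals quoted in the post.
latest_date  <- as.Date("2022-05-31")
brute_force  <- as.Date("2024-08-31")
prophet_date <- as.Date("2024-12-24")
remaining    <- 300000000 - 219291500

# Gap between the two estimates, in days.
gap_days <- as.numeric(prophet_date - brute_force)

# Average daily pace that would land exactly on the model's date.
implied_pace <- remaining / as.numeric(prophet_date - latest_date)

print(gap_days)              # 115
print(round(implied_pace))   # 86043
```

A gap of 115 days is just under four months. Landing on December 24 implies an average of roughly 86,000 new users per day, about 12% below the 97,956 historical average, which is consistent with the model projecting slower growth.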

My Takeaways

Building this forecast model was surprisingly easy. It was nice to find a tool well-suited for the question we were asking.
I appreciate Prophet's approachability. Its authors made it easy to get a prediction without being a data scientist.
R appears to be a powerful tool. Although Python appears better supported, I am glad I used R at least once.
Machine learning is powerful. Data science can help answer a lot of questions, or at least help us identify better questions to ask of our data.