Using Walmart weekly sales data to predict future store sales

This is a Kaggle competition held to predict future weekly sales for each individual department in each Walmart store. https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting

Time series models have many use cases, such as predicting weather conditions, stock movements, and future sales. Here I tried various methods and finally opted for Random Forest, as the time-series models didn't perform well.

Problem Statement:

One challenge of modelling retail data is the need to make decisions based on limited history. If Christmas comes but once a year, so does the chance to see how strategic decisions impacted the bottom line.

In this recruiting competition, job-seekers are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must project the sales for each department in each store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is challenging to predict which departments are affected and the extent of the impact.

  • Predict future sales for each store and department.
  • This could help in stocking the necessary goods in advance.

Business Constraints:

  • No low latency constraint.
  • Predictions should be for each individual department.

Data Overview:

Kaggle has provided a total of 4 datasets:

STORES.CSV

This contains information about the 45 stores such as type and size of the store.


Store: Store Number

Type: Type of store

Size: Size of each store

TRAIN.CSV

We have Store, Department, Weekly Sales, Date and Holiday information.

Store: Store number (1–45).

Department: Department number (1–99).

Date: Date of the week

Weekly Sales: This is the target data and is the total sales for the week for a given department and store.

IsHoliday: A boolean variable indicating whether the week contains a holiday.

FEATURES.CSV

This has additional information about the store, department and the region for the given dates.


Store: This is the store Number (1–45)

Date: Date of the record.

Temperature: Temperature in the region where the store is located.

Fuel Price: Fuel price in the region where the store is located.

MarkDown1–5: Data on the promotional markdowns run by Walmart in the weeks preceding the holidays. This information is only available after November 2011, is not available for all stores at all times, and missing values are marked as NA.

CPI: The Consumer Price Index.

Unemployment: The unemployment rate in the region where the store is located.

IsHoliday: Indicates whether the week is a special Holiday week.

TEST.CSV


This is similar to the training dataset except that Weekly_Sales is withheld, since that is the target variable to predict. The inputs are Store, Dept, Date, and IsHoliday.

In addition to the data above, we are given the weekend dates of the holidays:

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Preparing Train and Test Data

Merge the train, features, and stores data on the store number and the date.

The same merge is applied to the test data.
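A minimal sketch of this merge with pandas (the tiny frames below are illustrative stand-ins for the real CSVs; note that IsHoliday appears in both train.csv and features.csv, so one copy is dropped before joining):

```python
import pandas as pd

# Tiny illustrative frames standing in for train.csv, features.csv, stores.csv
train = pd.DataFrame({"Store": [1, 1], "Dept": [1, 2],
                      "Date": ["2010-02-05", "2010-02-05"],
                      "Weekly_Sales": [24924.5, 50605.27],
                      "IsHoliday": [False, False]})
features = pd.DataFrame({"Store": [1], "Date": ["2010-02-05"],
                         "Temperature": [42.31], "Fuel_Price": [2.572],
                         "IsHoliday": [False]})
stores = pd.DataFrame({"Store": [1], "Type": ["A"], "Size": [151315]})

# Join features on (Store, Date), then store metadata on Store alone.
def merge_all(df, features, stores):
    out = df.merge(features.drop(columns=["IsHoliday"]),
                   on=["Store", "Date"], how="left")
    return out.merge(stores, on="Store", how="left")

train_full = merge_all(train, features, stores)
```

The same `merge_all` call works unchanged for the test data, since it shares the Store and Date keys.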

EDA and Feature Engineering

  • Bar plot for the count of Holidays
  • Sales on Holidays and Non-Holidays

Sales on Holiday weekends are ~8% higher than on non-Holiday weeks.

  • Weekly sales w.r.t each Department

Dept 92 has the highest recorded weekly sales at $483 million, followed by Departments 95 and 38.

A few departments, such as 43 and 47, have very low weekly sales.

  • The proportion of store types

Stores are partitioned into 3 types: A, B, and C.

Type A stores make up the largest share, at 51% of the stores.

Type B stores are the second-largest group, at 38%.

  • Weekly Sales w.r.t. types of stores
  1. Weekly sales are roughly proportional to the count of each store type.
  2. Type A stores are more numerous, and weekly sales are also higher in these stores.
  3. Stores with size >150k can be classified as type A, stores with size >62k and <140k as type B, and the rest as type C.
  • Distribution of Weekly_Sales

About 85% of the sales fall in the 10k range. The distribution resembles a power law, as almost 80% of the sales lie in the first 20% of the distribution.

  • Data correlation
Pearson coefficients

Though the correlations are weak, it is clear that an increase in the unemployment index leads to less shopping, and the same holds for Fuel_Price.

Temperature also affects sales: both low and high extremes can reduce weekly sales.

Fixing Missing Values:

Missing values in the 'MarkDown' columns mean no markdown ran, so we can safely fill all of them with zero.

The test data also has null values in CPI and Unemployment; these are imputed with the mean of the department.
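The two imputation rules can be sketched as follows (the small frame and its values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Example frame with the kinds of gaps described above (values illustrative)
df = pd.DataFrame({"Dept": [1, 1, 2, 2],
                   "MarkDown1": [np.nan, 500.0, np.nan, np.nan],
                   "CPI": [211.1, np.nan, 215.3, np.nan]})

# Markdown gaps mean "no promotion ran", so zero is a safe fill.
markdown_cols = [c for c in df.columns if c.startswith("MarkDown")]
df[markdown_cols] = df[markdown_cols].fillna(0)

# CPI (and likewise Unemployment) gaps are filled with the department mean.
df["CPI"] = df["CPI"].fillna(df.groupby("Dept")["CPI"].transform("mean"))
```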

  • Feature engineering for the date

Additionally, a week-of-the-month feature is added.
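A sketch of the date-derived features, assuming the merged frame's Date column; the week-of-month rule here (7-day slots) is one reasonable choice, not necessarily the author's exact formula:

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2010-02-05", "2010-11-26", "2010-12-31"]})
df["Date"] = pd.to_datetime(df["Date"])

df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)
# Week of the month: which 7-day slot of the month the date falls in
df["WeekOfMonth"] = (df["Date"].dt.day - 1) // 7 + 1
```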

  • Weekly sales w.r.t week of the year

Sales in the 51st and 47th weeks are significantly higher, because of Christmas and Thanksgiving respectively.

  • Temperature effect on sales w.r.t. each month

Observation:

1. The sharpest decrease in sales occurs between the 8th and 9th months, which coincides with the fall in temperature.
2. As the temperature increases, sales generally increase; however, in months 11 and 12 sales increase despite the temperature decreasing, because of the holidays.
  • Fuel_Price effect on sales w.r.t. each month

Observation:

1. Fuel price ranges from 2.6–4.25 while weekly sales stay constant below 200k.
2. Average sales are high when Fuel_Price is less than 3.25.
3. During May the fuel price is high, and sales dropped accordingly.
4. Sales decrease by 3.84% between the 2nd and 3rd months, possibly because of the increase in fuel price and temperature.
  • Adding special days as a boolean feature. These dates are specified on the Kaggle page.
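The flag can be built from the holiday dates listed earlier; a minimal sketch (the two-row frame is illustrative):

```python
import pandas as pd

# Holiday weekend dates listed on the Kaggle competition page
special_days = pd.to_datetime([
    # Super Bowl
    "2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08",
    # Labor Day
    "2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06",
    # Thanksgiving
    "2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29",
    # Christmas
    "2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27",
])

df = pd.DataFrame({"Date": pd.to_datetime(["2010-11-26", "2010-11-19"])})
# 1 if the week's date is one of the special holiday weekends, else 0
df["SpecialDay"] = df["Date"].isin(special_days).astype(int)
```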

Let’s check the average sales on special Holidays and non-special Holidays.

Average sales during the special holidays are ~10% higher than the average sales over the whole year.

Let’s check the average sales on Markdown days ( without considering the special Holidays) and normal days.

Average sales during Markdown days (excluding the special holidays) are slightly higher than on non-Markdown days.

Let’s check the average sales on each special Holiday.

1. Average sales on Thanksgiving are significantly higher.
2. They are 50% higher than the sales made on Christmas.
3. They are 40% higher than the sales made on a non-Holiday weekend.
4. Average sales on Christmas are 10% lower than on a non-Holiday day.

Checking which department got the highest sales during Thanksgiving, as this may help in stocking up more quantity for that particular department.

Checking which store got the highest sales during Thanksgiving, as this may help in stocking up more quantity in that particular store.

Final Observation:

1. Sales on Holiday weekends are 8.2% higher than on non-Holiday weekends.
2. Sales on the days before the special holidays are 120% higher than on Holiday weekends, despite being non-Holiday weekends.
3. Department 92 has the highest sales, while 43 and 47 have the lowest. This may be because Department 92 stocks necessities while Departments 43 and 47 carry rarely used stock; these departments are also large in size.
4. Stores are of 3 types, among which type A has the highest sales and also the highest proportion of stores.
5. There is a 2.4% increase in sales in weeks 1 and 4 compared to other weeks.
6. There is a 60% increase in sales in the last week of the year.
7. On Thanksgiving the average sales are 80% higher, and 50% higher than on Christmas.
8. As fuel price increased, sales decreased by 3.84% in months 2 and 3; temperature also increased during this period.
9. There is a 15.96% increase in sales in November and December.
10. There is a 4.79% increase in sales at the start of the month compared to the end of the month.

Machine Learning Models:

Final features that are used to train our model to predict the weekly_sales are as follows:

  • Store — the store number
  • Dept — the department number
  • Month: The month as January=1, December=12.
  • Year: The year of the DateTime.
  • Day: The days of the DateTime.
  • Temperature: the average temperature in the region
  • IsHoliday: 1 if the week is a holiday, else 0
  • Size: the size of the store
  • Types: Types of store, A = 1, B = 2, C = 3
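The categorical encodings for Type and IsHoliday listed above amount to a simple mapping; a sketch (the three-row frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Type": ["A", "B", "C"],
                   "IsHoliday": [True, False, False]})

# Encode store type as A=1, B=2, C=3 and the holiday flag as 1/0
df["Type"] = df["Type"].map({"A": 1, "B": 2, "C": 3})
df["IsHoliday"] = df["IsHoliday"].astype(int)
```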

Metric:

The evaluation metric used by Kaggle is the weighted mean absolute error (WMAE), which is not available in sklearn, so we define it as a custom function to measure the error rate. The weight on holiday weeks is 5 and on other weeks it is 1.
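The metric can be written directly from that definition:

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted mean absolute error: holiday weeks weigh 5, other weeks 1."""
    weights = np.where(is_holiday, 5, 1)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# Quick check on toy numbers
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 190.0])
is_holiday = np.array([True, False])
print(wmae(y_true, y_pred, is_holiday))  # (5*10 + 1*10) / 6 = 10.0
```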

Various Models Tried with Hyperparameter Tuning:

  • Linear Regression
  • Decision Trees
  • Random Forest
  • XGBoost
  • AdaBoost
  • Ensembling 20 Models
  • Prophet
  • LSTM

The final model that gave the best WMAE score is the Random Forest. Approaches for the various models can be found in the GitHub link given at the bottom.

As WMAE is not a predefined metric in sklearn, we cannot use GridSearch/RandomSearch here. Hence I built functions that take model parameters and display the WMAE score for each combination. I defined 3 such functions, each carrying forward the best parameters from the previous stage and adding new ones.
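A sketch of one such tuning function (the toy data, the function name, and the uniform validation weights are assumptions; the real code fits on the merged Walmart features and uses weight 5 for holiday weeks):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def wmae(y_true, y_pred, weights):
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

# Toy regression data standing in for the merged Walmart features
X, y = make_regression(n_samples=300, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
w_val = np.ones(len(y_val))  # real code: 5 for holiday weeks, 1 otherwise

def random_forest_grid(n_estimators_list, max_depth_list):
    results = {}
    for n in n_estimators_list:
        for d in max_depth_list:
            model = RandomForestRegressor(n_estimators=n, max_depth=d,
                                          n_jobs=-1, random_state=0)
            model.fit(X_tr, y_tr)
            score = wmae(y_val, model.predict(X_val), w_val)
            results[(n, d)] = score
            print(f"n_estimators: {n} , max_depth: {d} WMAE: [{score:.2f}]")
    return results

results = random_forest_grid([30, 50], [10, 20])
```

The output below shows this kind of sweep on the real data, first over n_estimators and max_depth, then over max_features, and finally over the split/leaf parameters.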

n_estimators=[30,50,60,100,120]
max_depth=[30,35,40,45,50,60]

results=Random_forest1(n_estimators, max_depth)
n_estimators: 30 , max_depth: 30 WMAE: [1549.86]
n_estimators: 30 , max_depth: 35 WMAE: [1549.41]
n_estimators: 30 , max_depth: 40 WMAE: [1555.01]
n_estimators: 30 , max_depth: 45 WMAE: [1550.41]
n_estimators: 30 , max_depth: 50 WMAE: [1544.98]
n_estimators: 30 , max_depth: 60 WMAE: [1543.29]
n_estimators: 50 , max_depth: 30 WMAE: [1536.07]
n_estimators: 50 , max_depth: 35 WMAE: [1543.01]
n_estimators: 50 , max_depth: 40 WMAE: [1531.78]
n_estimators: 50 , max_depth: 45 WMAE: [1524.89]
n_estimators: 50 , max_depth: 50 WMAE: [1529.97]
n_estimators: 50 , max_depth: 60 WMAE: [1530.89]
n_estimators: 60 , max_depth: 30 WMAE: [1528.65]
n_estimators: 60 , max_depth: 35 WMAE: [1531.68]
n_estimators: 60 , max_depth: 40 WMAE: [1531.23]
n_estimators: 60 , max_depth: 45 WMAE: [1537.37]
n_estimators: 60 , max_depth: 50 WMAE: [1533.95]
n_estimators: 60 , max_depth: 60 WMAE: [1530.12]
n_estimators: 100 , max_depth: 30 WMAE: [1524.09]
n_estimators: 100 , max_depth: 35 WMAE: [1524.64]
n_estimators: 100 , max_depth: 40 WMAE: [1528.23]
n_estimators: 100 , max_depth: 45 WMAE: [1520.34]
n_estimators: 100 , max_depth: 50 WMAE: [1526.79]
n_estimators: 100 , max_depth: 60 WMAE: [1522.52]
n_estimators: 120 , max_depth: 30 WMAE: [1527.17]
n_estimators: 120 , max_depth: 35 WMAE: [1522.11]
n_estimators: 120 , max_depth: 40 WMAE: [1521.14]
n_estimators: 120 , max_depth: 45 WMAE: [1520.84]
n_estimators: 120 , max_depth: 50 WMAE: [1519.13]
n_estimators: 120 , max_depth: 60 WMAE: [1525.44]
max_features=[2,3,4,5,6,7,8,9]
n_estimators=[100,120]
max_depth=[40,45,50]

result=Random_forest2(n_estimators, max_depth,max_features)
n_estimators: 100 max_depth: 40 max_features: 2 WMAE: [7480.03]
n_estimators: 100 max_depth: 40 max_features: 3 WMAE: [4455.51]
n_estimators: 100 max_depth: 40 max_features: 4 WMAE: [2584.77]
n_estimators: 100 max_depth: 40 max_features: 5 WMAE: [1763.6]
n_estimators: 100 max_depth: 40 max_features: 6 WMAE: [1547.47]
n_estimators: 100 max_depth: 40 max_features: 7 WMAE: [1506.7]
n_estimators: 100 max_depth: 40 max_features: 8 WMAE: [1505.69]
n_estimators: 100 max_depth: 40 max_features: 9 WMAE: [1530.56]
n_estimators: 100 max_depth: 45 max_features: 2 WMAE: [7510.21]
n_estimators: 100 max_depth: 45 max_features: 3 WMAE: [4525.25]
n_estimators: 100 max_depth: 45 max_features: 4 WMAE: [2453.12]
n_estimators: 100 max_depth: 45 max_features: 5 WMAE: [1766.06]
n_estimators: 100 max_depth: 45 max_features: 6 WMAE: [1551.55]
n_estimators: 100 max_depth: 45 max_features: 7 WMAE: [1504.97]
n_estimators: 100 max_depth: 45 max_features: 8 WMAE: [1512.15]
n_estimators: 100 max_depth: 45 max_features: 9 WMAE: [1526.11]
n_estimators: 100 max_depth: 50 max_features: 2 WMAE: [7566.87]
n_estimators: 100 max_depth: 50 max_features: 3 WMAE: [4574.81]
n_estimators: 100 max_depth: 50 max_features: 4 WMAE: [2501.6]
n_estimators: 100 max_depth: 50 max_features: 5 WMAE: [1755.38]
n_estimators: 100 max_depth: 50 max_features: 6 WMAE: [1553.67]
n_estimators: 100 max_depth: 50 max_features: 7 WMAE: [1501.36]
n_estimators: 100 max_depth: 50 max_features: 8 WMAE: [1504.05]
n_estimators: 100 max_depth: 50 max_features: 9 WMAE: [1529.64]
n_estimators: 120 max_depth: 40 max_features: 2 WMAE: [7440.69]
n_estimators: 120 max_depth: 40 max_features: 3 WMAE: [4400.97]
n_estimators: 120 max_depth: 40 max_features: 4 WMAE: [2463.2]
n_estimators: 120 max_depth: 40 max_features: 5 WMAE: [1772.52]
n_estimators: 120 max_depth: 40 max_features: 6 WMAE: [1543.76]
n_estimators: 120 max_depth: 40 max_features: 7 WMAE: [1507.95]
n_estimators: 120 max_depth: 40 max_features: 8 WMAE: [1501.5]
n_estimators: 120 max_depth: 40 max_features: 9 WMAE: [1522.04]
n_estimators: 120 max_depth: 45 max_features: 2 WMAE: [7613.1]
n_estimators: 120 max_depth: 45 max_features: 3 WMAE: [4520.84]
n_estimators: 120 max_depth: 45 max_features: 4 WMAE: [2525.4]
n_estimators: 120 max_depth: 45 max_features: 5 WMAE: [1767.55]
n_estimators: 120 max_depth: 45 max_features: 6 WMAE: [1547.54]
n_estimators: 120 max_depth: 45 max_features: 7 WMAE: [1497.41]
n_estimators: 120 max_depth: 45 max_features: 8 WMAE: [1498.43]
n_estimators: 120 max_depth: 45 max_features: 9 WMAE: [1527.99]
n_estimators: 120 max_depth: 50 max_features: 2 WMAE: [7475.68]
n_estimators: 120 max_depth: 50 max_features: 3 WMAE: [4370.49]
n_estimators: 120 max_depth: 50 max_features: 4 WMAE: [2533.79]
n_estimators: 120 max_depth: 50 max_features: 5 WMAE: [1736.54]
n_estimators: 120 max_depth: 50 max_features: 6 WMAE: [1543.5]
n_estimators: 120 max_depth: 50 max_features: 7 WMAE: [1507.1]
n_estimators: 120 max_depth: 50 max_features: 8 WMAE: [1500.59]
n_estimators: 120 max_depth: 50 max_features: 9 WMAE: [1523.84]
max_features=[7,8]
min_samples_split=[2,5,7]
min_samples_leaf=[1,2,3]

result=random_forest_3(120,45,max_features,min_samples_split, min_samples_leaf)
Max_features: 7 min_samples_split: 2 , min_samples_leaf: 1 WMAE: [1503.8]
Max_features: 7 min_samples_split: 2 , min_samples_leaf: 2 WMAE: [1544.05]
Max_features: 7 min_samples_split: 2 , min_samples_leaf: 3 WMAE: [1588.4]
Max_features: 7 min_samples_split: 5 , min_samples_leaf: 1 WMAE: [1536.95]
Max_features: 7 min_samples_split: 5 , min_samples_leaf: 2 WMAE: [1556.44]
Max_features: 7 min_samples_split: 5 , min_samples_leaf: 3 WMAE: [1586.39]
Max_features: 7 min_samples_split: 7 , min_samples_leaf: 1 WMAE: [1571.91]
Max_features: 7 min_samples_split: 7 , min_samples_leaf: 2 WMAE: [1580.16]
Max_features: 7 min_samples_split: 7 , min_samples_leaf: 3 WMAE: [1600.62]
Max_features: 8 min_samples_split: 2 , min_samples_leaf: 1 WMAE: [1502.43]
Max_features: 8 min_samples_split: 2 , min_samples_leaf: 2 WMAE: [1541.59]
Max_features: 8 min_samples_split: 2 , min_samples_leaf: 3 WMAE: [1581.13]
Max_features: 8 min_samples_split: 5 , min_samples_leaf: 1 WMAE: [1529.84]
Max_features: 8 min_samples_split: 5 , min_samples_leaf: 2 WMAE: [1543.97]
Max_features: 8 min_samples_split: 5 , min_samples_leaf: 3 WMAE: [1579.69]
Max_features: 8 min_samples_split: 7 , min_samples_leaf: 1 WMAE: [1556.37]
Max_features: 8 min_samples_split: 7 , min_samples_leaf: 2 WMAE: [1563.45]
Max_features: 8 min_samples_split: 7 , min_samples_leaf: 3 WMAE: [1588.34]

On the test data we got a WMAE score of 1503.92.

Generating the submission file by predicting with the final model:
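Kaggle expects an Id column formatted as Store_Dept_Date alongside the predicted Weekly_Sales; a sketch (the two-row frame and the predictions are illustrative stand-ins for the merged test set and the tuned Random Forest's output):

```python
import pandas as pd

# Toy stand-ins for the merged test set and the final model's predictions
test = pd.DataFrame({"Store": [1, 1], "Dept": [1, 1],
                     "Date": ["2012-11-02", "2012-11-09"]})
preds = [36000.0, 22000.0]

# Kaggle expects Id formatted as Store_Dept_Date
submission = pd.DataFrame({
    "Id": test["Store"].astype(str) + "_" + test["Dept"].astype(str)
          + "_" + test["Date"],
    "Weekly_Sales": preds,
})
submission.to_csv("submission.csv", index=False)
```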

Woohoo!! Got a score of 2754.36, which is 37th on the leaderboard, i.e. almost in the top 5% of Kaggle competitors.

Below are the WMAE score for all other models tried:

Note: Though the hyperparameter-tuned XGBoost model gave a better score on the test data, it didn't perform well upon submission.

Future Work:

The Markdown data may be considered for further improvement of the model. In addition, time-series models such as ARIMA could be used for a better score. We used Prophet here, which didn't perform well, but ARIMA has proven to be a better time-series approach. The holiday data could also be incorporated into the time-series models.

Here is the Github link of the project.

Linkedin: https://www.linkedin.com/in/rakeshreddy95/

Reference:

https://medium.com/analytics-vidhya/walmart-recruiting-store-sales-forecasting-kaggle-competition-856c72c9265a

https://www.kaggle.com/avelinocaio/walmart-store-sales-forecasting

https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

Thanks for reading this far!! If something is missing, new ideas and approaches are always welcome; please leave a comment.

Machine Learning enthusiast, currently working in Oracle India.