Online Shoppers Purchasing Intention Project
The objective of this project is to classify whether a visitor to an e-commerce website will generate revenue. In other words, the project builds a classification model that predicts whether a customer will end up making a purchase after browsing the site. The baseline models are then tuned to improve performance, and a pipeline integrates the pre-processing and modeling steps.
You can try to predict some values here: https://fadilah-milestone2p1.herokuapp.com/
Template by Fadilah, powered by Bootstrap v5.1.
Obtained from UCI ML: link
This dataset contains user activity on an e-commerce website. It has 17 predictors and one target variable that indicates whether the user ended up making a purchase after browsing the site (that is, whether the session brought revenue). Of the 12,330 sessions in the dataset, 84.5% (10,422) are negative-class samples that did not end with a purchase, and the rest (1,908) are positive-class samples that did.
The dataset consists of 10 numerical and 8 categorical attributes. The 'Revenue' attribute can be used as the class label.
The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period.
You can also see the analysis from here: link
- For each customer, BounceRates and ExitRates move together: when one increases, so does the other. This means that if a customer leaves a page without doing anything, the ExitRates value for that customer increases as well.
- A customer with a higher PageValues than others is more likely to bring revenue, because a high value indicates a strong tendency to shop.
- Most of the customers who end up buying something in a session are returning visitors. A customer who has been on the website before is more likely to buy.
- Offer vouchers to customers who have been browsing many similar keywords or clicking through different pages during a session.
- Offer a special voucher to returning visitors once they have marked several similar items or revisited items they viewed before, to increase their sense of urgency to buy.
1. Convert target variable to numerical format.
2. Encode nominal and cyclical features: one-hot encoding for nominal features and a sine/cosine transformation for cyclical features.
3. Treat outliers in several steps, such as applying a logarithmic transformation and capping outliers with the Tukey method.
4. Bin the least frequent values of each nominal variable that was previously label-encoded, then encode the binned values with a one-hot encoder.
5. Split the dataset into 70% training, 20% validation (separate validation set, different from the one in cross-validation), and 10% test set for model inference.
6. Apply standardization (feature scaling) and SMOTE oversampling (sampling strategy = 1/3), both integrated inside the pipeline for each base model and for the best model.
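Steps 2 and 3 above can be illustrated with a minimal sketch that uses only the Python standard library. The helper names `encode_cyclical` and `tukey_cap`, the sample values, the fence multiplier of 1.5, and the order "log transform first, then cap" are assumptions made for illustration; the project's own implementation may differ.

```python
import math
import statistics


def encode_cyclical(value, period):
    """Map a cyclical value (e.g. month 1..12) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def tukey_cap(values, k=1.5):
    """Clip values to the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical month numbers: December and January should end up close
# together after encoding, while June lands on the opposite side.
dec = encode_cyclical(12, period=12)
jan = encode_cyclical(1, period=12)
jun = encode_cyclical(6, period=12)

# Hypothetical right-skewed durations in seconds (not taken from the dataset).
durations = [0.0, 12.0, 35.0, 60.0, 80.0, 120.0, 150.0, 400.0, 90000.0]
logged = [math.log1p(v) for v in durations]  # log(1 + x) keeps zeros valid
capped = tukey_cap(logged)  # extreme values are pulled back to the fences
```

The sine/cosine pair keeps the distance between December and January small, which a plain integer encoding (12 versus 1) would not.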
The main evaluation metrics in this project are AUC and Precision. The data is class-imbalanced, so accuracy would not represent the model's actual performance. Precision is emphasized because we want to minimize the number of False Positives. This choice is based on which type of error would do more harm to the company, mainly in terms of revenue and ROI (Return on Investment).
The predictions of this model can be used to decide whether to give a promotional voucher, coupon, discount, or rebate to customers who are predicted to end up buying (bringing revenue). The redemption rate (the share of customers who redeem a voucher or coupon) will be very low if there are many False Positives, which also hurts the company's ROI, since budget has already been allocated to the promotion or campaign.
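To make the metric choice concrete, here is a small sketch in plain Python. It uses the class counts reported for this dataset to show why accuracy is misleading, then computes Precision and AUC on invented numbers. The helper names, the campaign counts, and the session scores are hypothetical and only serve as an illustration.

```python
def precision(tp, fp):
    """Share of predicted buyers who actually buy: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count a win for each (positive, negative) pair ranked correctly; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class counts reported for this dataset.
n_negative, n_positive = 10_422, 1_908

# A model that always predicts "no purchase" is right on every negative session,
# so it reaches about 84.5% accuracy without finding a single buyer.
baseline_accuracy = n_negative / (n_negative + n_positive)

# Hypothetical campaign: 100 vouchers sent, 60 of them reach real buyers.
campaign_precision = precision(tp=60, fp=40)

# Hypothetical labels and predicted scores for six sessions.
auc = roc_auc([0, 0, 1, 0, 1, 1], [0.1, 0.4, 0.35, 0.2, 0.8, 0.7])
```

In the hypothetical campaign, a Precision of 0.6 means 40 of the 100 vouchers went to customers who were never going to buy, which is the cost the project wants to keep low.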
Insights from Modeling
Among the predictors, the three most important features are PageValues, ExitRates, and ProductRelated_Duration. Together they capture how long customers spend browsing products on the website and how often visitors exit from the pages they view. This matches the EDA, where we detected significant differences between class 0 and class 1 in these features, especially PageValues and ExitRates. Splits on PageValues yield the lowest Gini impurity (that is, the largest impurity reduction), which is why it ranks as the most important feature, with a score much higher than the rest of the top features. In the correlation matrix, it is also the only feature with a medium correlation with the target variable.
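The link between Gini impurity and feature importance can be sketched as follows, assuming the importance scores are impurity-based (the usual default for tree ensembles; the source does not state this explicitly). The root counts are the class counts reported for the dataset, while the child-node counts for the two candidate splits are invented for illustration.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity produced by one split."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Class counts (negative, positive) reported for the full dataset.
root = (10_422, 1_908)

# Hypothetical child nodes for two candidate splits. The counts are invented
# and are not taken from the fitted model.
strong = impurity_decrease(root, left=(9_500, 400), right=(922, 1_508))
weak = impurity_decrease(root, left=(5_300, 900), right=(5_122, 1_008))
```

A feature whose splits look like the `strong` case separates buyers from non-buyers well and accumulates a large importance score, whereas a feature that only produces `weak` splits contributes almost nothing.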
After tuning the candidate models with grid search, the best model is Random Forest.