Term Deposit Subscription Prediction
A model for predicting whether a client subscribes to a term deposit.
Template by Fadilah, powered by Bootstrap v5.1.
The data is related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Source: UCI ML Repo > Bank Marketing Data Set
Create a model for predicting whether a client subscribes to a term deposit.
The main objective can be broken down into four parts:
1. Compare the performance of the following algorithms:
- Logistic Regression,
- Support Vector Machine,
- Decision Tree,
- Random Forest,
- K-Nearest Neighbors,
- Naive Bayes.
2. Decide which algorithm to use based on the best performance on the chosen metrics.
3. Analyze which parts of the model need improvement, based on the overall modeling stages.
4. Predict the outcome for a new client's input using the deployed site.
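The comparison in step 1 can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: it assumes scikit-learn, uses a synthetic imbalanced dataset as a stand-in for the bank marketing data, and keeps every model at its default settings. The helper names `scores_for_auc` and `evaluate` are hypothetical.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic, imbalanced data standing in for the real dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The six algorithms from the list above; scale-sensitive ones get a scaler.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
}


def scores_for_auc(model, X):
    """Return continuous scores for class 1 (probabilities or margins)."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    return model.decision_function(X)


def evaluate(models, X_train, y_train, X_val, y_val):
    """Fit each model and collect AUC, F1 and prediction time."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        start = time.perf_counter()
        pred_val = model.predict(X_val)
        predict_seconds = time.perf_counter() - start
        rows.append(
            {
                "model": name,
                "auc_train": roc_auc_score(y_train, scores_for_auc(model, X_train)),
                "auc_val": roc_auc_score(y_val, scores_for_auc(model, X_val)),
                "f1_val": f1_score(y_val, pred_val),
                "predict_seconds": predict_seconds,
            }
        )
    # Best validation AUC first.
    return sorted(rows, key=lambda r: r["auc_val"], reverse=True)


results = evaluate(models, X_train, y_train, X_val, y_val)
for row in results:
    print(
        f"{row['model']:<24} AUC train={row['auc_train']:.3f} "
        f"val={row['auc_val']:.3f} F1 val={row['f1_val']:.3f} "
        f"predict={row['predict_seconds']:.4f}s"
    )
```

Reporting the train and val AUC side by side makes it easier to spot models whose train score is much higher than their val score. The numbers on synthetic data say nothing about the ranking on the real dataset.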
Best Model based on AUC: SVM
- Based on the AUC and F1 score, the SVM Classifier performs best among the algorithms. Its train score is slightly higher than its test score, but both still fall within the range of ~73% - 76%.
- SVM also has the highest recall of all the models, on both the val and train sets. This means it captures more of the actual class 1 cases than the other models.
- SVM takes a long time to predict values: it is the 2nd slowest model in this experiment. This is likely due to the nature of the algorithm, which involves calculating distances or, more precisely, the margin between vectors. Models that rely on distance calculations normally take longer to predict. The two slowest models are SVM and KNN, both of which use distance measurements.
- The highest precision is achieved by the Random Forest Classifier: 67% on val and 82% on train. This means that, on the train set, 82% of the clients the model predicts as class 1 actually belong to class 1.
- Besides the best model (SVM), the tree-family models have the 2nd to 4th best AUC scores on val. This may reflect the nature of our data, in which few of the features show linear relationships. Tree-based models are well suited to non-linear data.
- The AUC scores of AdaBoost are quite consistent between the val and train sets; the same holds for Gaussian Naive Bayes.
- Gaussian Naive Bayes has the most consistent scores between the val and train sets across all evaluation metrics.
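To make the recall and precision readings above concrete, here is a small worked example. It is a sketch on made-up labels, not the project's data, and the function name `precision_recall` is hypothetical. Recall is the share of actual class 1 cases that the model finds; precision is the share of class 1 predictions that are correct.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 6 actual subscribers (class 1) and 4 non-subscribers.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# The model predicts class 1 five times: 4 are correct, 1 is not.
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
# precision = 4 / 5 = 0.800, recall = 4 / 6 = 0.667
```

In this toy case the model is right 80% of the time when it predicts class 1, but it still misses a third of the actual class 1 cases.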
To achieve a higher AUC and to increase recall and precision, we can try another encoding method. We can also try transforming the outliers.
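One way these two ideas could look in practice is sketched below. Everything here is an assumption made for illustration: the toy column names and values, the choice of one-hot encoding as the alternative encoding, and the choice of IQR clipping and a log transform for the outliers. The helper `clip_outliers_iqr` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "job": ["admin.", "technician", "services", "admin.", "retired", "technician"],
        "balance": [120.0, 450.0, 300.0, 98000.0, 210.0, 380.0],
    }
)


def clip_outliers_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# One-hot encoding: one 0/1 column per category instead of a single integer code.
encoded = pd.get_dummies(df, columns=["job"], dtype=int)

# Two alternative outlier treatments, kept side by side for comparison.
encoded["balance_clipped"] = clip_outliers_iqr(df["balance"])
encoded["balance_log"] = np.log1p(df["balance"].clip(lower=0))

print(encoded)
```

Clipping keeps the original scale but flattens extreme values, while the log transform compresses the whole range. Which of the two helps, if either, would need to be checked against the val scores.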