Online Shoppers Purchasing Intention

Abdullah Almokainzi · Published in Analytics Vidhya · 4 min read · Dec 3, 2020


The recent boom in online shopping has added a new dimension to the business sector. People explore online stores to find the items they need and buy them through online transactions, which has made their lives easier and more comfortable. At the same time, it has become essential for sellers to understand the patterns and intentions of different types of online customers. A customer’s purchase intention can be predicted by analyzing their browsing history. In this post, online shopping behavior data is analyzed to build classification models that predict whether a customer visiting the web pages of an online shop will end up making a purchase. Three classifiers are compared: Passive Aggressive, SVM, and Random Forest. The results show that Random Forest achieved higher accuracy and precision than the other two models.

The dataset consists of feature vectors belonging to 12,330 sessions. It was formed so that each session would belong to a different user over a one-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period. Of the 12,330 sessions, about 85% (10,422) are negative class samples that did not end with shopping, and the rest (1,908) are positive class samples ending with shopping. The dataset consists of 10 numerical and 8 categorical attributes; the Revenue attribute is used as the class label.
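Below is a minimal sketch of how such a dataset could be loaded and split; the CSV filename follows the UCI repository’s naming, and the 60/20/20 split proportions are assumptions, not necessarily the article’s exact setup.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the UCI Online Shoppers Purchasing Intention dataset
# (filename assumed from the UCI repository).
df = pd.read_csv("online_shoppers_intention.csv")

# One-hot encode the categorical attributes; Revenue is the class label.
X = pd.get_dummies(df.drop(columns="Revenue"))
y = df["Revenue"].astype(int)

# Hold out a test set and a validation set (60/20/20 split assumed),
# stratifying so each split keeps the same class proportions.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```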

Passive Aggressive Classifier

Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive when an example is classified correctly and turns aggressive on a misclassification, updating the model just enough to correct the mistake. Unlike most other algorithms, it does not converge: its purpose is to make updates that correct the loss while causing very little change to the norm of the weight vector. The Passive Aggressive algorithm is well suited to classifying massive streams of data (e.g. Twitter). It is easy to implement and very fast, but it does not provide the global guarantees of a support-vector machine (SVM).
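As a rough illustration, here is how such a model could be fit with scikit-learn’s PassiveAggressiveClassifier; the scaling step and hyper-parameters are assumptions rather than the article’s exact configuration.

```python
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features first: margin-based online learners are
# sensitive to feature magnitudes.
pa_clf = make_pipeline(
    StandardScaler(),
    PassiveAggressiveClassifier(max_iter=1000, random_state=42),
)
pa_clf.fit(X_train, y_train)
```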

[Figure: Passive Aggressive Classifier confusion matrices on the testing and validation sets]

In our case, the Passive Aggressive Classifier scored 84% accuracy on both the test set and the validation set.
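The confusion matrices and accuracy scores above can be reproduced along these lines (exact figures will depend on the split and hyper-parameters):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Evaluate on both held-out sets.
for name, X_eval, y_eval in [("test", X_test, y_test),
                             ("validation", X_val, y_val)]:
    y_pred = pa_clf.predict(X_eval)
    print(name, "accuracy:", accuracy_score(y_eval, y_pred))
    print(confusion_matrix(y_eval, y_pred))
```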

Support Vector Classifier

[Figure: Support Vector Machines classification process]

Support Vector Machines (SVMs) are a supervised learning method used primarily for classification, though they are also suitable for certain regression tasks. The basic SVM algorithm is a binary linear classifier: it categorizes unseen data points into one of two groups based on a set of labelled “training” points, drawing a linear decision boundary between the two categories.
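A minimal sketch with scikit-learn’s SVC follows; the RBF kernel and default parameters are assumptions.

```python
from sklearn.svm import SVC

# SVMs also benefit from feature scaling, so reuse the same pipeline idea.
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_clf.fit(X_train, y_train)
print("test accuracy:", svm_clf.score(X_test, y_test))
```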

[Figure: Support Vector Classifier confusion matrices on the testing and validation sets]

The Support Vector Classifier scored 85% accuracy on both the test set and the validation set, slightly higher than the Passive Aggressive Classifier.

Random Forest Classifier

[Figure: Random Forest classification process]

A random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest outputs a class prediction, and the class with the most votes becomes the model’s prediction.
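A minimal sketch with scikit-learn’s RandomForestClassifier; 100 trees is the library default, not necessarily the article’s setting.

```python
from sklearn.ensemble import RandomForestClassifier

# Trees are scale-invariant, so no feature scaling is needed here.
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
print("test accuracy:", rf_clf.score(X_test, y_test))
```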

[Figure: Random Forest Classifier confusion matrices on the testing and validation sets]

The Random Forest Classifier scored 91% accuracy on both the test set and the validation set, outperforming both previous models (the Passive Aggressive and Support Vector Classifiers).

So what can we do in the future, and how can this be improved? Instead of accuracy, we might want to optimize for precision. We can try hyper-parameter tuning so the model does not simply predict False for almost every instance of the target feature (Revenue). We can also adjust the proportion of Revenue classes in the training set, either by finding more True instances, by upsampling the minority (True) class, and/or by downsampling the majority (False) class. Additional feature engineering could yield better performance as well. Finally, Random Forest’s feature importances should be considered: features that do not contribute to reducing impurity in the trees could be dropped, which might make it easier for the model to generalize. A sketch of two of these ideas follows.
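Here is a sketch of re-weighting the imbalanced Revenue classes and inspecting feature importances to find candidates to drop; the 0.01 importance threshold is an arbitrary illustration, not a recommendation from the article.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# class_weight="balanced" re-weights samples inversely to class frequency,
# discouraging the model from predicting False for everything.
rf_bal = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_bal.fit(X_train, y_train)

# Impurity-based importances; features near zero contribute little
# and could be dropped in a follow-up experiment.
importances = pd.Series(rf_bal.feature_importances_, index=X_train.columns)
weak = importances[importances < 0.01].index
print("candidate features to drop:", list(weak))
```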

Check out the source code on GitHub!
