A Machine Learning Approach to Predictive Modelling of Student Performance

Hu Ng*, Azmin Alias bin Mohd Azha, Timothy Tzen Vun Yap, Vik Tor Goh

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)
24 Downloads (Pure)


Background - Many factors affect student performance, such as the individual's background, habits, absenteeism and social activities. Once these factors are identified, corrective actions can be determined to improve performance. This study presents a data mining approach to identify the significant factors and predict student performance, based on two datasets collected from two secondary schools in Portugal.

Methods - In this study, two datasets are merged to increase the sample size. Following that, data pre-processing is performed and the features are normalized with linear scaling to avoid bias towards heavily weighted attributes. The selected features are then assigned to four groups: student background, lifestyle, history of grades, and all features. Next, Boruta feature selection is performed to remove irrelevant features. Finally, classification models based on Support Vector Machine (SVM), Naïve Bayes (NB), and Multilayer Perceptron (MLP) are designed and their performance is evaluated.
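The merging and linear-scaling steps described above can be illustrated with a minimal sketch. It assumes that "merging" means stacking the two record sets row-wise and that "linear scaling" means min-max rescaling to [0, 1]; the abstract confirms neither detail. The function names, column names and toy values are hypothetical and are not taken from the study.

```python
from typing import Dict, List

Record = Dict[str, float]


def merge_datasets(first: List[Record], second: List[Record]) -> List[Record]:
    """Stack two record lists that share the same columns (assumed row-wise merge)."""
    return list(first) + list(second)


def min_max_scale(records: List[Record]) -> List[Record]:
    """Linearly rescale every feature to [0, 1] using its column minimum and maximum."""
    if not records:
        return []
    columns = list(records[0].keys())
    lows = {c: min(r[c] for r in records) for c in columns}
    highs = {c: max(r[c] for r in records) for c in columns}
    scaled = []
    for r in records:
        row = {}
        for c in columns:
            span = highs[c] - lows[c]
            # A constant column has zero range; map it to 0.0 to avoid dividing by zero.
            row[c] = (r[c] - lows[c]) / span if span else 0.0
        scaled.append(row)
    return scaled


# Hypothetical toy records; the column names are illustrative only.
school_a = [{"absences": 0.0, "studytime": 1.0}, {"absences": 10.0, "studytime": 4.0}]
school_b = [{"absences": 5.0, "studytime": 2.0}]

merged = merge_datasets(school_a, school_b)
scaled = min_max_scale(merged)
```

Scaling after the merge means both sources share the same minimum and maximum per feature, so a value has the same scaled meaning regardless of which dataset it came from.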

Results - The models were trained and evaluated on an integrated dataset comprising 1044 student records with 33 features, after feature selection. Classification was performed with SVM, NB and MLP using 60-40 and 50-50 train-test splits and 10-fold cross-validation. GridSearchCV was applied for hyperparameter tuning. The performance metrics were accuracy, precision, recall and F1-score. For two-level (binary) classification with the 50-50 train-test split, SVM obtained the highest accuracy, with scores of 77%, 80%, 91% and 90% on the background, lifestyle, history of grades and all-features groups respectively. SVM also obtained the highest accuracy for five-level classification, with 39%, 38%, 73% and 71% for the four groups respectively. The results show that the history of grades has a significant influence on student performance.
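For readers unfamiliar with the four reported metrics, the following sketch shows how accuracy, precision, recall and F1-score are computed for the binary case. It is an illustration under stated assumptions, not the study's evaluation code: the helper name binary_metrics, the choice of label 1 as the positive class, and the toy labels are all hypothetical.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1}; 1 is the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a denominator is zero (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels, not data from the study.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

Accuracy alone can be misleading when one class dominates, which is one reason to report precision, recall and F1-score alongside it.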

Original language: English
Article number: 1144
Publication status: Published - 23 May 2022


  • Data mining
  • Multilayer perceptron
  • Naïve Bayes
  • Student performance
  • Support vector machine

ASJC Scopus subject areas

  • General Biochemistry, Genetics and Molecular Biology
  • General Immunology and Microbiology
  • Pharmacology, Toxicology and Pharmaceutics(all)

