Predicting Student Dropout Rates with Machine Learning
A Deep Dive into XGBoost and Neural Networks
Introduction
Predicting student dropout rates is a critical issue for educational institutions, as it directly affects both student success and the institution's financial health. By leveraging machine learning, we can help institutions identify at-risk students and intervene early. In this blog post, I'll take you through my journey of applying machine learning models—XGBoost and a neural network—to predict student dropout and highlight key insights from the process.
Understanding the Data
The dataset used for this project contained 25,059 student records with 36 features. These features ranged from demographic information, such as gender and age, to academic performance metrics like credit-weighted average scores. After a thorough data cleaning process, I retained 11 key features for the models. These included important variables like attendance rates, contact hours, and unauthorised absences—factors that strongly correlate with student engagement and success.
To ensure the machine learning models could interpret the data properly, I scaled the numerical features and used one-hot encoding to convert categorical variables (such as gender and centre name) into binary columns. This preprocessing step ensures that the models aren't biased by differences in the ranges or types of features.
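This preprocessing step can be sketched with scikit-learn's ColumnTransformer. The column names below are illustrative placeholders, not the project's actual schema:

```python
# Sketch of the preprocessing described above: scale numeric features,
# one-hot encode categoricals. Column names are illustrative only.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "attendance_pct": [92.0, 61.5, 78.0],
    "contact_hours": [120, 80, 95],
    "gender": ["F", "M", "F"],
    "centre_name": ["North", "South", "North"],
})

numeric_cols = ["attendance_pct", "contact_hours"]
categorical_cols = ["gender", "centre_name"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),  # zero mean, unit variance
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one binary column per category
])

X = preprocess.fit_transform(df)
print(X.shape)  # 2 scaled numeric columns + 4 one-hot columns
```

Wrapping both steps in one transformer keeps the exact same preprocessing reusable on new student records at prediction time.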
Building the Models: XGBoost vs. Neural Networks
I employed two models to predict dropout rates: XGBoost and a neural network.
XGBoost
XGBoost is a powerful algorithm known for its speed and performance on structured data. It's highly interpretable, offering clear insight into which features matter most in predicting student outcomes. Using GridSearchCV, I fine-tuned hyperparameters such as learning rate and max depth to optimise the model. SHAP (SHapley Additive exPlanations) plots provided further insight into feature importance, showing that academic performance, attendance, and contact hours were among the top factors.

Neural Networks
Neural networks excel at capturing complex patterns in data, particularly nonlinear relationships that simpler models might miss. I built a Sequential neural network with two hidden layers, applied early stopping to avoid overfitting, and used GridSearchCV to optimise the network's structure and parameters. The neural network, while initially less interpretable than XGBoost, showed its strength in recognising patterns when I reintroduced critical features like attendance and contact hours.
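The Keras code isn't reproduced here, but the same idea — two hidden layers plus early stopping on a validation split — can be sketched with scikit-learn's MLPClassifier as a rough stand-in. Layer sizes and the synthetic data are illustrative assumptions:

```python
# Rough stand-in for the Keras Sequential model: a two-hidden-layer MLP
# with early stopping. Layer sizes and data are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=11,
                           weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),   # two hidden layers
    early_stopping=True,           # hold out part of the training data...
    validation_fraction=0.1,       # ...and stop when validation score plateaus
    max_iter=500,
    random_state=0,
)
mlp.fit(X_train, y_train)
print(round(mlp.score(X_test, y_test), 3))
```

Early stopping monitors held-out validation performance during training and halts before the network starts memorising the training set, which is the overfitting guard described above.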
Key Insights from Data Analysis
Several important trends emerged from the analysis:
Academic Performance: Students with higher credit-weighted averages were significantly less likely to drop out. This isn't surprising, as academic success often correlates with student retention.
Attendance and Engagement: Low attendance rates and fewer contact hours were strong indicators of potential dropouts: students who frequently missed classes tended to disengage and struggle academically, both of which raise the likelihood of dropping out.
Unauthorised Absences: The number of unauthorised absences also played a key role. Students who frequently missed classes without authorisation were at a higher risk of dropping out.
Comparing Model Performance
Both models performed well, but each had its strengths. XGBoost was slightly easier to interpret, thanks to its ability to rank feature importance directly. This makes it a great choice for institutions looking to understand the "why" behind predictions. On the other hand, the neural network showed a small performance advantage after fine-tuning and reintroducing critical features. Here’s a summary of the models’ performance:
XGBoost (After Tuning): Accuracy of 97.24%, with high recall and AUC scores, indicating strong performance in identifying students at risk of dropping out.
Neural Network (After Reintroducing Features): The accuracy climbed to 97.53%, with improved precision and recall, making it the top-performing model overall.
Both models provided actionable insights for early interventions, helping institutions to focus on students most at risk of dropping out.
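These metric types (accuracy, precision, recall, AUC) can be computed for any fitted model with scikit-learn. The predictions below are toy placeholders, not the project's results:

```python
# Computing the reported metric types from predictions.
# y_true/y_pred/y_score are toy values for illustration only.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]   # 1 = dropped out
y_pred  = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]   # hard class predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.9, 0.8, 0.4, 0.2, 0.7, 0.1]  # P(dropout)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))   # share of dropouts caught
print("AUC      :", roc_auc_score(y_true, y_score))  # needs scores, not labels
```

Note that AUC is computed from predicted probabilities rather than hard labels, which is why it is reported alongside accuracy for imbalanced problems like this one.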
Visualising the Results
To illustrate the findings, I used several key visualisations:
Histograms: These showed the imbalance in the dataset, with the majority of students completing their courses and a smaller percentage dropping out.
Box Plots: These helped identify the impact of key features like credit-weighted average and attendance percentage on dropout rates.
SHAP Plots: These explained which features contributed most to the model’s predictions, confirming that academic performance and attendance were among the top indicators.
What Could Be Improved?
Although the models performed well, there are several areas for future improvement:
Cross-Validation: Using k-fold cross-validation would provide more reliable performance metrics by reducing the risk of overfitting.
Handling Imbalanced Data: The dataset was imbalanced, with far more students completing their courses than dropping out. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or cost-sensitive learning could be employed to improve performance on the minority class (dropouts).
Feature Engineering: While the models captured key patterns, advanced feature engineering techniques like PCA (Principal Component Analysis) or recursive feature elimination could streamline the process and improve model efficiency.
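Two of these proposed improvements — stratified k-fold cross-validation and cost-sensitive learning via class weights — can be sketched together. The model choice and data below are illustrative assumptions, not part of the original pipeline:

```python
# Sketch of two proposed improvements: stratified k-fold cross-validation
# and cost-sensitive learning via class weights. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=11,
                           weights=[0.85], random_state=1)

# class_weight="balanced" up-weights the minority (dropout) class,
# a cost-sensitive alternative to oversampling methods like SMOTE
model = RandomForestClassifier(class_weight="balanced", random_state=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # keeps class ratio per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")  # recall on dropouts
print(round(scores.mean(), 3), "+/-", round(scores.std(), 3))
```

Reporting a mean and standard deviation across folds gives a much more honest picture than a single train/test split, especially when the minority class is small.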
Conclusion
Both XGBoost and neural networks proved to be effective tools for predicting student dropout, with the neural network taking a slight edge in overall performance after feature reintroduction. The analysis highlighted the importance of academic performance, attendance, and engagement in determining student success. By leveraging these insights, educational institutions can take early action to improve retention and support students who may be struggling.
Moving forward, integrating more advanced techniques like cross-validation, regularisation, and better handling of imbalanced data could further enhance the models' robustness and accuracy. But even in its current form, this project demonstrates the power of machine learning to make a real-world impact on student success.