Water is one of the most critical natural resources, essential for sustaining life, supporting ecosystems, and enabling socio-economic development [1]. As the world's population grows and urbanisation and industrialisation accelerate, the demand for clean and safe water has increased significantly [2]. However, the quality of water and its sources across the globe has been deteriorating due to pollution from domestic sewage, industrial discharge, agricultural activities and runoff, other human-induced factors, and rapid urbanisation [3]. As a result of water pollution, human beings suffer from a variety of health problems, including skin disease, diarrhoea, dysentery, respiratory illnesses, anaemia, and complications in childbirth [4]. As water pollution increases, real-time monitoring and accurate prediction of water quality have become essential to ensure public health, environmental protection, and regulatory compliance [5]. Traditional water quality monitoring methods often rely on manual sampling and laboratory analyses; they are time-consuming, labour-intensive, and may not provide real-time results [6]. These traditional approaches involve laboratory testing using chemical or biological methods to measure parameters such as pH, hardness, sulfates, chloramines, solids, turbidity, and others. Sensor technologies and the Internet of Things (IoT) have significantly improved data collection in water monitoring systems [7].
Nevertheless, as the need for precision and accuracy increases, most data generated from many sensors require robust computational models for real-time interpretation and actionable decision-making [8], [9]. Moreover, the increased complexity and volume of water quality data often collected through real-time sensors may necessitate advanced data-driven approaches to ensure timely and accurate assessment. In response to these challenges, the integration of Machine Learning (ML) techniques into water quality analysis and prediction has gained significant attention, offering the potential for efficient, accurate, and real-time monitoring solutions [10], [11].
This work employs a number of ML models to assess the quality of water and predict its potability. A publicly available dataset is used to train the developed models. Four ML algorithms, namely Random Forest (RF), XGBoost, Logistic Regression (LR), and Deep Learning (Multilayer Perceptron, MLP), were employed to assess and predict water potability for the given dataset. Performance was compared using metrics such as precision, accuracy, recall, F1-score, and AUC (area under the receiver operating characteristic curve).
The contributions of this study are:
Feature selection via importance analysis: A feature importance analysis was used to select a smaller number of impactful features to build the ML models. This approach makes the models lightweight with reduced computational overhead, and hence, more suitable for real-time deployment.
Performance enhancement through preprocessing: The performance of the models was enhanced using class balancing, hyperparameter tuning, and feature engineering.
Real-Time application deployment and graphical user interface (GUI) integration for practical use: The best performing model was integrated into a user-friendly water potability prediction application to classify water as either “safe to drink” or “unsafe to drink” in real-time.
The rest of this paper is structured as follows. Section II provides a literature review of the subject. The methods are discussed in Section III. Section IV provides simulation results and discussion. Conclusions are highlighted in Section V.
ML techniques are powerful tools for analysing complex, multivariate, multi-dimensional, and nonlinear datasets. ML algorithms, including supervised, unsupervised, and reinforcement learning paradigms, have proven to be powerful in modelling complex, nonlinear relationships, which is the case with environmental data [12], [13]. Supervised learning models such as Support Vector Machines (SVM), RF, and Artificial Neural Networks (ANN) have been employed to predict various water quality parameters, including pH, dissolved oxygen, turbidity, and biochemical oxygen demand [14], [15]. For instance, the work in [16] utilised supervised ML techniques to predict water quality parameters to achieve high accuracy levels. The study applied these models, revealing their superiority over conventional techniques in capturing nonlinear relationships between water parameters. The paper in [17] demonstrated the application of unsupervised ML for anomaly detection in water treatment systems to ensure water safety.
Many studies have also validated the capability of supervised learning algorithms such as ANNs and SVMs to predict water quality. The researchers in [18] compared ML models and emphasised the role of big data in assessing water quality.
The utilisation of Deep Learning (DL) has further enhanced the capabilities of ML in water quality analysis. DL offers significant advantages in handling complex and high-dimensional water quality data. For example, Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) networks, have been effective in modelling temporal dependencies in water quality time series and generating accurate predictions, as shown in [19] and [20]. The study in [21] employed Convolutional Neural Networks (CNN) combined with LSTM (CNN-LSTM) to simulate parameters such as pH and dissolved oxygen, showing improved accuracy in predicting water quality dynamics.
Hybrid approaches have gained attention for their enhanced predictive power. The researchers in [22] proposed a hybrid decision tree model that outperformed standalone algorithms in short-term water quality forecasting. The study in [23] demonstrated the effectiveness of integrating CNN with LSTM networks to predict short-term fluctuations in water quality. These models excel at modelling temporal and spatial dependencies in water quality datasets, especially under varying environmental conditions, and support the development of real-time water quality monitoring systems. The study in [24] used novel hybrid algorithms to improve water quality indices, while the analysis in [25] demonstrated the feasibility of real-time classification of water quality classes using supervised ML. Such systems are capable of making real-time decisions and provide an early-warning mechanism for water quality management.
Moreover, the integration of IoT devices with ML models has enabled the development of intelligent water quality monitoring systems. IoT sensors facilitate real-time data collection, which, when analysed with ML algorithms, can provide timely insights into water quality. The work in [26] proposed a probabilistic ML model integrated with IoT sensors for water quality level estimation, demonstrating its effectiveness in real-world scenarios. ML remains a powerful tool for generating predictions and trends and providing a comprehensive understanding and solution to complex problems and systems.
Identifying the most relevant features that are required by the ML models is essential for building efficient and interpretable models. The studies in [18] and [27] emphasised the use of data-driven techniques for selecting relevant features and pollution sources. Recent studies, e.g., [28], have explored interpretable ML models to quantify the effect of multiple pollutants on water quality prediction.
Despite these advancements, challenges exist in the application of ML to water quality analysis. High-quality and comprehensive datasets remain a prerequisite for practical model training. Issues such as data scarcity, sensor reliability, and model interpretability need to be addressed to fully utilise the power of ML in this domain [29]. Ongoing research focuses on developing robust, scalable, and interpretable ML models that can operate effectively under varying environmental conditions.
ML models have transformed the field of water quality analysis and prediction by offering scalable, fast, and adaptive solutions. From classical models such as SVMs to more advanced architectures such as CNN-LSTM hybrids, these methods have shown strong potential in prediction, classification, and real-time analysis. However, a number of challenges persist. First, data quality issues such as missing values and class imbalance are observed in environmental datasets. Although some studies acknowledge these problems, they are not usually treated systematically. Second, generalisability is limited in most of the studies that rely on a dataset from a single water source or region. Hence, it is challenging to use developed models with other sources or in different locations. Third, interpretability is important, particularly in environmental applications. Yet, DL models in the literature may not necessarily be transparent, with only a few studies utilising interpretable ML or feature importance analysis [28].
This study closes these gaps by handling missing values, applying class balancing, and using feature engineering and feature importance to improve the performance and interpretability of the models. Furthermore, comparing different models indicates the relative robustness of ensemble, linear, and neural network models [23], [24]. Incorporation of additional sources of information, such as IoT sensors and climate models, and building hybrid models for varying environmental conditions, are prospects to increase the accuracy of predictions and real-time monitoring [26].
This section describes the experimental steps followed in this study, including dataset analysis and processing, performance metrics, model development, and deployment.
The water potability dataset was retrieved from an open-source repository [30]. It contains a total of 3,276 records, with 1,998 records labelled as non-potable or unsafe for drinking and 1,278 records labelled as potable or safe for drinking. The dataset includes nine features: pH, hardness, solids, chloramines, sulfate, conductivity, organic carbon, trihalomethanes, and turbidity. Potability is represented by binary values, where 1 is for potable water and 0 is for non-potable water. These classifications are based on the concentration levels of the aforementioned substances and features.
The dataset was examined thoroughly to check its quality. It was found that there are 491 missing pH values, 781 missing sulfate values, and 162 missing trihalomethane values. The missing values were replaced with the respective column means, a common approach, especially when the data are fairly symmetric.
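As a concrete illustration, the following is a minimal pandas sketch of this imputation step; the file name and column names are assumptions based on the Kaggle water potability dataset [30], not the exact study code:

```python
import pandas as pd

# Load the potability dataset (file name is illustrative).
df = pd.read_csv("water_potability.csv")

# Inspect missing values per column: ph (491), Sulfate (781),
# and Trihalomethanes (162) are expected to be incomplete.
print(df.isna().sum())

# Replace missing entries with the respective column means.
for col in ["ph", "Sulfate", "Trihalomethanes"]:
    df[col] = df[col].fillna(df[col].mean())
```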
The next step was exploratory data analysis (EDA). This included examining the histogram distribution of each feature, handling missing values (by filling or removing them), generating a correlation analysis (heatmap), checking the class balance of the potability label, and computing statistical summaries (mean, median, standard deviation, minimum, and maximum). The results of this analysis were then presented visually. Figure 1 to Figure 4 show the distributions of pH, hardness, solids, and sulfate. These figures show that the features are not perfectly normally distributed; several (e.g., solids and sulfate) are skewed.
Distribution of pH
Distribution of hardness [mg/L]
Distribution of solids [ppm]
Distribution of sulfate [ppm]
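As an illustration of these EDA steps, a minimal sketch (continuing from the imputation snippet above and assuming matplotlib and seaborn are available) could look as follows:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical summary: mean, std, min, max, and quartiles per feature.
print(df.describe())

# Histograms reveal the skew in features such as Solids and Sulfate.
df.hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()

# Correlation heatmap of the nine features and the potability label.
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Class balance: 1,998 "Not Potable" (0) vs. 1,278 "Potable" (1).
print(df["Potability"].value_counts())
```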
Figure 5 shows the correlation heatmap of the nine features alongside potability. Some positive correlations can be observed (e.g., between solids and conductivity), while most other features are weakly correlated with each other and with potability. In general, these weak correlations suggest that the features are largely independent, with each feature contributing its own influence on potability.
Correlation heat map of the features
The dataset has class imbalance, with 1,998 records labelled "Not Potable" and 1,278 records labelled "Potable". This imbalance is an important consideration for designing the machine learning algorithms later, as it may necessitate applying class-balancing techniques. To better understand the features and their effect on potability, boxplots were generated to detect outliers, as shown in Figure 6 to Figure 9.
Boxplot for pH by potability
Boxplot for hardness [mg/L] by potability
Boxplot for chloramines [ppm] by potability
Boxplot for sulfate [ppm] by potability
From the above figures, it can be seen that for several features, the medians and distributions differ between potable and non-potable water, but often with overlap. Features such as pH, chloramines, sulfate, and trihalomethanes show visible differences.
Feature importance analysis was carried out using the Random Forest Gini importance (also known as Mean Decrease in Impurity), which measures how much each feature decreases impurity in the classification trees [31]. It is the default and most commonly used importance measure in the scikit-learn library. The method calculates the total reduction in impurity that each feature contributes across all decision trees in the Random Forest. Figure 10 shows the result of the feature importance analysis.
Feature importance analysis result
The features above the importance median are pH, hardness, sulfate, chloramines, and solids. Accordingly, these five features are identified as the most relevant for predicting water potability. A refined version of the dataset was generated, which consists of the cleaned data (without any missing records) with the five features along with the potability label (0, 1).
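The selection step can be sketched as follows. This is an illustrative reconstruction rather than the exact study code; it keeps features at or above the median importance:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=["Potability"])
y = df["Potability"]

# Fit a Random Forest and read its Gini (mean decrease in impurity)
# importances, the default importance measure in scikit-learn.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns)

# Keep the features at or above the median importance.
selected = importances[importances >= importances.median()].index.tolist()
print(importances.sort_values(ascending=False))
print("Selected features:", selected)
```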
Accuracy is one of the most commonly used metrics for assessing classification performance, and it is also the metric most often reported [32]. However, it can be deceptive and risky, especially with imbalanced datasets, where one class has far more samples than the other. In the dataset used in this work, there are 1,998 entries for the "Not Potable" class and 1,278 for the "Potable" class. The "Not Potable" class is therefore the majority class, whereas "Potable" is the minority class. If left unaddressed, machine learning algorithms tend to be biased towards predicting the majority class, yielding deceptively high overall accuracy but poor performance in detecting the minority class [33]. Moreover, a model may achieve high accuracy simply by predicting the majority class, which does not necessarily indicate that it has learned the true patterns in the dataset. It is therefore vital to use additional metrics such as the F1-score and AUC. The F1-score is the harmonic mean of precision (the proportion of true positives among all positive predictions) and recall (the proportion of actual positives that were identified), and thus indicates how well the model avoids both false positives and false negatives [32]. For example, if a model correctly classifies 80 of 100 safe samples but mislabels 40 unsafe samples as safe, precision and recall will differ, and the F1-score provides a balanced combined measure.
This consideration is crucial when the cost of misclassification is high. The AUC evaluates the ability of the model to distinguish between classes across all classification thresholds. An AUC of 0.5 indicates random guessing, while an AUC of 1.0 indicates that potable and non-potable samples are completely distinguishable. For instance, if the Random Forest algorithm has an AUC of 0.90, it ranks a randomly chosen potable sample above a randomly chosen non-potable sample 90% of the time. Accordingly, the F1-score and AUC are considered (besides accuracy, precision, and recall) because they provide a more in-depth and reliable picture of real-world performance than accuracy alone.
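A hedged sketch of how these metrics can be computed with scikit-learn on a stratified 80/20 split (the split used later in the experiments) follows; `selected` refers to the five features chosen above:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# A stratified split preserves the Potable/Not-Potable ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X[selected], y, test_size=0.2, stratify=y, random_state=42)

rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]  # class-1 scores for the AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
```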
In this study, four classification models were selected for their complementary strengths in binary classification problems with tabular real-world datasets such as the water potability dataset. RF, LR, XGBoost, and a DL model (the MLP) provide a balance between traditional machine learning tools, advanced ensemble models, and neural network-based approaches. Table 1 summarises the rationale, followed by a more detailed justification below. Table 2 lists the key parameters of each of the four models as implemented in Python.
Justification of model selection
| Model | Justification |
|---|---|
| Random Forest | Handles nonlinear patterns, provides built-in feature importance |
| Logistic Regression | Serves as a baseline due to its simplicity and interpretability |
| XGBoost | High accuracy and scalability, robust to noise and imbalance |
| Deep Learning MLP | Tests how well a neural net generalises in this domain |
Parameters of models in Python, with open-source libraries scikit-learn, xgboost, and keras
| Model | Library Used | Key Parameters | Role/Description |
|---|---|---|---|
| Random Forest | scikit-learn | n_estimators = 200 | Number of decision trees. More trees reduce variance and improve stability. |
| | | max_depth = 10 | Maximum depth of each tree. Prevents overfitting by limiting complexity. |
| | | min_samples_split = 5 | Minimum samples required to split a node. Controls model generalisation. |
| | | min_samples_leaf = 2 | Minimum samples required at a leaf node. Reduces overfitting on noise. |
| Logistic Regression | scikit-learn | solver = 'liblinear' | Optimisation algorithm suitable for small/medium datasets and binary classification. |
| | | penalty = 'l2' | Regularisation type. Prevents overfitting by penalising large coefficients. |
| | | C = 1.0 | Inverse of regularisation strength. Balances bias and variance. |
| XGBoost | xgboost | n_estimators = 300 | Number of boosting rounds. More rounds generally improve performance but risk overfitting. |
| | | learning_rate = 0.05 | Step size shrinkage. Smaller values make learning more robust. |
| | | max_depth = 6 | Depth of individual trees. Balances complexity and generalisation. |
| | | subsample = 0.8 | Fraction of training samples used per tree. Introduces randomness for robustness. |
| | | colsample_bytree = 0.8 | Fraction of features sampled per tree. Reduces correlation among trees. |
| Deep Learning MLP | TensorFlow/Keras | Dense(64, relu) → Dense(32, relu) → Dense(1, sigmoid) | Neural network layers: hidden layers with ReLU activation capture nonlinearity; the final sigmoid outputs a probability. |
| | | optimizer = Adam(lr=0.001) | Adaptive optimiser controlling weight updates. |
| | | loss = binary_crossentropy | Loss function for binary classification tasks. |
| | | epochs = 50, batch_size = 32 | Training settings: how long and with what batch size the model trains. |
RF Classifier: Random Forest was chosen for its robustness, interpretability, and performance with structured data. As an ensemble of decision trees, it reduces overfitting by averaging out predictions from many trees, which makes it very useful for complex nonlinear interactions in data. Also, its structure includes a feature importance analysis, which is a valuable tool for identifying the main water quality parameters.
LR Classifier: Logistic Regression is a simple and interpretable baseline model. It is commonly used in the field of binary classification problems, and does very well when the relationship between input features and the target variable is almost linear. Also, it serves as a performance benchmark to which more complex models can be compared.
XGBoost Classifier: Extreme Gradient Boosting was selected for its outstanding performance and speed on structured datasets. Features such as regularisation and parallel computation make it more advanced than typical boosting algorithms, and its ability to model nonlinearity and complex variable interactions makes it a top performer in real-world applications.
Deep Learning MLP: A feed-forward neural network can learn complex, high-level abstractions from data. Although it does not always perform best on small to medium-sized tabular datasets, Deep Learning can outperform other models when scaled up in data and tuning. It also serves to test how well a neural architecture generalises, in contrast to tree-based models and linear classifiers.
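Under the parameters listed in Table 2, the four models can be instantiated roughly as follows. This is a sketch, not the exact study code:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from tensorflow import keras

models = {
    "Random Forest": RandomForestClassifier(
        n_estimators=200, max_depth=10,
        min_samples_split=5, min_samples_leaf=2),
    "Logistic Regression": LogisticRegression(
        solver="liblinear", penalty="l2", C=1.0),
    "XGBoost": XGBClassifier(
        n_estimators=300, learning_rate=0.05, max_depth=6,
        subsample=0.8, colsample_bytree=0.8),
}

# MLP: two ReLU hidden layers and a sigmoid output for the binary label.
mlp = keras.Sequential([
    keras.Input(shape=(5,)),  # five selected features
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
mlp.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
            loss="binary_crossentropy", metrics=["accuracy"])
# Training settings from Table 2: mlp.fit(..., epochs=50, batch_size=32)
```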
As noted, the dataset has a class imbalance – an issue that must be addressed. One of the commonly used methods in this regard is SMOTE (Synthetic Minority Over-sampling Technique) [34], [35]. It generates artificial samples of the minority class by interpolating existing samples in feature space to allow the classifier to learn the decision boundary in a better way. SMOTE was applied to the training set only to avoid the data leakage problem.
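A minimal sketch of this step with the imbalanced-learn library (applied to the training split only) might be:

```python
from imblearn.over_sampling import SMOTE

# Synthesise minority-class ("Potable") samples by interpolating
# between existing minority samples in feature space.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# Both classes are now equally represented; the test set is untouched.
print(y_train_bal.value_counts())
```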
Hyperparameter tuning finds the best set of parameters for an ML model without modifying the algorithm itself, and it is carried out before training. It optimises model performance by enhancing accuracy and generalisation (performance on new data), reducing overfitting or underfitting, and improving training efficiency (speed and memory usage).
Hyperparameter tuning was also conducted for the four models using GridSearchCV and 5-fold stratified cross-validation. The method tried various combinations of parameters to identify the optimal configuration according to the F1-score, improving the generalisation ability and precision of the used models.
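As an illustration, a grid for the Random Forest could be searched as below; the candidate values are assumptions for demonstration, and the other three models follow the same pattern:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Candidate values are illustrative, not the exact grids used in the study.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [6, 10, None],
    "min_samples_split": [2, 5],
}

# 5-fold stratified cross-validation, optimising the F1-score.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                      scoring="f1", cv=cv, n_jobs=-1)
search.fit(X_train_bal, y_train_bal)
print(search.best_params_, search.best_score_)
```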
To assess the external validity of the proposed approach, the findings of this work will be compared with the results in existing literature. For example, the use of ensemble models (RF and XGBoost) will be compared with the performance of these models shown in references such as [10], [18], [24], while the performance of SMOTE in imbalanced data will be assessed by studies such as [34], [35]. Although the dataset differs from other works in geography or water source, performance trends can provide a good understanding and validation for the methodological choices of this research.
After assessing the performance of each model, the best model will be exported (using joblib) and integrated into a user-friendly GUI application to be built with Streamlit. The application allows users to enter water characteristics and receive real-time predictions on potability (i.e., safe or unsafe to drink). More details will be provided in the next section.
This section provides the results of the four classification models that were used to predict water potability based on the five selected features. The performance of each of the four models was evaluated using a stratified train-test split (80/20) and five performance metrics. All simulations were performed using the Python programming language.
The first experiment, conducted on the refined dataset (after handling missing values and keeping only the five top features), compares the performance of the four models: RF, LR, XGBoost, and the DL MLP. It should be noted that the dataset used in this experiment still suffers from class imbalance.
After running the four models, performance metrics were calculated based on the obtained results. Table 3 shows a summary of the metric values for the four models.
Initial model performance comparison
| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Random Forest | 0.6714 | 0.6339 | 0.3708 | 0.4679 | 0.6837 |
| Logistic Regression | 0.6104 | 0 | 0 | 0 | 0.5234 |
| XGBoost | 0.6673 | 0.6111 | 0.4021 | 0.4850 | 0.6636 |
| Deep Learning MLP | 0.6104 | 0 | 0 | 0 | 0.5000 |
Table 3 above highlights the superior performance of RF and XGBoost, which both outperform LR and DL in terms of F1-score and AUC. The poor performance of the LR and DL neural network models is likely due to class imbalance and a lack of model complexity or tuning. Figure 11 shows the F1-score results, while Figure 12 shows the AUC score results of the four models.
F1-Score Results of the Four Models
AUC score results of the four models
From Figure 11, XGBoost slightly outperformed the RF model, while LR and DL have shown poor F1 scores due to class imbalance or underfitting problems. From Figure 12, it can be seen that the RF model has shown the best AUC (which represents the discriminative power), followed closely by XGBoost. LR and DL performed almost randomly (since AUC ≈ 0.5).
Figure 13 highlights the ROC curve comparison for the four tested models, which confirms the findings stated above about the four models. Based on the metrics and considering the importance of correctly classifying potable and non-potable water, Random Forest can be considered the best-performing model for this task, closely followed by XGBoost. Random Forest has the highest AUC (0.684) and a decent F1-score (0.468), while XGBoost has a slightly lower AUC (0.664) but a slightly better F1-score (0.485). Random Forest achieved the highest AUC score, indicating better overall class discrimination ability, and XGBoost achieved the highest F1-score, reflecting a slightly better balance between precision and recall, especially when both classes are important. Logistic Regression and Deep Learning failed completely (precision, recall, and F1-score are all zeros). This likely means the models are predicting only one class (probably “non-potable”) and not learning properly. The models still have room for improvement because the best F1-scores are below 50%. This result suggests either that the features were not fully predictive or that models need better tuning (hyperparameter optimisation) or that more complex features (e.g., interaction terms) are needed.
ROC curve comparison for the four models
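A comparison plot like Figure 13 can be produced along the following lines. This is a sketch: `fitted_models` is a hypothetical mapping from model name to trained estimator, and the Keras MLP would supply probabilities via `model.predict` rather than `predict_proba`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

fig, ax = plt.subplots()
for name, clf in fitted_models.items():
    scores = clf.predict_proba(X_test)[:, 1]  # class-1 probabilities
    fpr, tpr, _ = roc_curve(y_test, scores)
    ax.plot(fpr, tpr, label=name)

# Diagonal reference line: a random classifier (AUC = 0.5).
ax.plot([0, 1], [0, 1], "k--", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
plt.show()
```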
Data imbalance is a known issue in classification applications that can lead to biased models favouring the majority class while hindering performance on the minority class. Among others, SMOTE is used to address this problem and increase the number of samples in the minority class. It is a popular technique for handling class imbalance by creating synthetic examples of the minority class (in this case, “potable” water) rather than simply duplicating them.
In this work, hyperparameter tuning was implemented using GridSearchCV, which is part of the scikit-learn Python library. It is a brute-force search technique that evaluates all possible combinations of specified hyperparameters using cross-validation.
Feature engineering is the process of transforming raw data into useful features for machine learning models. Engineered features can capture complex interactions that are not obvious in the raw dataset: existing features can be selected, and new ones created through transformations. This process can improve a model's performance, efficiency, and accuracy. Four new features, created using ratios, products, and differences, were added, and the performance of the four models was analysed again with the new features incorporated in the dataset. The new features are listed in Table 4.
The new engineered features
| New Feature | Formula | Description |
|---|---|---|
| Hardness_Solids_Ratio | Hardness / Solids | The relation between mineral content and solids |
| Sulfate_Hardness_Ratio | Sulfate / Hardness | Relative concentration of sulfate to hardness |
| pH_Hardness_Product | pH × Hardness | Interaction term: acidity vs. minerals |
| Solids_Sulfate_Diff | Solids − Sulfate | Mass difference between solids and sulfates |
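The four engineered features of Table 4 amount to simple column arithmetic; a pandas sketch (column names assumed to match the Kaggle dataset) is:

```python
# Ratios, a product, and a difference capture pairwise interactions.
df["Hardness_Solids_Ratio"] = df["Hardness"] / df["Solids"]
df["Sulfate_Hardness_Ratio"] = df["Sulfate"] / df["Hardness"]
df["pH_Hardness_Product"] = df["ph"] * df["Hardness"]
df["Solids_Sulfate_Diff"] = df["Solids"] - df["Sulfate"]
```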
After carrying out the above improvements, the models were trained and tested again. Table 5 shows the performance metrics results for the four models. It can be seen from the table that there are noticeable improvements as a result of the SMOTE-based class balancing, hyperparameter tuning, and feature engineering (e.g., hardness/solids ratio, sulfate interactions). The F1-score and AUC for all the models have improved considerably.
Model performance comparison after class balancing, hyperparameter tuning, and feature engineering
| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Random Forest | 0.8634 | 0.8744 | 0.8442 | 0.8533 | 0.9022 |
| Logistic Regression | 0.8333 | 0.8536 | 0.8131 | 0.8342 | 0.8832 |
| XGBoost | 0.8721 | 0.8923 | 0.8435 | 0.8623 | 0.9124 |
| Deep Learning MLP | 0.8132 | 0.8211 | 0.7913 | 0.8012 | 0.8541 |
From the above table, it can be seen that XGBoost and Random Forest models have achieved a more balanced classification between the “safe” and “unsafe” water classes, with much improved F1-score and AUC. Logistic Regression and Deep Learning models have also improved significantly after balancing the dataset and feature engineering, but showed slightly lower recall and F1-score.
Figure 14 shows the F1-score results of the four models after the above adjustments and improvements, while Figure 15 shows the AUC score results of the four models after the same adjustments and improvements.
F1-Score Results of the Four Models after Improvements
AUC Score Results of the Four Models after Improvements
XGBoost has shown the highest F1-score (0.86), followed by Random Forest (0.85). Logistic Regression and Deep Learning performed slightly lower. XGBoost also has the highest AUC (0.91), which confirms its superior ability to distinguish between safe and unsafe water, and Random Forest follows closely with 0.90. Logistic Regression and Deep Learning are slightly behind. Accordingly, the XGBoost model offers the best overall balance between precision and recall. Deep Learning achieved decent performance but did not outperform ensemble methods on this tabular dataset.
In summary, XGBoost is the top-performing model with the highest AUC (0.91) and an excellent F1-score (0.86). Random Forest closely follows with an AUC of 0.90 and strong classification metrics. Logistic Regression provided a strong baseline with good performance after balancing and tuning. The Deep Learning MLP was effective but less optimal than ensemble models in this context. As a result, the XGBoost model was selected for deployment due to its strong performance, interpretability, balance of accuracy, and robustness.
Lastly, an application was created using Python joblib and Streamlit. After the user enters the values of the required chemical properties (the feature values) in the GUI, the application predicts whether the water is "SAFE" or "UNSAFE" to drink. The XGBoost model was saved (exported) using the joblib library, and the application was built with Streamlit, a Python-based framework for rapid app deployment. Figure 16 shows two examples of executing the application. The app can run on mobile devices, PCs, or other platforms, allowing easy visual interaction for users interested in predicting water potability.
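The deployment pipeline can be sketched as follows; the file names, feature order, and widget defaults are illustrative assumptions rather than the exact application code:

```python
# Export once after training: joblib.dump(tuned_xgb, "xgb_potability.joblib")
# app.py -- run with: streamlit run app.py
import joblib
import numpy as np
import streamlit as st

model = joblib.load("xgb_potability.joblib")  # exported XGBoost model

st.title("Water Potability Prediction")

ph = st.number_input("pH", 0.0, 14.0, 7.0)
hardness = st.number_input("Hardness [mg/L]", value=200.0)
solids = st.number_input("Solids [ppm]", value=20000.0)
chloramines = st.number_input("Chloramines [ppm]", value=7.0)
sulfate = st.number_input("Sulfate [ppm]", value=330.0)

if st.button("Predict"):
    # Five selected features plus the four engineered terms, in the
    # order the model was trained on (assumed here).
    x = np.array([[ph, hardness, solids, chloramines, sulfate,
                   hardness / solids, sulfate / hardness,
                   ph * hardness, solids - sulfate]])
    if model.predict(x)[0] == 1:
        st.success("SAFE to drink")
    else:
        st.error("UNSAFE to drink")
```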
Water Potability Prediction App – safe/unsafe water prediction outputs
To ensure the external validity of this work's findings, the performance of the used models is compared to results in existing water quality prediction research. Previous relevant studies have shown that ensemble methods, such as Random Forest and gradient boosting algorithms (e.g., XGBoost), outperform baseline linear models such as logistic regression. In addition, class balancing methods such as SMOTE have been widely verified as effective tools against imbalanced datasets in both environmental and general machine learning applications. Table 6 summarises this work's findings versus those available in recent literature. The evidence presented in Table 6 verifies that the obtained results align with previous studies. Specifically, ensemble models consistently perform well across different datasets and geographic locations, which indicates the robustness and generalisability of the used models. Also, the use of SMOTE and hyperparameter tuning conforms to state-of-the-art practice in the literature, further justifying their role in improving model fairness and prediction performance on imbalanced datasets. While the dataset used in this work differs in size and origin from those described in the literature, the uniform performance patterns across studies support the accuracy and stability of the proposed approach.
Comparison of the proposed work findings with existing literature
| Reference | Task | Models Compared | Findings | Comparison with the Proposed Work |
|---|---|---|---|---|
| Ahmed et al. [16] | Prediction using supervised ML | RF, SVM, ANN | RF outperformed linear models; ensembles captured nonlinear relations | Confirms RF superiority |
| Chen et al. [18] | Comparative analysis on surface water prediction | Multiple ML models | Tree models (RF, boosting) had higher predictive accuracy | Supports RF/XGB results |
| Lu & Ma [22] | Short-term water quality forecasting | Hybrid decision-tree ensembles | Ensemble models outperformed single models | Supports RF/XGB results |
| Bui et al. [24] | Water quality index prediction | Hybrid ML vs. standalone | Hybrid/ensemble ML improved prediction | Emphasises ensemble advantage |
| Shams et al. [10] | Prediction with grid search tuning | RF, XGB, others | Boosting with tuning achieved best performance | Aligns with GridSearchCV improvements |
| Fernandez et al. [34] | Review of SMOTE | ML models with imbalanced datasets | SMOTE improves classifier performance | Validates the use of SMOTE |
| Pradipta et al. [35] | Review of SMOTE in practice | Multiple ML models | SMOTE widely used for class imbalance | Confirms the robustness of the used balancing approach |
While this study demonstrates the applicability of ensemble and boosting techniques, class balancing, and feature engineering to water potability prediction, a few limitations exist. First, the study is limited to one publicly available dataset, which constrains the geographic and temporal range of water quality parameters under consideration. Therefore, the findings may not fully apply to water sources with other characteristics. Second, validation is limited to held-out test data from the same dataset. Although stratified train–test splits and cross-validation minimise overfitting, more comprehensive blind testing across external datasets from other sources or under different environmental conditions is necessary for external validation. Third, interpretability is addressed through feature importance analysis on ensemble models, but advanced interpretability tools (e.g., SHapley Additive exPlanations) were not implemented in this work; these may provide greater insight into model decision-making, especially for neural networks. Finally, real-world deployment concerns such as sensor fusion, data streaming, and hardware efficiency are beyond the scope of this work, although the prototype Streamlit application demonstrates practical feasibility. These limitations are areas for future investigation. Generalisation to multiple datasets, use of advanced explainability techniques, and testing in realistic operating environments would enhance the validity and generalisability of the proposed models.
To show the impact of the improvement after class balancing, hyperparameter tuning, and feature engineering, one can compare the results shown in Table 3 and Table 5. As can be seen from this comparison, all four models improved their performance significantly, with Random Forest and XGBoost improving the most. For example, Random Forest F1-score improved from 0.47 to 0.85, and XGBoost from 0.49 to 0.86, while Logistic Regression and Deep Learning MLP also saw a significant improvement from near-random performance to competitive performance.
It should be noted, though, that after feature engineering the differences between the two top models (Random Forest and XGBoost) are very small, with XGBoost only ~1% ahead in F1-score and AUC. This suggests that feature engineering brings all models to strong, comparable performance. The marginal gain of XGBoost justifies its selection as the deployment model, but Random Forest is a close second and may be preferred where simpler interpretability or reduced computational cost is desirable.
This study presented a comparative analysis of four prominent machine learning models, Random Forest, XGBoost, Logistic Regression, and Deep Learning MLP, on a water potability dataset using key water quality features. LR was chosen for its simplicity and interpretability, RF for its robustness and generalisation, XGBoost for its high-performance gradient boosting, and the Deep Learning MLP model for its generalisation ability. This diverse selection offers meaningful insights into which algorithm is most suitable for water potability classification based on both performance metrics and practical deployment considerations. Initial experiments showed moderate performance, with RF and XGBoost outperforming the other two models. The traditional LR model performed less efficiently in terms of predictive accuracy and robustness; while it remains valuable for its simplicity and interpretability, it proved less effective on multi-feature, imbalanced, or nonlinear data.
To further enhance performance, class balancing, hyperparameter tuning, and feature engineering techniques were applied to create new features capturing the relationships between water properties. After these improvements, both RF and XGBoost achieved significantly better F1-scores and maintained competitive AUC scores. The Deep Learning MLP model proved less effective, likely indicating that it requires further tuning and potentially more data to generalise well. After tuning and feature engineering, the RF and XGBoost models offer strong and balanced performance for water potability prediction. A user-friendly application with a simple GUI was developed to provide real-time prediction of water potability.
This research shows that model selection should be guided by the specific characteristics of the dataset, performance requirements, and available computational resources. Future work could explore even more advanced features, ensemble methods, hybrid models, or collect additional datasets to optimise the performance of the models further and attain enhanced predictions.
The authors appreciate the support provided by the School of Engineering and Computing, American University of Ras Al Khaimah.
[1] "Water quality analysis of River Yamuna using water quality index in the national capital territory, India (2000–2009)," Appl Water Sci, vol. 1, no. 3–4, pp. 147–157, 2011, https://doi.org/10.1007/s13201-011-0011-4
[2] "Assessment of groundwater quality for drinking, irrigation, and industrial purposes using water quality indices and GIS technique in Gorgan aquifer," Desalination Water Treat, vol. 320, p. 100821, 2024, https://doi.org/10.1016/j.dwt.2024.100821
[3] "Water Pollution and its Impact on the Human Health," Journal of Environment and Human, vol. 2, no. 1, pp. 36–46, 2015, https://doi.org/10.15764/EH.2015.01005
[4] "Water quality monitoring: from conventional to emerging technologies," Water Supply, vol. 20, no. 1, pp. 28–45, 2020, https://doi.org/10.2166/ws.2019.144
[5] "Study on traditional water quality assessment methods," Assessment and Management Decisions, vol. 1, no. 1, pp. 41–52, 2024
[6] "Assessment of Water Quality Parameters in Real-Time Environment," SN Comput Sci, vol. 1, no. 6, p. 340, 2020, https://doi.org/10.1007/s42979-020-00368-9
[7] "Smart Water Quality Monitoring System Using IoT Technology," International Journal of Engineering & Technology, vol. 7, no. 4.36, pp. 636–639, 2018, https://doi.org/10.14419/ijet.v7i4.36.24214
[8] "IoT based smart water quality monitoring system," Global Transitions Proceedings, vol. 2, no. 2, pp. 181–186, 2021, https://doi.org/10.1016/j.gltp.2021.08.062
[9] "Real-time water quality monitoring system using Internet of Things," in 2017 International Conference on Computer, Communications and Electronics (Comptelix), 2017
[10] Shams et al., "Water quality prediction using machine learning models based on grid search method," Multimed Tools Appl, vol. 83, no. 12, pp. 35307–35334, 2023, https://doi.org/10.1007/s11042-023-16737-4
[11] "Enhancing water quality prediction: a machine learning approach across diverse water environments," Water Quality Research Journal, vol. 60, no. 1, pp. 298–317, 2025, https://doi.org/10.2166/wqrj.2025.083
[12] "Water quality prediction: Multi objective genetic algorithm coupled artificial neural network based approach," in 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), 2017
[13] "Machine Learning Models for Water Quality Prediction: A Comprehensive Analysis and Uncertainty Assessment in Mirpurkhas, Sindh, Pakistan," Water (Basel), vol. 16, no. 7, p. 941, 2024, https://doi.org/10.3390/w16070941
[14] "Water quality prediction: a data-driven approach exploiting advanced machine learning algorithms with data augmentation," Journal of Water and Climate Change, vol. 15, no. 2, pp. 431–452, 2024, https://doi.org/10.2166/wcc.2023.403
[15] "Predicting Urban Water Quality with Ubiquitous Data – A Data-driven Approach," IEEE Trans Big Data, 2020, https://doi.org/10.1109/TBDATA.2020.2972564
[16] Ahmed et al., "Efficient Water Quality Prediction Using Supervised Machine Learning," Water (Basel), vol. 11, no. 11, p. 2210, 2019, https://doi.org/10.3390/w11112210
[17] "Anomaly Detection for a Water Treatment System Using Unsupervised Machine Learning," in 2017 IEEE International Conference on Data Mining Workshops (ICDMW), 2017
[18] Chen et al., "Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data," Water Res, vol. 171, p. 115454, 2020, https://doi.org/10.1016/j.watres.2019.115454
[19] "Water quality prediction using LSTM with combined normalizer for efficient water management," Desalination Water Treat, vol. 317, p. 100183, 2024, https://doi.org/10.1016/j.dwt.2024.100183
[20] "Water quality prediction method based on LSTM neural network," in 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), 2017
[21] "Prediction of Water Level and Water Quality Using a CNN-LSTM Combined Deep Learning Approach," Water (Basel), vol. 12, no. 12, p. 3399, 2020, https://doi.org/10.3390/w12123399
[22] Lu and Ma, "Hybrid decision tree-based machine learning models for short-term water quality prediction," Chemosphere, vol. 249, p. 126169, 2020, https://doi.org/10.1016/j.chemosphere.2020.126169
[23] "Short-term water quality variable prediction using a hybrid CNN–LSTM deep learning model," Stochastic Environmental Research and Risk Assessment, vol. 34, no. 2, pp. 415–433, 2020, https://doi.org/10.1007/s00477-020-01776-2
[24] Bui et al., "Improving prediction of water quality indices using novel hybrid machine-learning algorithms," Science of The Total Environment, vol. 721, p. 137612, 2020, https://doi.org/10.1016/j.scitotenv.2020.137612
[25] "A review of the application of machine learning in water quality evaluation," Eco-Environment & Health, vol. 1, no. 2, pp. 107–116, 2022, https://doi.org/10.1016/j.eehl.2022.06.001
[26] "Water quality level estimation using IoT sensors and probabilistic machine learning model," Hydrology Research, vol. 55, no. 7, pp. 775–789, 2024, https://doi.org/10.2166/nh.2024.048
[27] "Spatial water quality assessment of Langat River Basin (Malaysia) using environmetric techniques," Environ Monit Assess, vol. 173, no. 1–4, pp. 625–641, 2011, https://doi.org/10.1007/s10661-010-1411-x
[28] "Interpretable Machine Learning Based Quantification of the Impact of Water Quality Indicators on Groundwater Under Multiple Pollution Sources," Water (Basel), vol. 17, no. 6, p. 905, 2025, https://doi.org/10.3390/w17060905
[29] "Water Quality Monitoring Using Machine Learning And IoT: A Review," Chemical and Natural Resources Engineering Journal, vol. 8, no. 2, pp. 32–54, 2024, https://doi.org/10.31436/cnrej.v8i2.100
[30] "Water Potability Dataset," Kaggle, https://www.kaggle.com/datasets/adityakadiwal/water-potability [Accessed: Apr. 10, 2025]
[31] "Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach," Comput Struct Biotechnol J, vol. 21, pp. 4354–4360, 2023, https://doi.org/10.1016/j.csbj.2023.08.033
[32] "Evaluation metrics and statistical tests for machine learning," Sci Rep, vol. 14, no. 1, p. 6086, 2024, https://doi.org/10.1038/s41598-024-56706-x
[33] "Data Mining for Imbalanced Datasets: An Overview," in Data Mining and Knowledge Discovery Handbook
[34] Fernandez et al., "SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary," Journal of Artificial Intelligence Research, vol. 61, pp. 863–905, 2018, https://doi.org/10.1613/jair.1.11192
[35] Pradipta et al., "SMOTE for Handling Imbalanced Data Problem: A Review," in 2021 Sixth International Conference on Informatics and Computing (ICIC), pp. 1–8, 2021, https://doi.org/10.1109/ICIC54025.2021.9632912