Abstract
Keywords
Introduction
Context
The widespread theft of electricity poses a serious threat to the security and dependability of power networks and can result in significant financial losses. Energy theft is estimated to cause non-technical losses (NTLs) worth 96 billion dollars annually worldwide. Developing economies face this challenge most acutely. In India, for example, NTLs from energy theft account for between 30% and 40% of total electrical generation and cost over 17 billion dollars yearly (Mashima and Cárdenas, 2012). This rampant theft prevents utilities from improving power networks and achieving financial stability, creating an urgent need for more effective detection methods.
Manual checks and conventional anomaly detection methods can no longer match the sophisticated techniques used by electricity thieves. Although smart grid deployments bring improved monitoring and detection capabilities, they remain limited in underdeveloped countries. Inadequate infrastructure, scarce computational resources, and the absence of widespread smart meter and communication network deployment are common challenges in these regions.
Electricity consumption in underdeveloped areas varies strongly, shaped by specific consumption behaviors that reflect environmental factors and socioeconomic conditions. Because these patterns differ markedly from those in industrialized regions, standardized detection models transfer poorly. Current detection techniques fail to yield satisfactory results in these settings, making it necessary to develop flexible, resource-conscious models that work reliably with the limited resources in place.
The issue of electricity theft in emerging economies is not just a financial concern but also an operational challenge of considerable size. Theft leads to imbalances in demand and supply, which can place considerable strain on the grid itself; this compromises public safety by increasing the risk of fire and electrical hazards and worsens existing inefficiencies. Current machine learning-based detection methods often exhibit high false positive rates (FPR), which further exacerbates these problems (Messinis and Hatziargyriou, 2018). To improve the accuracy and efficiency of these systems, researchers are increasingly turning to ensemble techniques. Ensemble learning (EL) combines different machine learning algorithms into one system, aiming to improve both accuracy and efficiency and thereby outperform single-algorithm approaches.
Deep learning and various machine learning methods have demonstrated their potential across numerous domains by operating on vast datasets and yielding valuable insights. These technologies enable accurate predictions and early detection of anticipated classes, significantly enhancing decision-making processes and outcomes. Applications include health management (Khadhraoui et al., 2022), disease prediction (Mondol et al., 2022), diabetic retinopathy detection (Guefrachi et al., 2024), tumor diagnosis (Raza et al., 2022), text exploration (Alsammak et al., 2025), precision agriculture (Yu et al., 2019), autonomous vehicles (Nasir et al., 2021), intelligent energy systems, smart building management (Mazhar et al., 2022), and educational advancements (Yousafzai et al., 2021). For instance, in healthcare, machine learning models can predict disease outbreaks and progression (Mondol et al., 2022), facilitating timely medical intervention. In precision agriculture, these models enable precise monitoring and management of crops (Yu et al., 2019), improving yield and reducing waste. Similarly, in autonomous vehicles, machine learning enhances navigation and safety features (Nasir et al., 2021), making transportation more efficient and secure. Overall, the integration of deep learning and machine learning into various fields revolutionizes traditional approaches, offering predictive capabilities that drive innovation and efficiency.
In light of these advancements, there has been growing interest in applying machine learning techniques to improve electricity theft detection (ETD). Recent studies have explored a variety of models, including support vector machines (SVM), neural networks (NN), and ensemble methods like XGBoost, to analyze consumption patterns and identify anomalies. However, these models often struggle with imbalanced datasets, where instances of theft are significantly outnumbered by normal usage patterns, leading to high false positive and false negative rates. To address this limitation, this study explicitly investigates the use of ADASYN (adaptive synthetic sampling) to mitigate class imbalance and enhance the predictive performance of ensemble models in electricity theft detection, with a focus on reducing false positive and false negative rates.
The main objective of this work is to develop a compact deep-learning model suitable for locations without smart grid technology, thereby overcoming these limitations. The proposed model uses direct and indirect feature engineering techniques along with monthly consumer readings to enhance its forecasting capability. These strategies include principal component analysis (PCA) and resampling techniques such as SMOTE and ADASYN (Abubakar et al., 2024; Saqib et al., 2024). The model aims to provide utility suppliers, particularly in developing nations, with a workable and scalable alternative. To this end, computing efficiency is prioritized while improving the accuracy and recall of theft detection. The number of retained principal components was determined from the training-set scree (per-component variance) and cumulative explained-variance curves; the chosen setting of 100 components corresponds to the elbow/plateau region of the curves, indicating diminishing returns from additional components.
ADASYN was selected as the resampling strategy because it adaptively generates more synthetic samples in minority regions that are harder to learn (i.e., near decision boundaries), which aligns with the overlap observed between theft and non-theft profiles. In contrast, SMOTE uses uniform neighborhood sampling and may populate relatively easier minority regions without targeting boundary complexity to the same extent. From a computational standpoint, both approaches incur similar neighborhood-construction costs on the minority class; in our pipeline, using ADASYN with two base learners (RF and LR) maintained a modest training overhead while improving minority-class recall. The study scope and reported results remain unchanged; this clarification documents the rationale for the chosen resampling. For comparison, a pipeline-based machine learning approach to electricity-theft detection (Anwar et al., 2020) reports 89% accuracy.
This work introduces a novel method for detecting energy theft that accounts for the practical and theoretical challenges utility companies confront in low-resource environments. This study uses deep learning and sophisticated feature engineering approaches to improve the accuracy and reliability of ETD systems. It seeks to accomplish this to support the general sustainability and stability of energy distribution networks (Saqib et al., 2024).
The primary motivation behind this study stems from the urgent need to develop effective and scalable solutions for electricity theft detection, particularly in regions where smart grid infrastructure is lacking. Current ETD methods suffer from high FPR and inefficiencies that limit their effectiveness, especially in environments with resource constraints. This research seeks to bridge the gap by leveraging deep learning and advanced feature engineering techniques to create a robust and lightweight model capable of accurate theft detection in non-smart grid settings.
Furthermore, the financial and operational impacts of electricity theft on utility companies highlight the necessity for more reliable and efficient detection systems. By addressing the limitations of existing models and focusing on the practical implementation of advanced machine learning techniques, this study aims to provide utility providers with a tool that not only enhances detection accuracy but also ensures scalability and cost-effectiveness. The ultimate goal is to support the sustainable development of electricity distribution networks, reduce financial losses, and improve overall grid stability and safety.
Scientific contribution
The main contributions of the present article may be summarized as follows:
Structure of the article
The present article is structured into six distinct sections, each addressing a critical aspect of the research. The introduction provides the context and motivation for the study, highlighting the significance of electricity theft detection and the challenges faced by current methods. The literature review offers a comprehensive overview of existing research, focusing on the methodologies and technologies previously employed. The research gap section identifies specific deficiencies in the current state of the art and justifies the need for the proposed approach. The proposed methodology details the innovative ensemble-based model developed in this study, including the integration of advanced feature engineering techniques and ADASYN for handling imbalanced datasets. The results section presents the experimental findings, demonstrating the efficacy and performance of the proposed model through detailed analysis and comparison with existing methods. Finally, the conclusion and future work section recapitulates the main findings, discusses their implications, and outlines potential directions for future research to further enhance electricity theft detection systems.
Literature review
Current research
ETD has been an area of intense research due to the significant financial losses and safety hazards it poses globally. Traditional detection methods often fail to keep up with sophisticated theft techniques, leading researchers to explore innovative approaches such as state estimation, game theory, hybrid methods, and AI-based machine learning algorithms.
Early research focused primarily on state estimation techniques. These approaches relied on external hardware such as specialized metering devices, distribution transformers, or sensors to detect differences in electrical parameters between the local and remote ends, thereby uncovering theft. One example is a modified ammeter at the low-voltage end that shows abnormal readings when pilferage occurs downstream. However, the high implementation cost of these supplementary devices, their operational overhead as standalone units, and the difficulty of harmonizing them with pre-existing systems impede wider adoption (Henriques et al., 2014; Jiang et al., 2014; Salinas and Li, 2015; Yan and Wen, 2021; Yip et al., 2017).
Game theory-based methods offer a different approach by leveraging strategic interactions between dishonest consumers and utility companies to deter theft. The goal is to reach a Nash equilibrium where theft activities are minimized or eliminated. Although these methods are cost-effective, accurately defining functions for each participant in the theft detection process remains challenging. Despite the theoretical potential demonstrated by researchers, practical implementation in real-world scenarios has proven complex (Cárdenas et al., 2012; Zhou et al., 2015).
Hybrid methods combine the strengths of state estimation techniques and AI to enhance detection accuracy. These methods leverage network-oriented measurements like power flow and voltage, alongside machine-learning models trained on historical consumption data. For instance, combining state-based estimation at the substation level with AI techniques at the distribution level has shown significant promise in balancing detection accuracy and cost-effectiveness (Messinis and Hatziargyriou, 2018; Shehzad et al., 2022).
Machine learning algorithms have gained prominence due to their ability to process large datasets and identify complex patterns indicative of theft. Techniques such as SVM, NN, and ensemble methods like XGBoost are extensively used. These algorithms significantly reduce FPR and enhance detection rates (DR) by leveraging extensive historical data. For example, the XGBoost algorithm has demonstrated superior performance in distinguishing between normal and malicious electricity usage, achieving high accuracy and low FPR (Abdallah and Shen, 2015; Faheem et al., 2019; Gul et al., 2020).
Recent research has increasingly focused on addressing dataset imbalances and enhancing feature extraction techniques to further improve ETD models. Synthetic data generation methods like ADASYN have been employed to create balanced datasets, enhancing the accuracy of detection models. Advanced algorithms such as LightGBM and CatBoost have also been explored for their effectiveness in ETD, showing high accuracy and low FPR (Abdallah and Shen, 2015; Ahmad et al., 2018; Amin et al., 2015; Irfan et al., 2022; Muniz et al., 2009).
A notable study by Chen et al. utilized XGBoost for classification, achieving high accuracy rates and low false-positive occurrences by balancing the dataset with artificially created theft instances. Similarly, Kawoosa et al. employed consumer consumption behavior to identify anomalies, demonstrating that XGBoost outperformed other algorithms like LightGBM and CatBoost in terms of accuracy and FPR. These studies underscore the potential of ensemble methods to significantly enhance the performance of ETD models (Leite and Mantovani, 2016; Wang and Chen, 2019).
Further advancements include hybrid methods that integrate data from smart meters with auxiliary databases such as weather conditions, tariff details, and geographic information. This comprehensive approach improves the accuracy of theft detection models by considering external factors influencing electricity consumption. For example, combining smart meter data with auxiliary datasets has been shown to enhance the model's ability to detect anomalous consumption patterns and reduce FPR (Chen and Guestrin, 2016).
Ensemble methods, which combine multiple machine learning algorithms, have shown particular promise in ETD. For instance, combining logistic regression (LR) and random forest (RF) models using stacking techniques has achieved superior results. A study demonstrated that stacking LR with RF models, along with ADASYN for addressing data imbalance, resulted in 94% accuracy, recall, and F1-score in detecting electricity theft. This approach highlights the effectiveness of ensemble methods in handling challenging imbalanced datasets (Punmiya and Choe, 2021; Qu et al., 2021).
Saqib et al. have presented a compact deep-learning model for detecting electricity theft in areas without smart grids, using monthly customer readings and feature engineering techniques like PCA, t-SNE, and resampling methods (RUS, SMOTE, ROS). The model outperforms existing methods in non-smart grid environments. However, deep learning models are resource-intensive, which could limit their deployment in resource-constrained settings, and the reliance on monthly data reduces real-time applicability (Saqib et al., 2024). The present research examines how sophisticated feature engineering and resampling strategies can be employed to enhance the efficiency of ETD models.
Research gap
In light of the advancements and challenges discussed in Sections “Context” and “Current research,” several key research gaps have been identified that justify the contributions outlined in Section “Scientific contribution”:
These research gaps demonstrate the pressing need for innovative solutions that address these deficiencies. The proposed ensemble-based model, incorporating advanced feature engineering techniques and designed for resource-limited environments, aims to bridge these gaps. By enhancing accuracy, scalability, and integration with existing systems, the contributions detailed in Section “Scientific contribution” are justified as significant advancements in the field of electricity theft detection.
Proposed methodology
Given that the dataset contains a large number of independent variables, the proposed model uses an ensemble technique combining RF and LR for electricity theft prediction. Combining RF and LR in an ensemble exploits the strengths of each algorithm and improves overall model performance. RF, an ensemble of decision trees, models nonlinear interactions well and handles high-dimensional data through bagging; the diversity among its trees makes it less prone to overfitting and improves generalization. LR, in contrast, offers a simpler, easily interpretable structure that captures linear separability and works well even when few features are present. RF therefore addresses nonlinearity and complex relationships between predictors and the outcome, while LR handles linearly separable structure effectively. Together they yield a more balanced and accurate predictive model, as each technique contributes its particular strengths. The procedure for the proposed work is shown in Figure 1.

Proposed work based on ensemble methods.
The stacking ensemble uses two complementary base learners—RF and LR—to align with the deployment objective of lightweight, transparent models suitable for resource-limited environments. RF captures nonlinear interactions and reduces overfitting through bagging, whereas LR provides a stable, interpretable linear component. On the ADASYN-balanced data, this pairing yielded ≈94% accuracy, recall, and F1-score, while avoiding the added training and inference cost of introducing additional base learners.
Voting technique
In ensemble learning, the voting technique integrates multiple models to improve overall accuracy by exploiting the advantages of each model. Different models, such as decision trees, SVM, and NN, are trained on the same dataset. At prediction time, every model casts a vote for a class, and the final prediction is decided by those votes. There are two common types of voting: hard voting, where the class with the most votes is chosen, and soft voting, where the class with the highest mean predicted probability is chosen. This approach reduces overfitting and improves reliability by averaging out the errors of diverse models.
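As an illustration, the voting idea can be sketched with scikit-learn; the synthetic data, estimator choices, and hyperparameters below are illustrative assumptions, not the study's exact configuration.

```python
# Hedged sketch of soft voting with an RF + LR pair on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Imbalanced toy data (90% / 10%), loosely echoing the theft-detection setting.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# voting="soft" averages predicted probabilities; voting="hard" counts class votes.
voter = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
)
voter.fit(X_tr, y_tr)
acc = voter.score(X_te, y_te)  # overall accuracy on the held-out 20%
```

Soft voting requires every base estimator to expose `predict_proba`, which both RF and LR do.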
Stacking technique
Stacking, also known as stacked generalization, is an ensemble learning technique in which several models are integrated to enhance overall predictive capacity. Stacking first trains a set of base models on the original dataset; these base models can differ, for example decision trees, SVM, or NN. The base models' predictions are then used as feature inputs to another model, the meta-learner (or stacker), which produces the final prediction. The meta-learner learns how best to combine the base models' outputs, potentially improving both accuracy and generalization relative to any single base model. Stacking is powerful because it aggregates multiple models' views, capturing subtle patterns in the data more effectively.
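A minimal stacking sketch in scikit-learn follows, pairing RF and LR base learners with an LR meta-learner in the spirit of the proposed model; the data and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of stacked generalization: RF + LR bases, LR meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions feed the meta-learner, limiting leakage
)
stack.fit(X_tr, y_tr)
score = stack.score(X_te, y_te)  # held-out accuracy of the stacked ensemble
```

The `cv` argument matters: the meta-learner is trained on cross-validated base predictions rather than in-sample ones, which is what lets stacking generalize beyond its base models.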
Dataset description
The dataset, obtained from Kaggle (2025), includes consumption records for 42,373 customers, tracking their electricity usage over 1034 days between 2014 and 2016. This data originates from the daily electricity consumption records provided by the Power Grid Corporation of China. The corporation has been operational since December 29, 2002, serving over 1.1 billion people across 88% of the country (Petrlik et al., 2022). Each record features a “Flag” column indicating electricity theft, with values of 0 for non-involvement and 1 for involvement. Missing values in each column are filled with mean values. Figure 2 illustrates that approximately 8.5% of the dataset consists of “Theft” (1) instances, while 91.5% are “Not-Theft” (0) instances.
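The mean-imputation step described above can be sketched as follows; the column names (`day_1`, `day_2`) are hypothetical stand-ins for the daily consumption columns.

```python
# Hedged sketch: fill each consumption column's missing values with that
# column's mean, leaving the "Flag" label untouched.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "day_1": [1.2, np.nan, 3.0, 2.4],   # hypothetical daily readings (kWh)
    "day_2": [0.8, 1.1, np.nan, 1.3],
    "Flag":  [0, 1, 0, 0],              # 1 = theft, 0 = non-theft
})
cols = ["day_1", "day_2"]
df[cols] = df[cols].fillna(df[cols].mean())  # column-wise mean imputation
```

After the fill, the missing `day_1` entry becomes the mean of 1.2, 3.0, and 2.4, i.e. 2.2.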

Class imbalance in the dataset with percentage distribution.
All columns are depicted in Table 1, with “N/A” representing missing values. In total, the dataset contains 1034 daily-reading columns. To provide an overview, the first five and last five records are displayed in Table 1.
Missing values in features.
After filling missing values using the mean function, the consumption data of all customers involved in electricity theft and not involved in electricity theft is depicted in Figure 3 for the period from 2014 to 2016. For interpretability, monthly mean electricity consumption is normalized to the 2014 to 2016 grand mean, set to 100%; values above 100% indicate months with above-average consumption, and values below 100% indicate months with below-average consumption.
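The normalization used for Figure 3 can be sketched numerically; the monthly values below are illustrative, not taken from the dataset.

```python
# Hedged sketch: express monthly mean consumption as a percentage of the
# grand mean over the whole period, so 100% marks average consumption.
import numpy as np

monthly_mean = np.array([8.0, 10.0, 12.0, 10.0])  # hypothetical monthly kWh means
grand_mean = monthly_mean.mean()                   # grand mean over the period
normalized = 100.0 * monthly_mean / grand_mean     # percent of grand mean
# Months above 100% consumed more than average; below 100%, less.
```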

Normalized monthly mean electricity consumption (% of 2014–2016 grand mean) for theft and non-theft classes (2014–2016).
Figure 3 clearly highlights the class imbalance, with theft cases significantly fewer than non-theft. Such imbalance can bias the model toward the majority class, reducing its ability to detect theft accurately. As observed in Zhou et al. (2025), balancing the classes improves model performance by allowing it to learn from both classes more effectively, resulting in better precision, recall, and overall reliability. The dataset used in this study, along with the predicted output values generated by the optimal ensemble-stacking model, is available from the corresponding author upon reasonable request to facilitate reproducibility and further research.
Recent studies have demonstrated effective approaches to class imbalance, feature engineering, and load identification that are highly relevant to electricity theft detection. For instance, the HIDIM framework addresses hierarchical dependencies and imbalance in network intrusion detection through protocol-aware embeddings and advanced oversampling techniques (Zhou et al., 2025). In the domain of non-intrusive appliance load monitoring, multiscale spatio-temporal feature fusion has proven successful for handling variable load patterns across industrial sectors (Lin et al., 2024a). Additionally, CatBoost classifiers optimized with entropy-based feature sets and enhanced via Borderline-SMOTE have shown robust performance in identifying imbalanced industrial load data (Lin et al., 2024b). These studies support the design of our own ensemble approach that combines dimensionality reduction, resampling, and stacking methods to handle data imbalance and improve classification accuracy in theft detection scenarios.
Choosing ADASYN (adaptive synthetic sampling)
ADASYN addresses class imbalance by generating synthetic samples for minority classes based on their proximity to the dominant class (Fernández et al., 2018; He et al., 2008). This adaptive approach focuses on harder-to-learn instances, thereby enhancing learning in complex cases, improving model performance, and reducing errors caused by class dominance.
Impact of class imbalance on model performance
Class imbalance arises when a dataset's classes differ markedly in frequency, and it poses serious challenges for machine learning models. On heavily imbalanced data, models tend to predict the dominant class, yielding poor results on minority-class observations. In essential classification problems such as fraud detection or electricity theft forecasting, data skewed against the minority class leaves large numbers of minority samples undetected. This bias degrades the model's ability to identify minority-class occurrences, reducing overall accuracy, reliability, and prediction fairness.
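This majority-class bias can be demonstrated with a trivial baseline; the 8.5% theft rate mirrors the dataset, but the data themselves are synthetic.

```python
# Hedged illustration: on a ~91.5%/8.5% split, a classifier that always
# predicts "Not-Theft" scores high accuracy yet detects no theft at all.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y = (rng.random(10000) < 0.085).astype(int)  # ~8.5% theft, as in the dataset
X = rng.random((10000, 5))                   # features are irrelevant here

majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)
acc = accuracy_score(y, pred)         # high, dominated by the majority class
theft_recall = recall_score(y, pred)  # zero: no theft case is ever flagged
```

This is why accuracy alone is a misleading metric here, and why minority-class recall is reported throughout the paper.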
As shown in Figure 4, the use of ADASYN has successfully addressed the class imbalance issue, resulting in a more balanced dataset with an equal representation of classes.

Actual and ADASYN class distribution.
How ADASYN balances the dataset
The ADASYN method handles class imbalance efficiently by generating synthetic samples from minority-class data. Unlike plain random oversampling, ADASYN produces new data points that target under-learned regions near the decision boundary, where minority examples are scarce. Because the synthesized samples are placed where the model normally struggles to separate the classes, they improve the classification of minority-class instances. By enhancing minority-class representation in this way, ADASYN lets the model learn more effectively from infrequent examples while reducing majority-class bias.
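This adaptive mechanism can be sketched as follows, assuming the standard ADASYN formulation (He et al., 2008); this is an illustrative NumPy implementation, not the production imbalanced-learn code.

```python
# Hedged ADASYN sketch: minority points with more majority-class neighbors
# receive proportionally more synthetic samples (harder regions get more help).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_sketch(X, y, k=5, seed=0):
    rng = np.random.default_rng(seed)
    X_min, X_maj = X[y == 1], X[y == 0]
    G = len(X_maj) - len(X_min)                 # total synthetic samples to create
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)           # neighbors in the full dataset
    # r_i: share of majority-class points among each minority point's neighbors
    r = np.array([(y[i[1:]] == 0).mean() for i in idx])
    r = r / r.sum() if r.sum() > 0 else np.full(len(X_min), 1.0 / len(X_min))
    g = np.rint(r * G).astype(int)              # per-point synthetic counts
    nn_min = NearestNeighbors(n_neighbors=min(k, len(X_min) - 1) + 1).fit(X_min)
    _, midx = nn_min.kneighbors(X_min)          # minority-only neighbors
    synth = []
    for i, n_new in enumerate(g):
        for _ in range(n_new):
            j = rng.choice(midx[i][1:])         # random minority neighbor
            lam = rng.random()                  # interpolation factor in [0, 1)
            synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synth) if synth else np.empty((0, X.shape[1]))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),   # majority cluster
               rng.normal(1.5, 1.0, (30, 2))])   # overlapping minority cluster
y = np.array([0] * 200 + [1] * 30)
X_new = adasyn_sketch(X, y)  # new minority points, roughly balancing the classes
```

In practice one would use `imblearn.over_sampling.ADASYN` rather than this sketch; the sketch only shows why boundary-heavy minority points receive more synthetic neighbors.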
Advantages of ADASYN
ADASYN offers several advantages beyond balancing the class distribution. Most importantly, it prevents models from focusing mainly on majority instances, making them better at recognizing minority-class cases. Because its synthetic sampling produces new data points that resemble real minority instances, the resulting model validates and forecasts the minority class more accurately. The technique is particularly beneficial for imbalanced datasets, as it enables better detection of minority-class patterns.
Disadvantages of ADASYN
Despite these benefits, ADASYN also has drawbacks. The main one is the risk of overfitting: because the synthetic data stay close to the available minority-class examples, the model may become overly specific and generalize poorly to unseen observations. The synthetic samples may also fail to reflect the true diversity of the class if the density estimation misrepresents the minority-class distribution. Finally, synthesizing samples for large datasets requires substantial computation time, which is a disadvantage in practice.
Mitigating the disadvantages in this study
To mitigate these disadvantages, this study employed several techniques. Overfitting was minimized by carefully tuning the parameters of ADASYN to prevent the generation of an excessive number of synthetic samples in areas that are already well-represented in the minority class. The quality of the synthetic samples was also carefully monitored to ensure that they closely resembled real instances of the minority class, reducing the potential for introducing noise into the dataset. Additionally, the computational cost was managed by applying ADASYN selectively and integrating it with other resampling techniques, ensuring that the balance between class distribution and model performance was optimized without overwhelming the computational resources. These efforts improved the predictions by enabling the model to learn more effectively from a balanced dataset while avoiding the common pitfalls associated with synthetic data generation.
PCA configuration and impact on model performance
In this study, PCA was employed as a dimensionality reduction technique to improve model interpretability and reduce noise in the high-dimensional feature space. After analyzing the explained variance ratio, the top 100 components were retained, which together accounted for over 95% of the total variance in the dataset. This transformation not only reduced computational overhead but also helped mitigate multicollinearity among features. During evaluation, models trained on PCA-transformed features demonstrated a marginal improvement in precision and recall, particularly on the balanced datasets, indicating enhanced learning capability with more informative and compact representations.
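The component-selection criterion can be sketched as follows; the data are synthetic, and the 95% variance threshold stands in for the elbow/cumulative-variance analysis used in the study.

```python
# Hedged PCA sketch: fit on training data only, keep enough components to
# explain at least 95% of the variance, then inspect the cumulative curve.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 200))  # stand-in for the 1034-day profiles

pca = PCA(n_components=0.95)           # float in (0, 1): variance threshold
X_reduced = pca.fit_transform(X_train)
cum_var = np.cumsum(pca.explained_variance_ratio_)
# pca.n_components_ is the number of components actually retained.
```

With real consumption data the curve typically plateaus much earlier than with this noise example, which is what makes a fixed setting such as 100 components reasonable.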
Results and discussions
The results are obtained by applying the proposed method to both the imbalanced and the balanced datasets. Model performance is then assessed with metrics such as precision, recall, F1-score, mean absolute error (MAE), and the ROC curve. This detailed comparison provides deep insight into the advantages and disadvantages of the proposed approach, in particular how balancing the data influences the model's performance.
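These metrics can be computed with scikit-learn as follows; the labels and scores below are illustrative, not the study's predictions.

```python
# Hedged sketch of the evaluation metrics used in this section.
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             mean_absolute_error, roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]            # 1 = theft
y_pred  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]            # hard class predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.8, 0.9, 0.7, 0.4]  # probabilities

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
recall    = recall_score(y_true, y_pred)     # TP / (TP + FN) = 3/4
f1        = f1_score(y_true, y_pred)         # harmonic mean of the two
mae       = mean_absolute_error(y_true, y_pred)  # 2 errors / 10 samples
auc       = roc_auc_score(y_true, y_score)   # area under the ROC curve
```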
Results based on imbalanced data on voting and stacking techniques
Around 80% of the imbalanced data in the training set was utilized to train the model, incorporating both the dependent variable (the outcome identifier) and the independent variables (the input factors). Another 20% subset of the dataset was reserved as the test set for model evaluation. These features are considered during the implementation with SKlearn. A detailed description of the formats for the training and testing datasets is provided in Table 1.
Using the voting method on imbalanced data, the proposed work achieved an accuracy of 91%. However, the precision, recall, and F1-score for the theft class were only 63%, 11%, and 19%, respectively, as shown in Table 2. This disparity highlights a significant issue that must be addressed: only 85 records were correctly identified as theft-related, while 673 were incorrectly predicted.
Dimensions of training and test sets from imbalanced dataset.
Using the stacking method on imbalanced data, the proposed work achieved an accuracy of 92%. However, the precision, recall, and F1-score for the theft class were only 66%, 14%, and 23%, respectively, as shown in Table 3. This disparity highlights a significant issue that must be addressed: only 256 records were correctly identified as theft-related, while 1592 were incorrectly predicted.
Measures from voting technique using imbalanced data.
Results based on balanced data on voting and stacking techniques
The ADASYN approach was first used to balance the dataset, as described in Section “Choosing ADASYN (adaptive synthetic sampling).” Normalization rescales all values into the range between zero and one to prepare the dataset for analysis; this homogeneity makes the dataset easier for the model to learn from. The dataset is then standardized with StandardScaler, which removes the mean and scales each feature to unit variance. Due to the massive size of the dataset (62,356 rows with 1034 columns in the training set and 15,590 rows with 1034 columns in the test set), it is not possible to display it in full. The dataset was divided into two parts, allocating 80% for training and 20% for testing. Throughout, the data are checked to ensure a mean of zero and a standard deviation of one. Samples of the first five records processed with StandardScaler for training are displayed in Table 4, and samples of the test data in Table 5.
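The split-then-standardize order can be sketched as follows; the data and shapes are synthetic and illustrative.

```python
# Hedged sketch: split 80/20 first, then fit StandardScaler on the training
# split only, so test-set statistics never leak into preprocessing.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))  # stand-in feature matrix
y = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)   # mean/variance from training data only
X_tr_s = scaler.transform(X_tr)       # ~zero mean, unit variance per column
X_te_s = scaler.transform(X_te)       # same transform applied to the test set
```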
Sample of training set from 80% balanced data.
Sample of test set from 20% balanced data.
Voting results
The training data (Table 4) was used to train an ensemble model with the voting technique. The test data (Table 5) was then evaluated on the trained model, achieving 93% accuracy along with significantly high precision, recall, and F1-score of 91%, 96%, and 93%, respectively. The model correctly predicted 7457 records and incorrectly predicted only 336 records in the theft class. Additional performance measures are presented in Table 7.
Measures from stacking technique using imbalanced data.
Measures from voting technique using balanced data.
Figure 5 shows that the model produces far fewer false positives and false negatives for the theft and non-theft classes than true positives and true negatives. This demonstrates a high level of efficiency in distinguishing electricity theft from normal consumption. The visualization emphasizes the large number of true positives and true negatives relative to the small number of false predictions. This performance matters in practice: errors are kept to a minimum, and correct identifications support decision-making and resource allocation in preventing electricity theft. Table 7 shows the balanced-data results using the voting technique.

Theft and not-theft classes using voting method on balanced data.
Stacking results
The stacking ensemble model was evaluated on the balanced dataset described in Section “Results based on balanced data on voting and stacking techniques.” It achieved an overall accuracy of 94%, with precision, recall, and F1-score all reaching 94% or higher, as shown in Table 8.
Measures from stacking technique using balanced data.
As illustrated in Figure 6, the model demonstrates a high level of discrimination between theft and non-theft instances, with a notable reduction in false predictions. This supports the stacking ensemble's robustness in delivering balanced and reliable classification performance in electricity theft detection.

Theft and not-theft classes using stacking method on balanced data.
Comparative analysis
Figure 7 depicts the confusion matrices for the voting and stacking ensemble methods on both balanced and imbalanced datasets. The results make clear that stacking on balanced data outperforms every other setting. In particular, its confusion matrix shows more correctly predicted true-positive and true-negative instances than the other configurations. This comparison demonstrates the importance of balancing the dataset before applying the stacking technique, as balancing improves the precision and reliability of the model's predictions. The visualization also shows that data preprocessing plays a crucial role in the model's outcome; stacking on balanced data clearly gave the best results.

Confusion matrix for voting and stacking on imbalanced and balanced data.
Error analysis (stacking on balanced data). For the best-performing configuration, the confusion-matrix counts are TN = 18,398; FP = 1015; FN = 1171; TP = 18,389 (Table 9). These counts align with the class-wise metrics in Table 8 (theft precision = 0.95, recall = 0.94, F1 = 0.94), and indicate a balanced error profile: false positives (unnecessary inspections) and false negatives (missed theft) are of comparable magnitude. In practical deployment, this balance is advantageous because operating thresholds can be selected according to local priorities (e.g. reducing FN to limit revenue loss versus reducing FP to limit field-visit costs) without materially degrading overall performance.
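As a sanity check, the class-wise metrics in Table 8 can be reproduced directly from these confusion-matrix counts. The short script below (plain Python, no dependencies) applies the standard definitions.

```python
# Class-wise metrics recomputed from the stacking-on-balanced-data
# confusion-matrix counts reported in Table 9.
TN, FP, FN, TP = 18_398, 1_015, 1_171, 18_389
total = TN + FP + FN + TP

accuracy  = (TP + TN) / total
precision = TP / (TP + FP)            # theft class
recall    = TP / (TP + FN)            # theft class
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# → accuracy=0.9439 precision=0.95 recall=0.94 f1=0.94
```

The values match the reported theft-class precision of 0.95, recall of 0.94, and F1 of 0.94.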
True and false prediction voting and stacking.
Precision, recall and F1-score for all methods
Figure 8 illustrates the precision, recall, and F1-score for both voting and stacking techniques applied to balanced and imbalanced data. The graphs clearly show that stacking on balanced data achieves the highest values across all metrics, indicating that, in terms of precision, recall, and F1-score, stacking on balanced data provides the best performance. This underscores the advantage of using the stacking method with balanced datasets for optimal predictive accuracy and reliability.

Precision, recall and F1-score for all models.
ROC curve
The receiver operating characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings. The area under the curve (AUC) quantifies this performance—an AUC of 1.0 represents a perfect classifier, while 0.5 indicates no better than random guessing (Fawcett, 2006).
ROC analysis is especially useful in imbalanced classification tasks, as it evaluates the model's ability to discriminate between classes across different thresholds. As shown in Figure 9, both the voting and stacking models performed well when trained on balanced data, achieving AUC scores of 0.93 and 0.94, respectively. These results highlight the stacking method's superior classification capability in electricity theft detection.
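For illustration, AUC can be computed with scikit-learn's `roc_auc_score`. The four-sample scores below are a hypothetical toy example, not values from this study.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: four samples with hypothetical predicted probabilities.
y_true  = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]

auc = roc_auc_score(y_true, y_score)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc)  # 0.75: 3 of the 4 (negative, positive) pairs are ranked correctly
```

`roc_curve` returns the false-positive and true-positive rates used to plot curves such as those in Figure 9.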

ROC curves for all methods.
Mean absolute error
MAE is a widely used metric for evaluating regression and probabilistic classification models. It calculates the average of the absolute differences between predicted and actual values, making it a simple yet informative measure of predictive accuracy (Willmott and Matsuura, 2005). Unlike RMSE, MAE does not disproportionately penalize large errors, making it particularly useful when outliers are not of primary concern.
As depicted in Figure 10, the stacking model on balanced data achieved the lowest MAE (0.0561) among all configurations, followed by voting on balanced data (0.0674). In contrast, stacking and voting on imbalanced data yielded higher MAE values (0.0815 and 0.0852, respectively). These findings further validate the effectiveness of balancing and ensemble stacking in minimizing prediction errors.
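Because the models emit hard 0/1 labels, the MAE coincides with the misclassification rate, which is consistent with the identical Hamming loss and MSE reported later. The sketch below, using the Table 9 counts for stacking on balanced data, makes this explicit.

```python
# With hard 0/1 predictions, MAE reduces to the misclassification rate:
# every error contributes |y - y_hat| = 1, every correct prediction 0.
# Counts below are the stacking-on-balanced-data values from Table 9.
TN, FP, FN, TP = 18_398, 1_015, 1_171, 18_389

mae = (FP + FN) / (TN + FP + FN + TP)
print(round(mae, 4))  # 0.0561, matching the reported MAE
```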

MAE representation for all methods.
True and false prediction voting and stacking
Table 9 presents the confusion-matrix values: True Not-Theft, False Theft, False Not-Theft, and True Theft for four different models used in electricity theft detection: Voting Imbalanced, Voting Balanced, Stacking Imbalanced, and Stacking Balanced.

In the case of the Voting Imbalanced model, the classifier performs well in correctly identifying non-theft cases, achieving 7662 True Not-Theft predictions while generating only 55 False Theft predictions. However, it misses 673 actual theft cases (False Not-Theft) and correctly identifies only 85 theft cases (True Theft), indicating a bias toward the majority class due to data imbalance.

In contrast, the Voting Balanced model significantly improves theft detection, yielding 7457 True Theft predictions while reducing False Not-Theft cases to 336. However, this improvement comes with a trade-off, as the number of False Theft cases increases to 715 and True Not-Theft cases decrease to 7082. This suggests a better balance but slightly lower performance in correctly identifying non-theft cases.

The Stacking Imbalanced model, similar to the Voting Imbalanced one, is heavily skewed toward predicting non-theft. It shows a very high number of True Not-Theft cases at 19,204 and a low number of True Theft detections at 256. It also records 1592 missed theft cases and only 134 false alarms, reinforcing that imbalance limits the model's ability to effectively capture theft events.

The Stacking Balanced model demonstrates strong performance on both classes, with 18,398 True Not-Theft predictions and 18,389 True Theft detections. Although it incurs 1015 False Theft and 1171 False Not-Theft predictions, this model exhibits a balanced and effective capability in identifying both theft and non-theft cases. Overall, balanced models, especially Stacking Balanced, show significant improvement in theft detection accuracy compared to their imbalanced counterparts.
Decision
Figure 11 visually compares the actual and predicted values of electricity theft (0 or 1) for the first 100 data points. The figure illustrates how well the model aligns with real instances of theft detection using two distinct markers: yellow circles for actual values and red crosses for predicted ones. While the predictions generally follow the actual pattern, a few mismatches are observable; these correspond to false positives and false negatives. Despite a labeling error on the

First 100 actual and predicted values of electricity theft 1 or 0.
Based on a comprehensive evaluation using performance metrics such as precision, recall, F1-score, ROC curve analysis, and MAE, it is concluded that the ensemble-stacking model demonstrates superior performance and robustness. Therefore, it is identified as the most appropriate and reliable approach for the proposed electricity theft detection framework.
Statistical results on ensemble-stacking
As the ensemble-stacking model yielded the best performance among all evaluated models, a detailed statistical analysis was conducted to further validate its effectiveness. The results, presented in Table 10, demonstrate the model's robust classification capability and generalization power. The Matthews correlation coefficient (MCC) and Cohen's Kappa, with values of 0.8879 and 0.8878, respectively, indicate a strong agreement between predicted and actual classes, even in the presence of class imbalance. A high ROC AUC score of 0.9439 reflects excellent discriminative power, confirming the model's ability to distinguish between theft and non-theft cases effectively.
Statistical results on ensemble-stacking.
Moreover, the Hamming loss and mean squared error (MSE) are both low (0.0561), suggesting that the proportion of incorrect predictions is minimal and that the predicted values are close to the actual ones. The
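These statistics can likewise be verified from the confusion-matrix counts in Table 9 for stacking on balanced data. The snippet below applies the standard MCC and Cohen's Kappa formulas; no external libraries are needed.

```python
import math

# Table 9 counts for stacking on balanced data.
TN, FP, FN, TP = 18_398, 1_015, 1_171, 18_389
N = TN + FP + FN + TP

# Matthews correlation coefficient.
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

# Cohen's Kappa: observed agreement corrected for chance agreement.
p_observed = (TP + TN) / N
p_expected = ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / N**2
kappa = (p_observed - p_expected) / (1 - p_expected)

# Hamming loss; for hard 0/1 predictions it equals the MSE.
hamming = (FP + FN) / N

print(mcc, kappa, hamming)
# ≈ 0.8879, 0.8878, 0.0561 — matching the Table 10 values
```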
Radar chart for ensemble-stacking
The radar chart presented in Figure 12 offers a concise visual summary of the ensemble-stacking model's performance across five key evaluation metrics: accuracy, precision, recall, F1-score, and ROC-AUC. To better illustrate the relative differences between these metrics, the chart uses a restricted scale range (0.85–1.00), which avoids visual flattening and enhances interpretability.

Radar chart illustrating the performance of the ensemble-stacking model across five key metrics. The adjusted scale (0.85–1.00) emphasizes the relative strengths and differences between accuracy, precision, recall, F1-score, and ROC-AUC. All metrics show consistently high performance, with precision being the highest at 0.95.
The resulting profile reveals a highly balanced performance across all dimensions. The model achieves strong precision (0.95), recall (0.94), and F1-score (0.94), indicating its ability to minimize both false positives and false negatives. The ROC-AUC value of 0.9439 further confirms the model's excellent discrimination between classes, particularly in the presence of imbalanced data.
By highlighting the consistent strength of the model across all evaluated metrics, this radar chart reinforces the robustness and reliability of the proposed ensemble-stacking approach. While additional comparative studies may further validate its superiority in specific contexts, the visual evidence strongly supports its effectiveness as a high-performing classification model for electricity theft detection.
Computational efficiency and deployment: The stacked model uses a single RF and a LR as base learners with a linear meta-learner. Inference cost is dominated by RF tree traversals plus a small number of linear operations from LR and the meta-learner. Memory scales with the number of trees and retained PCA features; PCA reduces input dimensionality before ensemble training and inference. This design avoids deep architectures and large parameter counts, supporting low-latency prediction on commodity CPUs in resource-constrained settings without smart-grid infrastructure. No additional experiments were required; this paragraph summarizes the algorithmic costs and implementation choices underlying the reported results.
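A minimal sketch of this architecture with scikit-learn is shown below. The synthetic dataset, PCA component count, and hyperparameters are illustrative placeholders, not the paper's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic balanced data standing in for the preprocessed consumption matrix.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

# PCA for dimensionality reduction, then an RF + LR base layer with a
# logistic-regression meta-learner, mirroring the design described above.
model = make_pipeline(
    PCA(n_components=20),
    StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
    ),
)
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(f"held-out accuracy: {acc:.3f}")
```

The whole pipeline runs in seconds on a commodity CPU, consistent with the low-latency deployment argument above.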
Comparison with similar studies
Based on a comprehensive comparative analysis of precision, recall, F1-score, accuracy, ROC curve, and MAE values, it is concluded that the ensemble-stacking technique is the best model for the proposed work on electricity theft detection. This model not only achieved superior results compared to the Voting-based method but also outperformed all models from the existing literature, as summarized in Table 11. With a consistent performance of 94% across all key classification metrics, the ensemble-stacking method demonstrates robustness and high predictive accuracy, particularly in handling imbalanced datasets.
Comparison with similar studies.
The comparative baselines referenced in Table 11 include pipeline-based ML (Anwar et al., 2020), a hybrid approach (Haq et al., 2021), LightGBM (Oprea and Bâra, 2021), RF with SMOTE and SVM with SMOTE (Petrlik et al., 2022), MLP–GRU (Iftikhar et al., 2024), and ensemble + prototype learning (Sun et al., 2023).
Conclusion
Recapitulation
The present work introduces an ensemble-based model that combines LR and RF algorithms, using voting and stacking techniques, to tackle electricity theft detection with improved accuracy and reliability. ADASYN is incorporated to deal effectively with imbalanced datasets, a common issue in electricity theft detection, so that the stacked ensemble, coupled with careful feature engineering, does not compromise on performance. On the balanced dataset, the stacking ensemble achieved an overall accuracy of 0.94 with class-wise precision/recall/F1 of 0.95/0.94/0.94 for the theft class and 0.94/0.95/0.94 for the non-theft class. Receiver operating characteristics further indicate strong discrimination (AUC = 0.94) for stacking, exceeding the voting ensemble on balanced data (AUC = 0.93). Confusion-matrix results for stacking on balanced data show 18,389 true-theft and 18,398 true-non-theft predictions, with 1015 false-theft and 1171 false-non-theft. In contrast, training on imbalanced data produced very low recall for the theft class (0.11 with voting; 0.14 with stacking), underscoring the necessity of class balancing. Across error profiles, stacking on balanced data also yielded the lowest MAE (0.0561), compared with 0.0674 for voting on balanced data and ≥0.0815 for the imbalanced settings. These results, 94% accuracy, recall, and F1-score from a lightweight and computationally efficient design, favor deployment in resource-constrained settings and offer utility providers, particularly in developing countries, a practical and scalable solution. This research enhances detection ability by responding to the established research gaps and also contributes to the stability and long-term sustainability of electricity distribution networks.
The results underscore the need for continual innovation and rigorous validation in the sector to deliver effective and reliable systems for detecting electricity theft.
Future work
While the proposed ensemble-based model for electricity theft detection has demonstrated significant improvements in accuracy and reliability, several avenues for future research remain. Firstly, there is a need to explore the integration of additional machine learning algorithms into the ensemble framework to further enhance detection capabilities. Advanced algorithms such as deep neural networks and gradient boosting methods could be incorporated to capture more complex patterns in the data. Secondly, expanding the feature set to include external factors such as weather conditions, economic indicators, and social factors could provide a more holistic view of electricity consumption patterns and improve the model's predictive power. Additionally, future work should focus on real-world validation of the model in diverse geographical and operational contexts to ensure its robustness and scalability. This includes field testing in different regions with varying levels of infrastructure and consumption behaviors. Finally, the development of user-friendly interfaces and deployment strategies for utility companies, particularly in resource-constrained environments, will be crucial for practical implementation and widespread adoption of the proposed solution.
The proposed detection system shows promising results on its dedicated electricity theft dataset, but its ability to handle diverse electricity consumption scenarios from different geographic locations and economic settings still requires validation. Electricity consumption behavior differs extensively from region to region because of unique infrastructure setups, consumer behaviors, and economic conditions. Regions with better-developed metering systems and stronger regulatory structures exhibit theft patterns distinct from those of areas with basic infrastructure. Model performance therefore depends on its capability to handle such diverse operational conditions. Future testing across multiple regions is needed to validate the model's operation under varying consumer activity, and the assessment of ensemble performance with ADASYN for class imbalance should span diverse environments to establish its applicability to real-world electricity theft detection.
