Abstract
Keywords
Introduction
The world is experiencing growing urbanization, and major cities have become drivers of economic growth. Consequently, urban areas are transforming into smart cities, where decisions are made by self-organized and autonomous networks. Several existing smart infrastructures have been built from multiple integrated wireless sensor networks (WSNs), in which sensors are responsible for acquiring data from real-world environments. One critical challenge in this domain is the limited energy available to each sensor, since batteries are the main source of power. This problem directly affects the lifetime of such integrated networks and, thus, their sustainability.
Figure 1 shows an example of multiple integrated and heterogeneous WSNs in a smart environment, in which a cloud infrastructure controls the integration. In this platform, real-time processing is an essential operation in which cloud services respond immediately to requests. However, integration among multiple networks adds complexity to managing power consumption (load balancing). The problem arises when sensor nodes have different energy utilities and power consumption strategies. For instance, the system might fail to provide its desired functionality if some sensors run out of power. Therefore, reliability and availability are significant attributes in designing such systems, in which sensor nodes are distributed over a large area. 1

Integrated and heterogeneous WSN in cloud infrastructure.
This article investigates the application of a modified version of multinomial logistic regression (MLR) that uses the spatiotemporal characteristics of a given environment to develop a dynamic load balancing strategy. A prediction model estimates the best set of sensors that could serve as substitutes, extending the lifetime of such networks. MLR models nominal variables, with the log-odds of the outcomes expressed as a linear combination of the predictor variables.
Since the problem of distributing energy load is a multiclass one (as it depends on the type of sensor, time, location, and context), generalizing the logistic regression model to predict discrete outcomes (sensor IDs) would enhance the development of the load balancing strategy. Given a set of independent variables
The contribution of this research is to redefine the feature set
The rest of this article is organized as follows: section “Literature review” discusses the related research and compares it with the proposed approach. Section “Background” provides a formal description of the preliminary definitions and equations that form the baseline for the proposed methodology. Section “Space parameterization” describes our formal contribution in incorporating the spatial distribution of sensors into the MLR model. Section “Dynamic load balancing” describes the proposed algorithm for the dynamic load balancing strategy. Section “Modeling the threshold value” discusses our methodology for computing the threshold value in each round. Section “Experiments and result” describes and discusses the experiments that have been conducted to evaluate the output in terms of error rate and energy consumption. Section “Conclusion and future work” summarizes the conclusions and the suggested future enhancements of this research.
Literature review
There are many energy-aware approaches proposed for solving the problem of limited energy in WSNs. According to Soua and Minet, 2 prediction-based energy-efficient techniques are classified into three categories based on (1) data level such as reducing data production, transmission, and processing; (2) routing protocols by choosing shortest paths; and (3) overhead reduction. Since our contribution focuses on the data-level techniques, the rest of this section illustrates and compares only data-level techniques.
Recent research has focused on developing optimal communication techniques at the data level to minimize the number of messages exchanged, which reduces the energy consumption rate.3–7 The consumption rate depends heavily on two sources: computation and transmission. Specifically, the energy required for computation is the amount of electrical power that a node (sensor or central node) needs to execute a set of instructions. 8 Although computation is an important source, previous research on similar applications has shown that this part can be ignored, because the energy required to transmit data (a function of packet size and distance among nodes) is significantly larger than the energy required for computation. 9 Our proposed methodology adopts this assumption and extends the application of such techniques to resolve the problem in a heterogeneous environment.
Furthermore, a few researchers10–13 have proposed prediction techniques that reduce data transmission and, in turn, energy consumption. In Lu et al. 14 and Wei et al., 15 data aggregation has been applied as a basis for predicting data in energy-efficient frameworks. Moreover, Kerasiotis et al. 16 have applied a prediction model to justify the battery consumption in residual energy. Unfortunately, these techniques were not designed to handle heterogeneous data generated by different WSNs.
Carvalho et al. 17 considered multivariate spatiocorrelation in their work, using multiple linear regression, to reduce data transmission and to enhance prediction accuracy. In contrast, our proposed technique uses MLR to find the alternative sensor within the same cluster, where clusters are formed based on Euclidean distance.
Zamil et al. 8 proposed an energy-efficient solution for WSNs, the deviation-based estimation framework (DEF), which predicts the missing values of sensors according to the dispersion of those sensors. Depending on the distance and the time to generate accurate values, this technique minimizes power consumption with the help of neighbors’ data and historical data. Since cloud-integrated environments allow for multiple sources of data, complex data transformations are required. We resolve this complexity by normalizing data variables at an abstract level rather than at the data level.
Aderohunmu et al. 18 and Stojkoska et al. 19 discussed a data reduction strategy that conserves power using a naive prediction algorithm with low computational overhead, which suits resource-limited nodes. The predicted values rely on the last observed value. The model relies on the linearity of the data sample and on its variance, which should not be significantly larger than zero. We adopt a similar technique in modeling independent linear variables.
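The idea of predicting each reading from the last observed value can be illustrated with a minimal sketch. It is not the implementation of the cited works: the function name, the sample readings, and the `tolerance` parameter are assumptions introduced here for illustration only. The node transmits a reading only when it deviates from the last transmitted value by more than the tolerance; otherwise the receiver keeps using the last value as its prediction.

```python
def transmitted_indices(readings, tolerance):
    """Return the indices of the readings a node would actually transmit.

    The receiver predicts every new reading as the last value it received,
    so the node transmits only when the true reading drifts beyond
    `tolerance` (a hypothetical parameter, not taken from the cited works).
    """
    sent = []
    last_sent = None
    for index, value in enumerate(readings):
        if last_sent is None or abs(value - last_sent) > tolerance:
            sent.append(index)
            last_sent = value
    return sent


# Hypothetical temperature readings.
readings = [20.0, 20.1, 20.2, 21.0, 21.1, 19.5]
sent = transmitted_indices(readings, tolerance=0.5)
print(f"transmitted {len(sent)} of {len(readings)} readings: {sent}")
```

With these sample values only three of the six readings are transmitted, which is the kind of saving such a scheme relies on when the data vary slowly.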
The scheduling algorithms that have been used to schedule tasks in hierarchical environments, such as cloud and grid computing, do not take spatial features into consideration. Scheduling algorithms such as min–min, max–min, and duplex (which combines the two) focus specifically on the execution time or the completion time. 20 The traditional heuristic min–min and max–min algorithms proposed by Ibarra and Kim 21 aim to find the smallest finishing time: min–min first chooses the unassigned task with the minimum execution time and then does the same for each subsequent unassigned task. In contrast, max–min determines each unassigned task's minimum completion time in the same way but schedules first the task for which this time is largest. 22
Min–min does not consider space or workload. Chen et al. 23 enhanced min–min to consider workload, while Liu et al. 24 took quality of service, a dynamic priority model, and service cost into consideration, but not space. Another study combined min–min and max–min to overcome the obstacles of each algorithm when applied individually, but without considering the spatiotemporal issue. 25 Our contribution resolves this problem by redefining the traditional min–max algorithm to implicitly consider the spatiotemporal aspects of data tuples.
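As a minimal sketch of the two classical heuristics in their commonly stated form, the function below schedules tasks on machines from a table of execution times. The table `etc` and the function name are illustrative assumptions, and neither heuristic uses any spatial information, which is the limitation noted above.

```python
def schedule(etc, pick_max=False):
    """Min-min (default) or max-min scheduling.

    etc[t][m] is the execution time of task t on machine m (hypothetical
    input). Returns the (task, machine) pairs in scheduling order and the
    resulting makespan.
    """
    n_machines = len(etc[0])
    ready = [0.0] * n_machines          # time at which each machine is free
    unassigned = set(range(len(etc)))
    order = []
    while unassigned:
        # Minimum completion time, and the machine achieving it, per task.
        best = {}
        for t in unassigned:
            m = min(range(n_machines), key=lambda j: ready[j] + etc[t][j])
            best[t] = (ready[m] + etc[t][m], m)
        # Min-min takes the smallest of these minima, max-min the largest.
        chooser = max if pick_max else min
        t = chooser(best, key=lambda k: (best[k][0], k))
        finish, m = best[t]
        ready[m] = finish
        unassigned.remove(t)
        order.append((t, m))
    return order, max(ready)


etc = [[2, 4], [3, 1], [6, 5]]          # 3 tasks, 2 machines
min_order, min_makespan = schedule(etc)
max_order, max_makespan = schedule(etc, pick_max=True)
print("min-min:", min_order, min_makespan)
print("max-min:", max_order, max_makespan)
```

On this small table max–min schedules the long task first and finishes earlier than min–min, which illustrates why the two heuristics are often combined.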
Background
Let
The probability that
And
The parameter
Since the environment of integrated networks also has independent variables, such as time and location of requests and responses, we define the vector
In a problem where multiple categories (sensors) are required to be predicted with a given set of independent features (spatiotemporal characteristics), the probability function for
And
Notice that when
Finally, to estimate a choice from a given data (historical data), a likelihood function must be defined based on the fact that
Space parameterization
Regression analysis is a statistical approach for capturing relationships among variables (features). Deciding which regression model to use depends on three factors: number of the independent variables, the shape of the regression line, and the type of the dependent variable. The most used regression models are linear regression, logistic regression, polynomial regression, and stepwise regression.26,27
Linear regression is used when the shape of the regression line is linear (i.e.
Incorporating spatial aspects of data into a multinomial regression model required grouping data into a meaningful pool. One of the well-known methods is the
Suppose
Two general equations are required to apply the MLR formulas. 27 There is one equation for each category except the reference category. The categories are the class labels that the MLR predicts. In our case, the categories (class labels) of each cluster are its members. Thus, the number of equations equals the number of cluster members −1
where
ln(
where
Furthermore, equation (12) can be simplified as follows
Equation (13) shows our contribution in incorporating the spatial characteristics of a cluster of sensors into logistic regression. Restricting the independent variables to the set of values
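The structure described in this section, with one equation per non-reference category, can be sketched in the standard MLR form as follows. The feature vector and coefficient values are hypothetical and are not estimates from this study; the category labels 4, 7, 9 and the reference category 35 are borrowed from the example reported later in the article.

```python
import math


def category_probabilities(x, coefficients, reference):
    """Standard multinomial logistic probabilities with a reference category.

    x            feature vector (first entry 1.0 for the intercept)
    coefficients one coefficient vector per non-reference category
    reference    label of the reference category (all coefficients zero)
    """
    odds = {}
    for category, beta in coefficients.items():
        linear = sum(b * v for b, v in zip(beta, x))   # log-odds vs. reference
        odds[category] = math.exp(linear)
    denominator = 1.0 + sum(odds.values())
    probabilities = {c: o / denominator for c, o in odds.items()}
    probabilities[reference] = 1.0 / denominator
    return probabilities


# Hypothetical cluster with four member sensors; sensor 35 is the reference,
# so only three coefficient vectors (three log-odds equations) are needed.
x = [1.0, 0.5, 2.0]
coefficients = {4: [0.2, -1.0, 0.3], 7: [0.0, 0.5, -0.2], 9: [-0.4, 0.1, 0.1]}
probs = category_probabilities(x, coefficients, reference=35)
best = max(probs, key=probs.get)
print(probs, "-> predicted sensor:", best)
```

Because prediction is restricted to the members of one cluster, the number of coefficient vectors stays small (members minus one) regardless of how many sensors the whole network contains.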
Dynamic load balancing
According to the operational definitions that have been mentioned previously, Figure 2 demonstrates the proposed dynamic load balancing algorithm.

Dynamic load balancing based on the proposed prediction model.
Our proposed load balancing algorithm takes eight parameters as input. The number of tasks
The algorithmic description below illustrates the sequence of instructions used to decide on the set of alternative sensors to substitute a given one. The outer loop at line 2 is responsible for analyzing tasks. First, it checks whether the task is a subtask of its predecessor (lines 3–9). If the condition is satisfied, then the larger task will be picked; otherwise, the shortest task will be picked.
The algorithm then computes the remaining power after the chosen task is executed and the current threshold value (lines 10 and 11). If the threshold (cutoff) value is greater than the remaining power, an alternative sensor is sought to take the task (lines 13–17). Otherwise, the task is assigned to its designated sensor.
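The threshold check at the core of this procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of Figure 2: the subtask test is omitted, the threshold is taken as the energy of the shortest task in the round, the same check is applied to designated and alternative sensors, and the function name, data layout, and numbers are hypothetical. The ranked list of alternatives stands in for the output of the prediction model.

```python
def balance_round(tasks, remaining, alternatives):
    """Assign each task to its designated sensor or to a substitute.

    tasks        list of (designated_sensor, energy_cost) pairs
    remaining    dict: sensor -> remaining power before the round
    alternatives dict: sensor -> ranked substitutes (assumed to come from
                 the prediction model)
    Returns the chosen sensor per task (None if no sensor qualifies) and
    the remaining power after the round.
    """
    remaining = dict(remaining)                      # do not mutate the input
    threshold = min(cost for _, cost in tasks)       # shortest task's energy
    assignment = []
    for designated, cost in tasks:
        chosen = None
        for sensor in [designated] + list(alternatives.get(designated, [])):
            # Keep the sensor only if it stays at or above the threshold
            # after executing the task.
            if remaining[sensor] - cost >= threshold:
                chosen = sensor
                remaining[sensor] -= cost
                break
        assignment.append(chosen)
    return assignment, remaining


tasks = [("s4", 3.0), ("s4", 2.0), ("s7", 1.0)]
remaining = {"s4": 4.5, "s7": 10.0}
alternatives = {"s4": ["s7"]}
assignment, after = balance_round(tasks, remaining, alternatives)
print(assignment, after)
```

In this example the second task would push sensor s4 below the threshold, so it is redirected to the substitute s7 while s4 keeps its remaining charge.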
The number of tasks is always greater than the number of sensors during operational time (convergence), that is,
Modeling the threshold value
Let
Consequently, ignoring the computational part, the amount of energy consumed by a specific task is given by equation (16)
where
To recognize how much power a sensor has consumed, we computed the power consumed by each task that a sensor has performed. The summation of the power consumed by every single task will give us the overall power consumed by a sensor, given by equation (17)
where
A main feature that must be defined, in addition to the mission time, is the initial battery capacity (source of power), which is the value that the manufacturing company of the battery will provide. The battery capacity, after subtracting the power consumed by just operating a sensor without assigning any task to it along the whole mission time, is given by equation (18). The remaining power in the battery can be computed by subtracting the power consumed in a sensor from the battery capacity in that sensor as shown in equation (19)
where
Another important feature that must be defined is the threshold of energy consumption by a task. Because a sensor will be assigned multiple tasks, the threshold is set equal to the energy consumed by the shortest task. Thus, in each round the value of the threshold will differ from its value in the previous round.
To prevent a sudden shutdown that would compromise network coverage, a dynamic threshold for choosing a suitable alternative sensor is given by equation (20). If the remaining power of a specific alternative sensor is greater than the threshold, the sensor is chosen to perform the task. Otherwise, it will not receive tasks for the current round. This operation is important for choosing the tasks that are scheduled to each sensor
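The bookkeeping described in this section can be sketched in a few lines. The per-task energies are taken as given inputs here (the sketch does not reproduce equation (16)), and all names and numbers are hypothetical: the power consumed by a sensor is the sum over its tasks, the remaining power is the battery capacity minus idle operation and task consumption, and the round threshold is the energy of the shortest pending task.

```python
def consumed_power(task_energies):
    """Total power consumed by the tasks a sensor has performed."""
    return sum(task_energies)


def remaining_power(initial_capacity, idle_consumption, task_energies):
    """Battery capacity left after idle operation and executed tasks."""
    usable_capacity = initial_capacity - idle_consumption
    return usable_capacity - consumed_power(task_energies)


def round_threshold(pending_task_energies):
    """Threshold for the current round: energy of the shortest pending task."""
    return min(pending_task_energies)


# Hypothetical values in arbitrary energy units.
executed = [3.0, 2.0, 5.0]
left = remaining_power(initial_capacity=100.0, idle_consumption=10.0,
                       task_energies=executed)
threshold = round_threshold([4.0, 1.5, 2.0])
print(left, threshold, left > threshold)
```

Because the set of pending tasks changes from round to round, the threshold computed this way changes with it.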
where
Experiments and result
The experiments were performed using the IBM SPSS and WEKA tools for the data mining tasks, such as preprocessing and clustering. The experimental results are twofold: error rate and energy consumption. The error rate indicates the performance of the prediction technique and its ability to estimate values, while the energy consumption experiments show the effectiveness of the proposed strategy in minimizing the energy consumption rate compared with the traditional technique.
Datasets
To test the proposed methodology, two datasets were used: Data Sensing Lab 29 and the Intel Data Lab dataset (available at: http://db.csail.mit.edu/labdata/labdata.html). Both datasets were collected using integrated WSNs. Table 1 describes each dataset. The Data Sensing Lab dataset has two parts: the second floor and third floor sub-datasets. Since they have different distributions, we treated them as separate datasets. Thus, the experiments were performed on three datasets.
Dataset description.
Results and discussion
This section describes the experiments performed to measure the effectiveness of the proposed model. In other words, we provide evidence to test two main hypotheses: (1) the proposed model achieves a low error rate on the prediction task and (2) the proposed model minimizes the power consumption rate. Before measuring the error and power consumption rates, we briefly illustrate the application of our methodology to the datasets.
First, we performed the best
Figure 3 demonstrates cluster memberships; members are the sensors that are deployed in the second floor of Data Sensing Lab, and it shows to which cluster they belong, the distance between each member, and its cluster center.

Cluster memberships:
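A cluster membership table of this kind, giving each sensor's cluster and its distance to the cluster center, can be produced with a short sketch. It shows only the assignment step for given centers, not the full clustering performed in the experiments, and the sensor names and coordinates are hypothetical.

```python
import math


def cluster_memberships(sensors, centers):
    """Map each sensor to (nearest cluster index, distance to that center).

    sensors  dict: sensor id -> (x, y) position (hypothetical coordinates)
    centers  list of (x, y) cluster centers
    """
    membership = {}
    for sensor_id, position in sensors.items():
        distances = [math.dist(position, center) for center in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        membership[sensor_id] = (nearest, distances[nearest])
    return membership


sensors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (10.0, 10.0), "d": (9.0, 10.0)}
centers = [(0.5, 0.0), (9.5, 10.0)]
for sensor_id, (cluster, distance) in cluster_memberships(sensors, centers).items():
    print(sensor_id, cluster, round(distance, 2))
```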
The same calculations have been performed for the other datasets: third floor and Intel Data Lab datasets. For each dataset, the best
Next, we simulated the environment using our proposed logistic model. Such simulation provided us the opportunity to apply our algorithmic strategy: (1) measuring power consumption using normal flow, (2) predicting alternative load distribution, and (3) measuring power consumption after applying the proposed load balancing strategy.
Figure 4 shows a snapshot from the results of applying our methodology on the first dataset. This snapshot illustrates an example of four groups (4, 7, 9, and 35). The dependent variable is the sensor_id, which means we have to calculate three log odds (or

Snapshot of parameter estimate for the multinomial logistic regression.
The second column in Figure 4 shows the intercepts, coefficient values for the reference category 35 (in case other coefficients were set to zero), and
After collecting the results of applying the prediction model to all datasets, we evaluated the efficiency of the proposed methodology by computing the error rate resulting from the prediction task and the amount of power saved after applying our methodology. The following two sections present and discuss the results in detail.
Error rate as a penalty for prediction
To find the error rate between a predicted value and an actual value, we used the relative error model, which divides the absolute difference between two floating points:
Since our prediction model predicts the sensor_id value, the aim is to find an alternative sensor that can provide the data readings within some tolerance. We therefore did not compute the error rate between the actual and the predicted sensor_id values; instead, we computed the relative difference between the readings given by the actual sensor and the readings given by the predicted sensor, taking into account the spatiotemporal characteristics within the dataset.
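The comparison between the readings of the actual sensor and those of its predicted substitute can be sketched as follows. The sketch assumes the usual definition of relative error, with the absolute difference divided by the magnitude of the actual reading; the function names and the sample readings are hypothetical.

```python
def relative_error(actual, predicted):
    """Relative error of one reading (assumed form: |a - p| / |a|)."""
    if actual == 0:
        raise ValueError("relative error is undefined for an actual value of 0")
    return abs(actual - predicted) / abs(actual)


def average_relative_error(actual_readings, substitute_readings):
    """Mean relative error over readings paired by timestamp."""
    pairs = list(zip(actual_readings, substitute_readings))
    return sum(relative_error(a, p) for a, p in pairs) / len(pairs)


# Hypothetical temperature readings from a sensor and its substitute.
actual = [20.0, 25.0]
substitute = [21.0, 24.0]
print(average_relative_error(actual, substitute))
```

Pairing the readings by timestamp is one way to respect the temporal dimension, while restricting substitutes to the same cluster covers the spatial one.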
To compare among our proposed methodology and traditional logistic regression, we considered the case
Average error rate according to
NAP: Not APplicable.
Bold values refer to the minimum error rate at each dataset.
Figures 5–7 demonstrate the average error rate for each sensor node for Data Sensing Lab second floor, where each line illustrates a specific

Average error rate for each sensor node in Data Sensing Lab second floor dataset.

Average error rate for each sensor node in Data Sensing Lab third floor dataset.

Average error rate for each sensor node in Intel Data Lab dataset.
Finally, Table 3 shows the results of applying t-test analysis at 95% confidence interval. The results indicated that the difference between traditional load distribution (at
Results of statistical t-test.
M-value: mean value of data; SD-value: standard deviation of data; t-value: measure of the difference with respect to data variation;
We noticed during experiments that there is an inverse relation between the number of clusters and the degree of significance. In other words, the higher the number of clusters, the lower the
Energy consumption
To test the effectiveness of our proposed methodology in minimizing power consumption by cutting off redundant activities and distributing the processing load over the network, we simulated the environment and computed the energy consumption for each sensor in the dataset. Tables 4–6 show the normal consumption rate, the consumption rate after applying our prediction technique, the amount of reduction, and the percentage of enhancement.
Energy consumption and reduction in Data Sensing Lab second floor’s sensors.
Energy consumption and reduction in Data Sensing Lab third floor’s sensors.
Energy consumption and reduction in Intel Data Lab’s sensors.
The results in Table 4 show that the average reduction in energy consumption is 7.34 mV and the average enhancement rate is 21%. Sensors with zero reduction are those for which our proposed methodology failed to find a substitute. Furthermore, there is a significant difference between the total energy consumed by the traditional method and by the predicted one. The reason for this difference is the removal of redundancy (i.e. tasks that are subtasks or exact duplicates of other tasks).
In the third floor dataset, the average reduction in energy consumption was significantly lower than in the second floor dataset: 5.33 mV, which is 27% lower than the reduction at the second floor. Furthermore, the average enhancement rate was 18%, compared with 21% at the second floor. Table 5 shows the energy consumption and reduction rate in the Data Sensing Lab third floor dataset.
Finally, Table 6 shows a significant difference in energy reduction between the first two datasets and the last one, Intel Data Lab, where the average reduction in energy was 35.1 mV and the average enhancement rate was 40%.
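A quick arithmetic check relates the two average reductions reported for the Data Sensing Lab floors. Only the two averages come from the text; the variable names are illustrative.

```python
second_floor_reduction = 7.34   # average reduction reported for the second floor
third_floor_reduction = 5.33    # average reduction reported for the third floor

ratio = third_floor_reduction / second_floor_reduction
relative_drop = 1.0 - ratio

print(f"third floor reduction is {ratio:.0%} of the second floor value")
print(f"that is, {relative_drop:.0%} lower")
```

The third floor figure is therefore about 73% of the second floor figure, that is, roughly 27% lower.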
Conclusion and future work
One of the major problems that WSNs face is the limited energy source of sensors. In this research, we presented an energy-aware prediction model for WSNs that are connected to a cloud, which increases the heterogeneity of WSNs. We considered spatiotemporal characteristics by incorporating spatial distribution of sensors in a modified version of MLR model, to be based on
Furthermore, the results showed that predicting alternative sensors using our proposed model increased the lifetime and reduced the energy consumption in the networks. Moreover, we proved that considering the spatiotemporal issue and applying multiple iterations of
Based on the previous experiments, we can conclude that the modified MLR with
However, we look forward to enhancing the method of finding the threshold value of the remaining power in a sensor

Area chart of remaining power in sensors.
Where the area under the curve equals
