Sage Journals: Discover world-class research

Abstract

Wireless sensor networks have become integral components of modern and smart environments. The main challenge for such important data-acquisition tools is the limited amount of available energy. In integrated networks in which cloud systems act as a self-regulatory controller, distributing the computational load among available partitions with rich energy will positively influence the lifetime of the whole network. This article investigates the application of a modified version of multinomial logistic regression model that incorporates spatiotemporal aspects of data collected from smart environments. The contribution of this research is to propose an energy-efficient load balancing strategy based on the proposed prediction model for the purpose of enhancing the lifetime of wireless infrastructure. Our proposed algorithm grows linearly in terms of time complexity. Extensive experiments have been performed to measure the prediction error rate and the energy consumption. The results showed that the proposed model significantly reduces the error rate and distinctly maximizes the lifetime of wireless sensor networks.

Keywords

Wireless sensor networks, smart environment, smart cities, spatiotemporal logic, multinomial logistic regression, load balancing, data mining

Introduction

The world is encountering growing urbanization while major cities have become a driver for economic growth. Consequently, urban areas are transforming into smart cities, where decisions are made by self-organized and autonomous networks. Several existing smart infrastructures have been developed from multiple integrated wireless sensor networks (WSNs), in which sensors are responsible for acquiring data from real-world environments. One critical challenge in this domain is the limited energy that is available to every sensor, where batteries are the main source of power. This problem directly affects the lifetime of such integrated networks and, thus, their sustainability.

Figure 1 shows an example of multiple integrated and heterogeneous WSNs in a smart environment, in which a cloud infrastructure controls such integration. In this platform, real-time processing is an essential operation in which cloud services respond immediately to requests. However, integration among multiple networks added more complexity in maintaining power consumption (load balancing). The problem arises when sensor nodes have different energy utilities and power consumption strategies. For instance, the system might fail to provide its desired functionality if some sensors are running out of power. Therefore, reliability and availability are significant attributes in designing such systems, in which sensory nodes are distributed over a large area.¹

Figure 1.

Integrated and heterogeneous WSN in cloud infrastructure.

This article investigates the application of a modified version of multinomial logistic regression (MLR) that uses the spatiotemporal characteristics of a given environment to develop a dynamic load balancing strategy. A prediction model is applied to estimate the best set of sensors that could be used as substitutes to extend the lifetime of such networks. MLR models nominal variables whose outcomes are expressed as a linear combination of the predictor variables.

Since the problem of distributing energy load is a multiclass one (as it is dependent on the type of sensor, time, location, and context), generalizing the logistic regression model by predicting the discrete outcomes (sensor IDs) would enhance the development of the load balancing strategy. Given a set of independent variables $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ and a set of estimating parameters $b = \{b_{1}, b_{2}, \dots, b_{n}\}$, the problem is to find a discrete value (or set of values) that satisfies the following probability distribution model

$f (x, b) = b_{0} + b_{1} x_{1} + b_{2} x_{2} + \dots + b_{n} x_{n}$ (1)

The contribution of this research is to redefine the feature set $b = \{b_{1}, b_{2}, \dots, b_{n}\}$ to reflect the spatiotemporal features of the environment and to apply a best-k clustering algorithm to the independent variables $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ to adjust the prediction task (optimization problem). In other words, the main objective of this research is to propose a prediction technique that predicts the best set of sensors that can handle a specific task for dynamic load balancing. A job is assigned to one or more sensors only after guaranteeing that the chosen nodes will achieve the network coverage, as a solution to extend batteries' lifetime for integrated WSNs. Then, the scheduling task takes place with an enhanced min–min scheduling algorithm that considers the locations of sensors and the temporal relations among these locations.

The rest of this article is organized as follows: section “Literature review” discusses and compares the related research with the proposed one. Section “Background” provides a formal description of preliminary definitions and equations that provide the baseline for the proposed methodology. Section “Space parameterization” describes our formal contribution in incorporating spatial distribution of sensors in the MLR model. Section “Dynamic load balancing” describes the proposed algorithm for dynamic load balancing strategy. Section “Modeling the threshold value” discusses our methodology in computing rounded threshold value. Section “Experiments and result” describes and discusses the experiments that have been conducted to evaluate the output in terms of error rate and energy consumption. Section “Conclusion and future work” summarizes the conclusion and the suggested future enhancements of this research.

Literature review

There are many energy-aware approaches proposed for solving the problem of limited energy in WSNs. According to Soua and Minet,² prediction-based energy-efficient techniques are classified into three categories based on (1) data level such as reducing data production, transmission, and processing; (2) routing protocols by choosing shortest paths; and (3) overhead reduction. Since our contribution focuses on the data-level techniques, the rest of this section illustrates and compares only data-level techniques.

Recent research focused on developing optimal communication techniques at data level to minimize the number of communicating messages, which would reduce the energy consumption rate.^3–7 Contextually, the consumption rate depends heavily on two sources: computation and transmission. Specifically, the amount of energy that is required for computation is defined as the amount of electrical power that is required by a node (sensor or central node) to execute a set of instructions.⁸ Although computation is an important source, previous research on similar applications has shown that this part can be ignored, as the amount of energy that is required for transmitting data (a function of packet size and distance among nodes) is always significantly larger than the amount required by computation.⁹ Our proposed methodology adopts this assumption and expands the application of such techniques to resolve the problem in a heterogeneous environment.

Furthermore, few researchers^10–13 have proposed prediction techniques to reduce data transmission to reduce energy consumption. In Lu et al.¹⁴ and Wei et al.,¹⁵ data aggregation has been applied as a basis for predicting data in energy-efficient frameworks. Moreover, Kerasiotis et al.¹⁶ have applied a prediction model to justify the battery consumption in residual energy. Unfortunately, these efficient techniques were not designed to handle heterogeneous data that have been generated from different WSNs.

Carvalho et al.¹⁷ considered multivariate spatiocorrelation in their work, using multiple linear regression, to reduce data transmission and to enhance prediction accuracy. In contrast, our proposed technique uses MLR to find the alternative sensor within the same cluster, where clusters are formed based on Euclidean distance.

An energy-efficient solution for WSNs that predicts the missing values of sensors according to the dispersion of those sensors (deviation-based estimation framework (DEF)) was proposed by Zamil et al.⁸ Depending on the distance and the time to generate accurate values, this technique minimizes the power consumption with the help of neighbors' data and historical data. Since cloud integrated environments allow for multiple sources of data, complex data transformations are required. We resolve such complexity by normalizing data variables at the abstract level rather than the data level.

Research by Aderohunmu et al.¹⁸ and Stojkoska et al.¹⁹ discussed a data reduction strategy to conserve power consumption using a naive prediction algorithm that is light in terms of computational overhead on a node (limited). The predicted values rely on the last observed value. The model basically relies on the linearity of the data sample and also the variance where it should not be significantly larger than zero. We adopt similar technique in modeling independent linear variables.

The scheduling algorithms that have been used to schedule tasks in hierarchical environments, such as cloud and grid computing, did not take spatial features into consideration. Scheduling algorithms such as min–min, max–min, and duplex (which is a combination of the last two algorithms) focused specifically on the execution time or the completion time.²⁰ The traditional heuristic min–min and max–min algorithms that have been proposed by Ibarra and Kim²¹ focused on finding the smallest finishing time, where min–min first chooses the unassigned task with the minimum execution time and does the same for the next unassigned task. However, the max–min algorithm chooses first the unassigned task with the minimum time, but chooses the next unassigned task with the maximum execution time.²²
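As an illustration of the baseline heuristic discussed above, the following is a minimal sketch of min–min scheduling in its common textbook form, which selects tasks by earliest completion time. It is not taken from the cited works; the function name `min_min_schedule` and the `exec_time` matrix are hypothetical, and no spatial or temporal information is used.

```python
def min_min_schedule(exec_time):
    """Textbook min-min heuristic (illustrative sketch only).

    exec_time[t][m] is the assumed execution time of task t on machine m.
    Returns a list of (task, machine, completion_time) in assignment order.
    """
    n_machines = len(exec_time[0])
    ready = [0.0] * n_machines              # time at which each machine becomes free
    unassigned = set(range(len(exec_time)))
    schedule = []
    while unassigned:
        best = None
        for t in sorted(unassigned):
            # Machine that gives task t its earliest completion time.
            m = min(range(n_machines), key=lambda k: ready[k] + exec_time[t][k])
            finish = ready[m] + exec_time[t][m]
            # Keep the task whose earliest completion time is the smallest.
            if best is None or finish < best[2]:
                best = (t, m, finish)
        t, m, finish = best
        ready[m] = finish
        unassigned.remove(t)
        schedule.append(best)
    return schedule
```

For example, `min_min_schedule([[3, 5], [2, 4], [6, 1]])` assigns task 2 first, then task 1, then task 0. Max–min differs only in picking the task whose earliest completion time is the largest.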

Min–min does not consider space or workload. A previous work by Chen et al.²³ enhanced min–min to consider workload, while Liu et al.²⁴ took the quality of service, dynamic priority model, and cost service into consideration, but not the space. Another study combined min–min and max–min to overcome the obstacles of each algorithm when it is applied individually, but without considering the spatiotemporal issue.²⁵ Our contribution in this research resolves this problem by redefining the traditional min–max algorithm to implicitly consider the spatiotemporal aspects of data tuples.

Background

Let $D$ be a dataset with $n$ observations (history), in which a given feature can take one of several discrete values $\{0, 1, 2, \dots, K\}$. Let $Y = \{s_{1}, s_{2}, \dots, s_{j}\}$ be a set of nominal choices that represent sensors in an unordered (maybe random) form. The prediction task is defined to choose the set $E = \{s_{i}, s_{i + 1}, \dots, s_{i + n}\}$ in which $n \geq 2$ and $E \subseteq Y$.

The probability that the ith sensor is picked for the jth category is given as follows

$\Pr \{Y_{i} = j\} = β_{ij}$ (2)

And

$\sum_{j = 1}^{J} β_{ij} = 1, \quad \forall i \in \{1, \dots, n\}$ (3)

The parameter $β_{ij}$ represents the value of the probability function $\Pr$ for the jth category. Equation (3) shows that the parameter $β_{ij}$ is normalized; that is, for each observation, the probabilities over all categories sum to 1.

Since the environment of integrated networks also has independent variables, such as time and location of requests and responses, we define the vector $X = \{x_{1}, x_{2}, \dots, x_{f}\}$ in which $| X |$ is the total number of effective features or the features that contribute directly in predicting the nominal choice. Therefore, the nominal probability will be modeled as follows

$\log \left( \frac{β_{ij}}{β_{iJ}} \right) = X_{i} β_{j}, \quad \forall j \in \{1, 2, \dots, J - 1\}$ (4)

In a problem where multiple categories (sensors) are required to be predicted with a given set of independent features (spatiotemporal characteristics), the probability function for jth category and ith independent variable is defined as follows

$\Pr (Y_{i} = j \mid X_{i}) = β_{ij} = \frac{\exp (X_{i} β_{j})}{1 + \sum_{k = 1}^{J - 1} \exp (X_{i} β_{k})}, \quad \forall j \in \{1, 2, \dots, J - 1\}$ (5)

And

$\Pr (Y_{i} = J \mid X_{i}) = β_{iJ} = \frac{1}{1 + \sum_{k = 1}^{J - 1} \exp (X_{i} β_{k})}$ (6)

Notice that when $J > 2$ , the model is called multinomial logistic model.
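Equations (5) and (6) can be illustrated with a short sketch. The function name `mlr_probabilities` and the plain-list representation of $X_{i}$ and the coefficient vectors are assumptions made for this example; the last entry of the returned list corresponds to the reference category $J$.

```python
import math

def mlr_probabilities(x, betas):
    """Category probabilities of a multinomial logistic model (illustrative sketch).

    x     : feature vector X_i (list of floats)
    betas : J-1 coefficient vectors, one per non-reference category
    Returns J probabilities; the last one belongs to the reference category J.
    """
    # exp(X_i * beta_j) for each non-reference category j, as in equation (5).
    scores = [math.exp(sum(xv * bv for xv, bv in zip(x, b))) for b in betas]
    denom = 1.0 + sum(scores)
    # Equation (6) gives the reference category the probability 1 / denom.
    return [s / denom for s in scores] + [1.0 / denom]
```

By construction the returned values sum to 1, and the log of the ratio between any category's probability and the reference category's probability recovers the linear predictor $X_{i} β_{j}$ of equation (4).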

Finally, to estimate a choice from given (historical) data, a likelihood function must be defined based on the fact that $\Pr \{Y_{i} = j\} = β_{ij}$. Since the probability that a given ith choice meets the jth category depends on a given set of historical data $D$, the likelihood function is defined as follows

$f (β \mid D) = \prod_{i \in Category\ 1} β_{i1} \prod_{i \in Category\ 2} β_{i2} \dots \prod_{i \in Category\ J} β_{iJ}$ (7)
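In practice, a product such as equation (7) is usually evaluated through its logarithm, which turns the product into a sum and avoids numerical underflow. The sketch below assumes that the probabilities $β_{ij}$ are already available as a nested list; the names `log_likelihood`, `probs`, and `labels` are hypothetical.

```python
import math

def log_likelihood(probs, labels):
    """Logarithm of the likelihood in equation (7) (illustrative sketch).

    probs[i][j] : modelled probability that observation i falls in category j
    labels[i]   : observed category index of observation i
    """
    # Each observation contributes the log-probability of its observed category.
    return sum(math.log(probs[i][labels[i]]) for i in range(len(labels)))
```

Maximizing this quantity over the coefficients yields the maximum likelihood estimates; the optimization step itself is omitted here.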

Space parameterization

Regression analysis is a statistical approach for capturing relationships among variables (features). Deciding which regression model to use depends on three factors: number of the independent variables, the shape of the regression line, and the type of the dependent variable. The most used regression models are linear regression, logistic regression, polynomial regression, and stepwise regression.^26,27

Linear regression is used when the shape of the regression line is linear (i.e. $J = 1$ ). Logistic regression, on the other hand, is used to predict a categorical dependent variable from continuous or categorical independent features when the relationship between the dependent variable and the independent variables is not linear. If the dependent variable is categorical and the order of its values is important, the model is an ordinal logistic regression. If the dependent variable has more than two unordered classes, the model is an MLR. This research addresses a categorical dependent variable with more than two class labels, for which the MLR model is a suitable fit.

Incorporating spatial aspects of data into a multinomial regression model requires grouping data into meaningful pools. One of the well-known methods is k-means clustering. k-Means clustering is based on divergence among objects, in which a distance function is defined to reflect the amount of space that separates given objects from each other. Equation (8) shows the objective function that is used in k-means clustering, which is based on the distance between an object o and the centroid of its cluster c²⁸

$E = \sum_{j = 1}^{k} \sum_{o \in C_{j}} dist (o, c_{j})^{2}$ (8)

Suppose $D$ is a dataset, where the objects in $D$ are distributed into $k$ clusters, $\{C_{1}, \dots, C_{k}\}$, where $C_{j} \subset D$, and $C_{j} \cap C_{v} = \emptyset$ for $1 \leq j, v \leq k$ with $j \neq v$. $dist (o, c_{j})$ is the Euclidean distance between an object $o \in C_{j}$ and $c_{j}$, the centroid of cluster $C_{j}$.
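The objective in equation (8) can be computed directly once the cluster memberships are known. The following sketch assumes that each cluster is given as a list of coordinate tuples (for example, sensor positions) and that the centroid is the coordinate-wise mean; the function name `sse` is hypothetical.

```python
def sse(clusters):
    """Within-cluster sum of squared Euclidean distances, equation (8) (sketch).

    clusters : list of clusters, each a non-empty list of coordinate tuples
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid c_j: coordinate-wise mean of the members of cluster C_j.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # Add dist(o, c_j)^2 for every object o in the cluster.
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
        )
    return total
```

For two clusters `[(0.0, 0.0), (2.0, 0.0)]` and `[(5.0, 5.0)]`, the first centroid is (1, 0), so each of its members contributes 1 and the singleton contributes 0.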

There are mainly two general equations that are required to apply MLR formulas.²⁷ For each category, there is an equation except for the reference category. The categories refer to the class-labels that the MLR will predict. In our case, each cluster has different categories (class-labels), the members of the clusters. Thus, the number of equations equals the number of cluster members −1

$Z = b_{0} + b_{1} x_{i1} + b_{2} x_{i2} + \dots + b_{l} x_{il}, \quad \forall i \in \{1, 2, 3, \dots, n\}$ (9)

$\ln (odds (event)) = b_{0} + b_{1} x_{i1} + b_{2} x_{i2} + \dots + b_{l} x_{il}, \quad \forall i \in \{1, 2, 3, \dots, n\}$ (10)

where Z is the log odds of the dependent variable $Sensor_{id}$, that is, $Z = \ln (odds (event))$. Log odds are the natural log of the odds of the dependent variable. The log odds equal the natural log of the probability that the event occurs divided by the probability that it does not occur

$\ln (odds (event)) = \ln \left( \frac{prob (event)}{prob (non\text{-}event)} \right)$ (11)

$b_{0}$ is the constant, or intercept, coefficient in the equation. It reflects the log odds of the dependent variable when all model predictors are evaluated at zero. $b$ denotes the logistic regression coefficients (also called parameter estimates), which are obtained via the maximum likelihood method. $x$ is an independent variable, $l$ is the number of independent variables, and $n$ is the number of readings.

The natural log of the odds that event y equals a category s, relative to the odds that event y equals the reference category r, can be written as follows

$\ln (odds (event\ y)) = \ln \left( \frac{\Pr (y = s \in C_{j} \mid x_{1, 2, 3, \dots, l}) / (1 - \Pr (y = s \in C_{j} \mid x_{1, 2, 3, \dots, l}))}{\Pr (y = r \in C_{j} \mid x_{1, 2, 3, \dots, l}) / (1 - \Pr (y = r \in C_{j} \mid x_{1, 2, 3, \dots, l}))} \right)$ (12)

where $y$ is an event. $s$ is the current category (given value), which is a member of cluster $C_{j}$ . $r$ is the reference category (reference value), which is a member of cluster $C_{j}$ . $x$ is an independent variable.

Furthermore, equation (12) can be simplified as follows

$\Pr (y = s \mid x \in C_{j}) = \frac{\exp (b_{0} + b_{1s} x_{i1} + b_{2s} x_{i2} + \dots + b_{ls} x_{il})}{1 + \exp (b_{0} + b_{1s} x_{i1} + b_{2s} x_{i2} + \dots + b_{ls} x_{il})}, \quad \forall i \in \{1, 2, 3, \dots, n\}$ (13)

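$Exp (b)$ is the odds ratio for an independent variable. It equals $e$, the base of the natural logarithm, raised to the power of $b$, and it is the factor by which the odds of the dependent variable are multiplied for a one-unit increase in the independent variable. The odds increase when $b$ is positive ($Exp (b) > 1$) and decrease when $b$ is negative ($Exp (b) < 1$).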
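The interpretation of $Exp (b)$ as a multiplicative factor on the odds can be checked numerically. The sketch below uses a single predictor and hypothetical coefficients `b0` and `b1` in the form of equation (13); none of these values come from the experiments.

```python
import math

def prob_from_log_odds(z):
    """Probability for a given log odds z, in the form of equation (13)."""
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    """Odds of an event that has probability p."""
    return p / (1.0 - p)

# Hypothetical intercept and coefficient, chosen only for illustration.
b0, b1 = -1.0, 0.4

# Probabilities at x = 2 and x = 3, a one-unit increase in the predictor.
p_low = prob_from_log_odds(b0 + b1 * 2.0)
p_high = prob_from_log_odds(b0 + b1 * 3.0)

# The ratio of the two odds equals exp(b1), whatever the starting value of x.
ratio = odds(p_high) / odds(p_low)
```

Here `ratio` agrees with `math.exp(b1)` up to floating-point error, which is the sense in which a coefficient above zero (odds ratio above 1.0) raises the odds of the category.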

Equation (13) shows our contribution in incorporating the spatial characteristics of a cluster of sensors into logistic regression. Restricting the independent variables to the set of values X reflects how a set of sensors are spatially related. In this case, we applied the traditional Euclidean distance. However, this definition allows for defining more complex spatial relations among a cluster of spatially related sensors.

Dynamic load balancing

According to the operational definitions that have been mentioned previously, Figure 2 demonstrates the proposed dynamic load balancing algorithm.

Figure 2.

Dynamic load balancing based on the proposed prediction model.

Our proposed load balancing algorithm takes eight parameters as input. $N$ is the number of tasks, such that $N = | T |$, where $T = \{T_{1}, T_{2}, \dots, T_{n}\}$ is the set of tasks to satisfy a given request. $Pt_{i}$ is the amount of power required to execute task $T_{i}$. $PS_{j}$ is the amount of power required by a given sensor $S_{j}$ to sense the environment and send the data back to a specific base station. $IBC_{j}$ is the initial battery capacity of sensor $S_{j}$. $RE_{j}$ is the amount of remaining power in $S_{j}$. $Thr$ is a threshold (cutoff) value that is computed with respect to the remaining power within sensors in a specific cluster. Tasks will be checked against redundancy. For instance, if $Task_{1} \subseteq Task_{2}$, this implies that $Task_{1}$ is redundant. The order of task execution is also restricted to give higher priority to the shortest tasks.

The algorithmic description below illustrates the sequence of instructions to decide on the set of alternative sensors to substitute a given one. The outer loop at line 2 is responsible for analyzing tasks. First, it checks whether the task is a subtask of its successor or not (lines 3–8). If the condition is satisfied, then the longest task will be picked; otherwise, the shortest will be picked.

The algorithm then computes the remaining power when the chosen task is executed and computes the current threshold value (lines 9 and 10). If the cutoff value is greater than the remaining power, the alternative sensors will be examined to choose a substitute (lines 11–20). Otherwise, the task will be assigned to its designated sensor.

Dynamic load balancing

Input:
$N$: number of tasks
$S$: set of sensors in a cluster
$Pt_{i}$: power consumption by task i
$PS_{j}$: power consumption by sensor j
$IBC_{j}$: initial battery capacity for sensor j
$RE_{j}$: remaining power in sensor j
$Thr_{z}$: threshold of remaining power in a sensor

Output:
$t$: task after processing

Begin
1. Receive requests from user (set of tasks)
2. For (i = 1, i <= N, i++) {
3.     If ($T_{i} \subseteq T_{i + 1}$) {
4.         Subtasks = true;
5.         Choose longest $T$ }
6.     Else {
7.         Subtasks = false;
8.         Choose shortest $T$; }
9.     $RE_{j} = IBC_{j} - PS_{j}$;
10.    $Thr_{z} = \frac{\sum_{j}^{z} RE_{j}}{z}$;
11.    If ($RE_{j} < Thr$) {
12.        Check alternative sensors;
13.        If (alternative sensors = true) {
14.            For (j = 1, j <= |S|, j++) {
15.                If ($RE_{ja} < Thr_{z}$)
16.                    Continue;
17.                Else {
18.                    Break;
19.                    Assign $T_{i}$ to $S_{ja}$;
20.                    Return $T_{i}$; }}}
21.        Assign task to $S_{j}$; }
22.    Return $T_{i}$;
23.    Update $T$; }
End

The number of tasks is always greater than the number of sensors during operational time (convergence), that is, $N \gg | S |$. Moreover, the number of sensors $| S |$ is always constant. The time complexity that is required for the proposed load balancing is $O (N)$, where $N$ is the total number of processed tasks; that is, $| T |$. This implies that the proposed strategy grows linearly in terms of time complexity.
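The core of the strategy, the threshold check followed by the search for an alternative sensor, can be sketched as follows. This is a simplified reading of the algorithm rather than a faithful implementation: the redundancy (subtask) check and the task ordering are omitted, and the function name `dynamic_load_balance`, the task tuples, and the energy dictionary are assumptions made for the example.

```python
def dynamic_load_balance(tasks, remaining):
    """Assign each task to its designated sensor or to an alternative (sketch).

    tasks     : list of (designated_sensor, power_cost) pairs
    remaining : dict mapping sensor id -> remaining energy RE_j in one cluster
    Returns (assignment, energy), where assignment is a list of
    (task_index, chosen_sensor) and energy is the updated copy of remaining.
    """
    energy = dict(remaining)                 # leave the caller's data unchanged
    assignment = []
    for i, (sensor, cost) in enumerate(tasks):
        # Threshold: average remaining energy over the z sensors of the cluster.
        threshold = sum(energy.values()) / len(energy)
        chosen = sensor
        if energy[sensor] < threshold:
            # Scan the cluster for the first sensor at or above the threshold.
            for alt in energy:
                if alt != sensor and energy[alt] >= threshold:
                    chosen = alt
                    break
        energy[chosen] -= cost
        assignment.append((i, chosen))
    return assignment, energy
```

The inner scan visits at most $| S |$ sensors per task, so with a constant cluster size the total work stays proportional to the number of tasks, in line with the $O (N)$ bound stated above.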

Modeling the threshold value

Let µ be the amount of energy that is required by a single read in millivolt. Then

$μ = \frac{\text{Full energy}}{\text{Number of reads}}$ (15)

By ignoring the computational part, consequently, the amount of energy consumed by a specific task is given by equation (16)

$Pt_{i} = m \times μ$ (16)

where $Pt_{i}$ is the amount of power consumed by a specific task, m is the number of messages, and µ is a constant that represents how much energy is consumed by sending a single message.

To recognize how much power a sensor has consumed, we computed the power consumed by each task that a sensor has performed. The summation of the power consumed by every single task will give us the overall power consumed by a sensor, given by equation (17)

$PS_{j} = \sum_{i = 1}^{n} Pt_{i}$ (17)

where $Pt_{i}$ is the amount of power consumed by a specific task, n is the number of tasks, and i is the task ID.

A main feature that must be defined, in addition to the mission time, is the initial battery capacity (source of power), which is the value that the manufacturing company of the battery will provide. The battery capacity, after subtracting the power consumed by just operating a sensor without assigning any task to it along the whole mission time, is given by equation (18). The remaining power in the battery can be computed by subtracting the power consumed in a sensor from the battery capacity in that sensor as shown in equation (19)

$CBC_{j} = IBC_{j} - MT_{j} \times PIdle_{j}$ (18)

$RE_{j} = CBC_{j} - PS_{j}$ (19)

where $CBC_{j}$ is the battery capacity, $IBC_{j}$ is the initial battery capacity, $MT_{j}$ is the mission time, $PIdle_{j}$ is the power consumed while idle, $RE_{j}$ is the remaining power in the battery of a specific sensor, and $PS_{j}$ is the power consumption by a specific sensor. For sensor j, we will assume that the power consumption in the idle time is equal to zero.
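Equations (15) to (19) chain together into a small energy model, sketched below. The function names and all numeric values in the usage note are hypothetical; only the formulas follow the text.

```python
def energy_per_read(full_energy, number_of_reads):
    """Equation (15): energy required by a single read."""
    return full_energy / number_of_reads

def task_power(messages, mu):
    """Equation (16): power consumed by a task that sends `messages` messages."""
    return messages * mu

def sensor_power(task_powers):
    """Equation (17): total power consumed by the tasks run on one sensor."""
    return sum(task_powers)

def remaining_energy(ibc, mission_time, p_idle, ps):
    """Equations (18) and (19): remaining energy of one sensor."""
    cbc = ibc - mission_time * p_idle    # capacity left after idle operation
    return cbc - ps                      # subtract the power spent on tasks
```

As a usage example under assumed values, a battery of 1000 units that supports 500 reads gives µ = 2. Two tasks with 3 and 5 messages then consume 6 and 10 units, so with idle power set to zero, as assumed above, the sensor is left with 984 units.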

Another important feature that must be defined is the threshold of energy consumption by a task, as a sensor will be assigned multiple tasks; the threshold will be equal to the energy consumed by the shortest task. Thus, each round the value of the threshold will be different than the value in the previous one.

To prevent any sudden shutdown that will not achieve network coverage, a dynamic threshold to choose the suitable alternative sensor is demonstrated by equation (20). Therefore, if the remaining power for a specific alternative sensor is greater than the threshold, it will be chosen to perform the task. Otherwise, it will not receive tasks for the current round. Such operation is important to choose the tasks that would be scheduled to each sensor

$Thr_{z} = \frac{\sum_{j = 1}^{z} RE_{j}}{z}$ (20)

where $Thr_{z}$ is the average remaining power of sensors, and z is the number of sensors. Thus, after choosing the longest task to be assigned, a sensor will be chosen among all the sensors that can perform the job according to the threshold ($Thr_{z}$). Therefore, a sensor whose remaining power is at least $Thr_{z}$ will take the job.

Experiments and result

The experiments have been performed using IBM SPSS and WEKA tools to address the data mining tasks, such as preprocessing and clustering. Experimental results are twofold: error rate and energy consumption. The error rate indicates the performance of the prediction technique and its ability to estimate values, while the energy consumption experiments show the effectiveness of the proposed strategy in minimizing the energy consumption rate as compared to the traditional technique.

Datasets

To test the proposed methodology, two datasets have been used: Data Sensing Lab²⁹ and Intel Data Lab dataset (available at: http://db.csail.mit.edu/labdata/labdata.html). Both datasets have been collected using integrated WSNs. Table 1 shows a description of each dataset. The Data Sensing Lab dataset has two parts: second and third floor sub datasets. Since the two parts have different distributions, we considered them as separate datasets. Thus, the experiments have been performed on three datasets.

Table 1.

Dataset description.

	Data Sensing Lab	Intel Berkeley Data Lab
No. of sensors	40	54
Size	681,263 readings	2,300,000 readings
Distribution	Floors	Flat area
Platform	Wireless	Wireless
Technology	XBee	MICA2DOT

Results and discussion

This section aims to clarify the set of experiments that have been performed for the purpose of measuring the effectiveness of the proposed model. In other words, we provide evidence to test mainly two hypotheses: the proposed model achieves a low error rate in the prediction task, and the proposed model minimizes the power consumption rate. Before measuring the error and the power consumption rates, we will briefly illustrate the application of our methodology on the datasets.

First, we performed the best k clustering for the purpose of grouping spatially related sensors into clusters. Since we have three datasets, we applied our methodology to all of them. The results showed that the best k (the number of clusters) for the Data Sensing Lab dataset in the second floor is seven, while it was four for the third floor sensors. For the Intel Data Lab dataset, the best k was seven.

Figure 3 demonstrates cluster memberships; members are the sensors that are deployed in the second floor of Data Sensing Lab, and it shows to which cluster they belong, the distance between each member, and its cluster center.

Figure 3.

Cluster memberships: k = 7 clusters (Data Sensing Lab second floor).

The same calculations have been performed for the other datasets: third floor and Intel Data Lab datasets. For each dataset, the best k has been computed and the membership tables have been generated as well. Such information would help us proceed in applying our prediction model onto all datasets.

Next, we simulated the environment using our proposed logistic model. Such simulation provided us the opportunity to apply our algorithmic strategy: (1) measuring power consumption using normal flow, (2) predicting alternative load distribution, and (3) measuring power consumption after applying the proposed load balancing strategy.

Figure 4 shows a snapshot from the results of applying our methodology on the first dataset. This snapshot illustrates an example of four groups (4, 7, 9, and 35). The dependent variable is the sensor_id, which means we have to calculate three log odds (or Z in our previous description) because one of them is a reference category (reference value = 35), and the highest log odds value will be for the predicted outcome (category).

Figure 4.

Snapshot of parameter estimate for the multinomial logistic regression.

The second column in Figure 4 shows the intercepts, which are the coefficient values relative to the reference category 35 (in case other coefficients were set to zero), and the B values, which are the coefficients for categories 4, 7, and 9 of the dependent variable.

$Exp (b)$ (the 7th column in Figure 4) is the main effective measure for logistic regression; it shows the positive effect of the temperature variable, higher than 1.0, and the negative effect of the humidity variable, lower than 1.0, on the odds of the dependent variable sensor_id. The same experiment has been applied to the other datasets; we omit those results to avoid redundancy.

After collecting results of applying the prediction model on all datasets, we compared the efficiency of the proposed methodology by computing the error rate resulted from the prediction task and the amount of saved power consumption after applying our methodology. The following two sections show the results and discuss them in detail.

Error rate as a penalty for prediction

To find the error rate between a predicted value and an actual value, we used the relative error model, which divides the absolute difference between two floating-point values, the actual value $a$ and the value $p$ that comes from a predicted alternative sensor, by the maximum of their absolute values, to find the relative difference as follows

$Err_{R} = \frac{| a - p |}{\max (| a |, | p |)}$ (21)

Since our prediction model is predicting the sensor_id value, our contribution is to find an alternative sensor that can give us the data readings with some tolerance. We did not compute the error rate between the actual value of sensor_id and the predicted sensor_id value; instead, it has been computed by finding the relative difference among readings that are given by the actual sensor and the readings by the predicted sensor, taking into account the spatiotemporal characteristics within the dataset.
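Equation (21) and its use on paired readings can be sketched as follows. The handling of two zero readings and the function names are assumptions made for this example, since the text does not specify them.

```python
def relative_error(actual, predicted):
    """Equation (21): relative difference between two readings."""
    denom = max(abs(actual), abs(predicted))
    if denom == 0:
        # Assumption: two zero readings are treated as identical.
        return 0.0
    return abs(actual - predicted) / denom

def mean_relative_error(actual_readings, predicted_readings):
    """Average relative error over paired readings of the actual sensor
    and the predicted alternative sensor."""
    errors = [relative_error(a, p)
              for a, p in zip(actual_readings, predicted_readings)]
    return sum(errors) / len(errors)
```

For instance, an actual reading of 20.0 and a reading of 19.0 from the alternative sensor give a relative error of 1/20 = 0.05.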

To compare our proposed methodology with traditional logistic regression, we considered the case $k = 1$, which implies that the MLR has been applied on the whole dataset (single cluster). On the other hand, we computed the error rate at the best k for each dataset. Notice that the optimal k-value is seven for the first dataset, four for the second dataset, and seven for the last one. Table 2 shows the average error rate in each dataset according to the number of clusters. It demonstrates the reason behind choosing the number of clusters k for each dataset.

Table 2.

Average error rate according to k-value.

Dataset	k = 1 (original)	k = 2	k = 3	k = 4	k = 5	k = 6	k = 7
Data Sensing Lab second floor	0.035	0.029	0.026	0.0222	0.021	0.017	0.016
Data Sensing Lab third floor	0.035	0.035	0.027	0.024	NAP	NAP	NAP
Intel Data Lab	0.086	0.075	0.070	0.053	0.050	0.047	0.038

NAP: Not APplicable.

Bold values refer to the minimum error rate at each dataset.
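The selection of k from Table 2 can be reproduced with a few lines. The sketch below simply picks, for each dataset, the k with the lowest reported average error rate; the values are copied from Table 2, while the variable names are hypothetical and this is not necessarily the procedure used to determine the best k in the experiments.

```python
# Average error rates from Table 2; entries marked NAP are left out.
error_by_k = {
    "Data Sensing Lab second floor": {1: 0.035, 2: 0.029, 3: 0.026, 4: 0.0222,
                                      5: 0.021, 6: 0.017, 7: 0.016},
    "Data Sensing Lab third floor": {1: 0.035, 2: 0.035, 3: 0.027, 4: 0.024},
    "Intel Data Lab": {1: 0.086, 2: 0.075, 3: 0.070, 4: 0.053,
                       5: 0.050, 6: 0.047, 7: 0.038},
}

# For each dataset, keep the k with the smallest average error rate.
best_k = {name: min(errors, key=errors.get)
          for name, errors in error_by_k.items()}
```

The result is k = 7, k = 4, and k = 7 for the three datasets, which agrees with the optimal k-values stated in the text.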

Figures 5–7 demonstrate the average error rate for each sensor node in the three datasets, where each line illustrates a specific k-value. The average error rate fluctuated widely across sensor nodes. Intersections between lines indicate that the average error rate for that sensor node did not change when the k-value changed.

Figure 5.

Average error rate for each sensor node in Data Sensing Lab second floor dataset.

Figure 6.

Average error rate for each sensor node in Data Sensing Lab third floor dataset.

Figure 7.

Average error rate for each sensor node in Intel Data Lab dataset.

Finally, Table 3 shows the results of applying t-test analysis at the 95% confidence level. The results indicated that the difference between traditional load distribution (at k = 1) and our proposed load balancing strategy is significant in terms of error rate, that is, the difference between real values and predicted ones.

Table 3.

Results of statistical t-test.

Dataset	k-Value	M-value	SD-value	t-value	p-value	Conclusion
Data Sensing Lab second floor	1	0.0353	0.0138	4.69	<0.0001	Significant
	7	0.0157	0.139
Data Sensing Lab third floor	1	0.0351	0.0130	2.318	0.0275	Significant
	4	0.0237	0.0148
Intel Data Lab	1	0.0862	0.0258	6.998	<0.0001	Significant
	7	0.0376	0.0232

M-value: mean value of data; SD-value: standard deviation of data; t-value: measure of the difference with respect to data variation; p-value: probability of obtaining the t-value or higher. Conclusion: significant if the p-value is less than 5% (95% confidence level).
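The t-values in Table 3 can be approximately recomputed from the reported means and standard deviations. The sketch below assumes a two-sample t statistic with an unpooled standard error and one observation per sensor, so that n equals the number of sensors listed for the dataset (16 for the third floor, 25 for Intel Data Lab). The source does not state the test variant or the sample sizes, so both are assumptions, and the function name `two_sample_t` is illustrative.

```python
import math


def two_sample_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """t statistic for the difference between two independent sample means."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Means and standard deviations from Table 3.
# Assumption: n is the number of sensors in each dataset (not stated in the source).
t_third_floor = two_sample_t(0.0351, 0.0130, 16, 0.0237, 0.0148, 16)
t_intel = two_sample_t(0.0862, 0.0258, 25, 0.0376, 0.0232, 25)

print(f"third floor: t = {t_third_floor:.3f}")
print(f"Intel Data Lab: t = {t_intel:.3f}")
```

Under these assumptions the computed statistics fall within about 0.01 of the reported t-values of 2.318 and 6.998; the small gaps are consistent with rounding in the tabulated means and standard deviations.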

We noticed during the experiments that there is an inverse relation between the number of clusters and the p-value. In other words, the higher the number of clusters, the lower the p-value (that is, the stronger the significance). Furthermore, the number of clusters, according to our methodology, depends on the distances among sensors. Therefore, we can conclude that the spatial distribution of sensors affects the reliability of the results and the efficiency of the prediction model.

Energy consumption

To test the effectiveness of our proposed methodology in minimizing power consumption, achieved by cutting off redundant activities and distributing the processing load over the network, we simulated the environment and computed the energy consumption of each sensor in the dataset. Tables 4–6 show the normal consumption rate, the consumption rate after applying our prediction technique, the amount of reduction, and the percentage of enhancement.

Table 4.

Energy consumption and reduction in Data Sensing Lab second floor’s sensors.

Sensor	Normal consumption	Prediction-based consumption	Reduction	Enhancement (%)
1	47.4	26.8	20.6	43
2	271.7	271.68	0.02	0
3	122.32	122.32	0	0
4	80.21	80.21	0	0
6	45.4	43.92	1.48	3
7	6.2	1.85	4.35	70
8	33.3	30.2	3.1	9
9	32.7	16	16.7	51
10	30.8	16.83	13.97	45
11	32.6	29.9	2.7	8
12	32	32	0	0
13	22.6	22.6	0	0
14	20.73	17.54	3.19	15
15	37.11	37.11	0	0
16	33.04	16.76	16.28	49
21	33.42	33.42	0	0
22	235.44	235.44	0	0
26	51.91	0.26	51.65	99
31	33	33	0	0
32	44.5	44.5	0	0
33	44.6	23.4	21.2	48
34	37.11	23.5	13.61	37
35	30.1	30.1	0	0

Table 5.

Energy consumption and reduction in Data Sensing Lab third floor’s sensors.

Sensor	Normal consumption	Prediction-based consumption	Reduction	Enhancement (%)
5	52	43.31	8.69	17
17	48.7	48.68	0.02	0
18	37.3	29.8	7.5	20
19	41.79	30.87	10.92	26
20	21.23	18.88	2.35	11
23	50.18	50.18	0	0
24	51.9	46.67	5.23	10
25	45.73	30.19	15.54	34
27	20.99	20.99	0	0
28	44.96	44.96	0	0
29	44.86	44.86	0	0
30	46.31	40.82	5.49	12
36	26.19	11.02	15.17	58
37	12.93	8.16	4.77	37
38	15.36	5.67	9.69	63
40	14.08	14.08	0	0

Table 6.

Energy consumption and reduction in Intel Data Lab’s sensors.

Sensor	Normal consumption	Prediction-based consumption	Reduction	Enhancement (%)
1	107.35	76.78	30.57	28
2	110.02	110.02	0	0
3	114.91	76.81	38.1	33
4	98.37	33.76	64.61	66
6	77.94	29.47	48.47	62
7	129.57	129.57	0	0
8	48.51	18.18	30.33	63
9	113.96	54.74	59.22	52
10	103.83	55.32	48.51	47
11	102.43	102.43	0	0
12	57.72	5.34	52.38	91
13	77.56	7.14	70.42	91
14	79.15	11.15	68	86
15	6.08	4.25	1.83	30
16	76.06	52.12	23.94	31
17	81.58	43.64	37.94	47
19	90.29	90.29	0	0
20	83.38	12.18	71.2	85
21	154	46.18	107.82	70
22	157.58	157.58	0	0
23	146.27	95.72	50.55	35
24	129.18	62.24	66.94	52
25	122.65	122.65	0	0
26	128.13	128.13	0	0
27	21.74	15.2	6.54	30

The results in Table 4 show that the average reduction in energy consumption is 7.34 mV and the average enhancement rate is 21%. Sensors with zero reduction are those for which our proposed methodology failed to find a substitute. Furthermore, there is a significant difference between the total energy consumed by the traditional method and by the prediction-based one. The reason behind this difference is the removal of redundancy (i.e. tasks that are subsets of, or identical to, other tasks).

In the third floor dataset, the average reduction in energy consumption was lower than on the second floor: 5.33 mV, which is about 27% lower than the reduction on the second floor. Furthermore, the average enhancement rate was 18%, as compared to 21% on the second floor. Table 5 shows the energy consumption and reduction rate in the Data Sensing Lab third floor dataset.

Finally, Table 6 shows a considerably larger energy reduction for the last dataset, Intel Data Lab, than for the first two datasets. The average reduction in energy was 35.1 mV and the average enhancement rate was 40%.
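The averages quoted above follow directly from the table columns: the reduction is the normal consumption minus the prediction-based consumption, and the enhancement is that reduction as a percentage of the normal consumption. A minimal sketch using the rows of Table 5 follows; the variable and function names are illustrative.

```python
# (sensor id, normal consumption, prediction-based consumption) from Table 5
table5 = [
    (5, 52, 43.31), (17, 48.7, 48.68), (18, 37.3, 29.8), (19, 41.79, 30.87),
    (20, 21.23, 18.88), (23, 50.18, 50.18), (24, 51.9, 46.67), (25, 45.73, 30.19),
    (27, 20.99, 20.99), (28, 44.96, 44.96), (29, 44.86, 44.86), (30, 46.31, 40.82),
    (36, 26.19, 11.02), (37, 12.93, 8.16), (38, 15.36, 5.67), (40, 14.08, 14.08),
]


def summarize(rows):
    """Return (average reduction, average enhancement %, aggregate reduction %)."""
    reductions = [normal - predicted for _, normal, predicted in rows]
    enhancements = [100 * (normal - predicted) / normal for _, normal, predicted in rows]
    total_normal = sum(normal for _, normal, _ in rows)
    avg_reduction = sum(reductions) / len(rows)
    avg_enhancement = sum(enhancements) / len(rows)
    aggregate_pct = 100 * sum(reductions) / total_normal
    return avg_reduction, avg_enhancement, aggregate_pct


avg_reduction, avg_enhancement, aggregate_pct = summarize(table5)
print(f"average reduction: {avg_reduction:.3f}")
print(f"average enhancement: {avg_enhancement:.1f}%")
print(f"aggregate reduction: {aggregate_pct:.2f}%")
```

The per-sensor average weights every sensor equally, whereas the aggregate figure divides the total reduction by the total normal consumption, so high-consumption sensors with no substitute pull the aggregate percentage down.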

Conclusion and future work

One of the major problems that WSNs face is the limited energy source of sensors. In this research, we presented an energy-aware prediction model for WSNs that are connected to a cloud, which increases the heterogeneity of WSNs. We considered spatiotemporal characteristics by incorporating the spatial distribution of sensors into a modified version of the MLR model that is based on k-means clustering.

Furthermore, the results showed that predicting alternative sensors using our proposed model increased the lifetime and reduced the energy consumption of the networks. Moreover, we showed that considering the spatiotemporal aspect and applying multiple iterations of k-means clustering reduced the error rate between the actual readings and the ones induced from the predicted alternative sensors.

Based on the previous experiments, we can conclude that the modified MLR with k-means clustering achieved a 36.28% reduction in energy consumption in Intel Data Lab’s sensors, 14.86% in Data Sensing Lab third floor’s sensors, and 12.41% in Data Sensing Lab second floor’s sensors. Moreover, it reduced the error rate by 55.46% in Intel Data Lab’s sensors, 32.54% in Data Sensing Lab third floor’s sensors, and 55.53% in Data Sensing Lab second floor’s sensors, as compared to state-of-the-art multinomial regression. All comparisons were statistically significant using t-test analysis.

In future work, we intend to improve the method of finding the threshold value of the remaining power in a sensor, $Thr_{z}$, mentioned in the load balancing algorithm section, which is used to choose the alternative sensor that can perform the task. One of our recommendations for choosing this value is to compute the integral of the remaining power function, as shown in Figure 8.

Figure 8.

Area chart of remaining power in sensors.

where the area under the curve equals

$A = \int_{1}^{5} y \, dx$
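When the remaining power is only known at discrete points, the area under the curve can be approximated numerically. The sketch below uses the trapezoidal rule; the sample readings are hypothetical and are not taken from Figure 8, and the source does not specify how the threshold would be derived from the area, so only the integration step is shown.

```python
def trapezoid_area(xs, ys):
    """Approximate the integral of y over x with the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        area += width * (ys[i] + ys[i - 1]) / 2
    return area


# Hypothetical remaining-power readings at x = 1..5 (NOT the data behind Figure 8).
xs = [1, 2, 3, 4, 5]
ys = [4.0, 3.0, 5.0, 2.0, 1.0]

area = trapezoid_area(xs, ys)
print(f"approximate area A = {area}")
```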

Footnotes

Handling Editor: Luca Foschini

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Mohammed GH Al Zamil
