Abstract
Keywords
Introduction
With the global energy problem becoming more and more serious, all countries in the world do research on the smart grid.1–3 The ultimate goal of the smart grid builds a comprehensive power system covering the whole production process, including power generation, power transmission, power transformation, power distribution, power dispatch, and power consumption, which can be seen as a panoramic view of real-time systems. Supporting smart grid secure, self-healing, green, strong, and reliable operations are based on the acquisition, transmission, and storage of the panoramic view real-time data, as well as the rapid analysis of huge amount of accumulated and multi-source data. 4 Big data in the smart grid usually show five key characteristics, mainly including volume, velocity, variety, veracity, and value, which also have been widely applied to intelligent transportation, 5 business, 6 medicine, 7 building, 8 and so on. In order to achieve an integrated collection of the subject-oriented big data in the support of making decision, the term data warehouse (DW) is proposed by Inmon 9 in the 1990s. Generally, components of a DW are designed as a multi-layer architecture in Figure 1. What should be focused on is the integration of data sources with the extract–transform–load (ETL), which takes charge of the extraction of the data from huge heterogeneous data source, the transformation of these data, and their loading into the DW. Therefore, the appropriate design and maintenance of the ETL processes become key factors in the success of DW projects.10,11 In order to achieve the optimized project, designers must delve into the inherent amount of complexities of this environment and technical details, who mainly focus on the design of a workflow that extracts data from these sources, cleans any inconsistencies they may have, transforms them according to the target format, and finally loads them into the target data store. 12 Even though many design strategies have been proposed to address ETL processes, guides are not enough in many times. For the convenience of carrying their work out, the subjective decisions are simplified, which lead to serious DW loading performance problems or they only focus on the ETL model optimization.13–15 Therefore, in combination with existing ETL processes involving many activities organized as a workflow, it is necessary to give a fresher look on workflow schedulers.16,17 Recent literatures aim to achieve maximum possible parallelism among tasks at a level of a workflow while minimizing the system overheads and resource wastage, which focus on task scheduling strategies.12–17

The multi-layer architecture of data warehouse.
ETL processes include many processing workflows, in which there exist certain constraint relationships. How these processing workflows are efficiently scheduled is a key problem in the implementation of ETL, which plays a vital role in improving the development efficiency and source utilization rate of the DW. Workflow matching and scheduling problems can be considered as a non-deterministic polynomial (NP) problem, 18 and it is impossible to achieve the optimal solution to the problem. Conventional algorithms focus on the directed acyclic graph (DAG) of a workflow scheduling, which is different from multi-workflow scheduling of the ETL in the DW. In addition, above methods decrease the operation time by reducing the operation quantity and changing the operation order, which scarcely do research on the allocation and scheduling of ETL activities. A scheduling framework is put forward in the DW, even though the static scheduling, dynamic scheduling, and same layer division are carried out, and the accurate scheduling model and overall algorithm description are left out. 19 A greedy algorithm is also applied to the optimal workflow scheduling, which is limited to only a workflow and cannot guarantee the performance of the multi- workflow condition. 20 The derivation mode of the primary table is proposed to optimize the ETL process, and a pipeline optimization method is provided for the ETL operation, which is based on the premise that all ETL activities are constrained serially and lack the generality to some extent. 21 A possible physical implementation of an ETL workflow is put up, including logical-level description and an appropriate cost model as inputs, but which neglects the workflow operation in detail. 22 In order to search for alternative physical implementations with lower cost, this algorithm is extended by intentionally introducing sorting activities in the workflow, but comparative experiments are not shown, including workflow styles and the algorithm itself. 23 These drawbacks further motivate the improvement of ETL workflow schedulers, and efficient ETL operations have become a research topic to achieve the minimum of the ETL operation time and memory consumption.
This article is organized as follows: section “Background” describes the background needed to introduce our application platform, data characteristics, and ETL schedulers in China Southern Power Grid (CSG) big data. Then, the problem formation is proposed in combination with the background in section “Problem formulation.” Section “Scheduling algorithms for ETL workflows” puts up workflow scheduling algorithms for the workflow, including round-robin (RR) algorithm, minimum-cost (MC) algorithm, minimum-memory (MM) algorithm, and mixture of the MC and MM (MCM) algorithm. Aiming at subflows of a workflow, two algorithms are integrated into above algorithms in section “Operation of a workflow composed of subflows.” Finally, experiments are carried out in terms of proposed algorithms each other, comparison of different algorithms, and robustness performance evaluation.
Background
Application framework of CSG big data
Apart from the electrical aspect, the smart gird has become an interesting research topic for data scientists, which is mainly composed of information technology, computer technology, communication technology, transmission, and distribution power infrastructure. 2 Aiming to achieve steady availability of production control, operation management, status measurement, risk assessment, social economic situation analysis and prediction, and so on, new big data platform applications are being studied. 24 An application framework of CSG big data is set up and given in Figure 2. Sources of big data mainly come from energy management system (EMS), distribution management system (DMS), automatic measurement system (AMS), marketing management system (MMS), customer service system (CCS), geographic information system (GIS), weather prediction system (WPS), and social economy data (SED). These complex and huge data need to be preprocessed before transmitted to the core of the smart grid big data platform, and these processes can be called as data integration and fusion which play a crucial role in the ETL process. Among the core of the smart grid big data platform, data analysis, process, and management are involved with many aspects based on specific algorithms. Platform control mainly achieves the monitor, scheduling, management, and backup restore.

The application framework of CSG big data.
ETL
ETL process
During the ETL process, the valuable big data are extracted from online analytical processing databases, then, transformed to match the DW schema, and finally, loaded into the DW, as shown in Figure 3.14,25 As data sources are changing, the DW will be periodically updated, and the ETL process is not a one-time event. Therefore, it is concluded that the ETL process must be designed for easy modification; meanwhile, ETL operations can be scheduled properly. 26

The framework of integration and fusion.
The extraction step is responsible for extracting data from the source data, and each data source has its own characteristics which distinctly increase the extraction difficulty. It is necessary to extract them with different methods which are introduced in section “Big data analysis.” 27 The transformation step tends to make some cleaning and conforming on the incoming data to gain accurate data which are correct, complete, consistent, and unambiguous. This process includes data selection, separation, union, conversion, and collection, which defines the granularity of fact tables, dimension tables, DW schema, derived facts, slowly changing dimensions, and factless fact tables. All transformation rules and resulting schemas are described in the metadata repository. The final step is loading data which loads data to the target multidimensional structure. In this step, extracted and transformed data are written into the dimensional structures actually accessed by the end users and application systems. Meanwhile, loading step includes both loading dimension tables and loading fact tables.
Big data analysis
Big data are worthless in a vacuum, whose potential value is unlocked only when leveraged to drive the decision. To improve the decision efficiency, these complex data need to be converted into meaningful insights in the limited time and storage. In the following sections, we clarify analytical techniques of CSG big data based on the source data category, named as structured, semi-structured, and unstructured data. 28 Structured data in CSG mainly include EMS, DMS, AMS, MMS and CCS information, equipment information, and inventory data, which are stored in the dedicated and relational databases, like Oracle, MySQL, Postgres, and Teradata. Sqoop realizes the data interaction between the dedicated and relational database and Hadoop Distributed File System (HDFS). 29 Semi-structured data usually cannot be extracted directly but owns a certain structure itself, including the typical object exchange model (OEM), common knowledge base, word, pdf, email, and so on. 30 Flume NG is a distributive, reliable and available treatment method, which is widely applied to the collection, aggregation, and transmission of mass data. With the merit of the structure flexibility, the extraction efficiency is improved by many agents. Unstructured data refer to irregular data which cannot be described in two-dimensional logical table, especially like the monitor video, 95598 service audio, and GIS figure. Here, kettle strategy can adjust different structure data based on transformation and job script. The former realizes the data basis conversion, and the later achieves the overall workflow control. 31 In order to improve the real-time workflow efficiency, Apache frameworks of Storm, Spark, and Samza have been proposed to deal with complex specific requirements.
ETL workflows
ETL processes are composed of many complex ETL workflows, which are responsible for the DW maintenance. Therefore, optimal workflows must be guaranteed and scheduled properly. In the past, parallelism scheduling updates and queries in the real-time warehouse and the workflow scheduling in data stream management systems are studied.32–34 It is regrettable that no available and effective information are specially designed for workflows in their internals. Considering offline and batch cases, workflows on the entire ETL process from the source to the warehouse are studied.
In order to analyze the internal mechanism, the Aurora stream manager further divides the scheduling operations into more streams, 35 and the control objective minimizes the following criteria, including the operation time, latency time, and memory. A query is operated in a data stream system for the sake of minimizing the required memory. 36 The key step of the scheduling is to select an operation path which removes the largest data as soon as possible. Meanwhile, some algorithms are proposed to improve the system operation time. 37 ETL workflows can be designed as the stream processing style to achieve the optimization of the operation time and memory consumption. However, ETL workflows include many complex processes, and considering the offline ETL specificity, the optimal objective achieves the overall minimum operation time instead of only transmitting tuples to the ultimate users as soon as possible. At the same time, a workflow is divided into many subflows.
Problem formulation
A DAG models activities and recordsets on the basis of the overall layout of an ETL workflow. The term activities mean any software module that processes the incoming data. The term recordsets refer to any data storage that obeys a schema (like relational tables and record files). Recordsets and activities are logical abstractions of physical entities. Then, the logical-level workflow is refined at the physical level, and a combination of executable scripts that perform the ETL workflow is designed. Finally, each activity of the workflow is physically implemented using various algorithms, which is evaluated in terms of the operation time or memory consumption.
Workflow scheduler
In order to express clearly, we formally model an ETL workflow as a DAG

An ETL activity structure.
From Figure 4,
where
The memory size of the queue
In combination with above definitions and analysis, the ultimate objective is to find optimal scheduling methods for a workflow, which is described as follows:
The scheduling method divides
One of the objective functions is minimized in the following: (1)
In a word, the above problem focuses on searching for scheduling algorithms, which realizes the minimum of the operation time and maximum memory consumption.
Workflow division
Owing to a large workflow composed of many fragments, we can further divide the workflow into appropriately connected subflows, which will improve the parallel processing ability. In particular, advantages are shown in the multiprocessor and multiserver system. Every subflow has its own property which can make the pipelining of intermediate results between its activities, and it is not necessary to store them in the persistent storage. Thus, a shrinking version of the graph is generated as a side effect. Likewise, subflows become nodes of the shrinking graph. A division of the shrinking graph is composed of a set of disjoint subflows
With a subflow graph
Strata
Strata There is one incoming edge from There may be incoming edges from any stratum There is no other incoming edges.
The proposed improved sorting algorithms can achieve the above stratification, which are different from common topological structures and described in the following Pang et al. 38 Once the strata of the subflow graph have been recognized, each stratum can be activated in its own turn. At this time, subflows with the same strata can operate independently with different scheduling strategies. In combination with the above analysis, ETL workflows are divided into many subflows. In order to decrease the total operation time and memory consumption, relative algorithms are proposed for the ETL and subflows, respectively.
Scheduling algorithms for ETL workflows
Scholars classify existing scheduling algorithms for malleable parallel jobs into three categories, 39 including list algorithm, longest processing time algorithm, and optimizing the middle algorithm, which are all based on some criteria. Here, three genetic algorithms are introduced, namely, RR algorithm, MC algorithm, and MM algorithm, which should obey different criteria in Table 1. From Table 1, we find that which activity is favored each time when the scheduler is called, and before a new decision is scheduled, how long the selected activity will continue to operate.
Different criteria of three algorithms.
RR: round-robin; MC: minimum-cost; MM: minimum-memory.
RR
The RR algorithm deals with all activities fairly, which arranges time slices to the relevant activity in an order based on an unique identifier of every activity, and the pseudo-code is as follows
MC
The MC algorithm is proposed to reduce the operation time of ETL workflows. Therefore, communications among the activities are minimized as few as possible. In addition, all data of the selected activity should be ready for processing, especially, the largest input data number of the activity. Since there is no time slot, the selected activity processes all data from the scheduler in succession. In order to simplify the problem, it is assumed that all activities read data from an external source, which is available for the operation, and the pseudo-code is described as follows
MM
The MM algorithm schedules ETL workflows to improve the memory arrangement. In general, the MM algorithm consumes the biggest data number in an activity. In practice, the number of data that an activity consumes is the data that the activity removes from the memory, either by rejecting or writing them into a file for a specific portion of time. The smaller are the selectivity, large processing rate, and input size, the better is the scheduling of an activity. An index of the consumption rate in equation (1) is put up, with which the MM selects the activity with the biggest
MCM
In order to compare the performance of above algorithms, we choose the typical DMS, 95558 with structure analysis and GIS, for example, in CSG, mainly including District 1 (D1), District 2 (D2), and District 3 (D3). The overall amount of above data are collected and shown in Table 2.
Local data collection and statistics.
In combination with data characteristics, Sqoop, Flume NG, and Kettle are designed to deal with different types of data. In addition, RR, MC, and MM algorithms are applied to show the average operation time and memory consumption in Figures 5 and 6.

Average operation time of the RR, MC, and MM algorithms.

Average memory consumption (#row packs) of the RR, MC, and MM algorithms.
Experiment results clearly show that the MC algorithm is faster than the RR algorithm, and the MM algorithm is the lowest among three algorithms in terms of the average operation time, as shown in Figure 5. However, the MM algorithm is better than RR and MC algorithms in terms of the average memory consumption, as shown in Figure 6. Therefore, what we should deal with in practice is to balance the average operation time and memory consumption. Aiming at improving operation efficiency, the memory consumption can be permitted with the development of mass storage devices to some extent, and parallel processing strategies can be introduced. Consequently, workflows are designed to further be divided into subflows to boost the workflows with parallel operations at the expenses of the memory consumption, namely, the MCM algorithms. The MC algorithm is only used in simple workflows, and the MM algorithm is designed to address complex workflows involving memory consuming and blocking operations. The experiment results show that the MCM algorithm is superior to single algorithm, especially when the workflow is complicated and contains many tasks.
Operation of a workflow composed of subflows
The DAG is well known as a workflow model, which not only can reflect data dependency between workflows but also control dependency. 40 Thus, a subflow also involves this problem in the same way. However, there are few papers that automate the procedure of identifying subflows that can be operated in parallel, which mainly focus on the manual operation relying on designers’ experience. 41 In particular, the operation time must be reduced to guarantee system real-time requirements in the limited memory. In this section, a parallel scheduling mechanism (PSM) is presented in Figure 7.

Structure of a parallel scheduling mechanism.
As shown in Figure 7, the first step realizes the subflow prioritizing and then determines that some subflows become ready, which mainly includes the finish and arrival event of the subflow. After that, the following steps are carried out serially and then attempt to schedule subflows in the waiting queue based on the availability of the system source data. The PSM adopts simple workflow scheduling (SWS) in the subflow prioritizing stage, 42 which simply puts each ready subflow into the waiting queue, and gives out two algorithms which are shortest-subflow-first (SSF) and priority-based backfilling (PB).
SSF
The waiting queue scheduling stage in the PSM usually adopts the approach used in hybrid (HYBD) algorithm,
42
which calculates the rank value of each task according to the definition in Heterogeneous Earliest Finish Time (HEFT) algorithm.
43
The definition of rank is given below based on computing each task length of critical path from a task
where
In general, the scheduler sorts tasks in the descending order by the rank value when all tasks in the waiting queue come from the same subflow. If there are multiple subflows in the queue, tasks are sorted in the ascending order relying on their rank values. The strategy in Yu and Shi 42 is the same as the shortest-job-first (SJF) method, but which cannot deal with a subflow with many tasks efficiently because of low-rank tasks with different crossed subflows. Therefore, an algorithm named as SSF should enhance the SJF in the waiting queue scheduling stage to decrease the average turnaround time of all subflows, in which the scheduler calculates the estimation remaining operation time (EROT) of each subflow whenever a new subflow arrives. Then, tasks in the waiting queue are first sorted in the ascending order by the EROT of subflows they belong to. Finally, tasks coming from the same subflow are sorted in the descending order according to their rank values. Each time a task becomes ready, it is simply put into the appropriate position among the same subflow tasks in the waiting queue according to its rank value. The algorithm for the task prioritizing of SSF is described in the equation below, and the QuickSort function with a customized compare function is designed for the waiting queue. The compare function first aims at judging whether tasks belong to the same subflow. If so, the task with higher rank value has higher priority. Otherwise, the task of a subflow with shorter EROT will achieve a higher priority. The algorithm for calculating EROT (Cal_EROT) in SSF first collects ready tasks for the subflow and then finds the ready task with the highest rank subflow. Finally, the task is mapped to the source that produces the minimal EROT and checks whether any descendants of the task become ready until all the tasks have been mapped. The algorithm is described in detail as follows
PB
Algorithms proposed in Yu and Shi 42 result in reducing system CPU utilization, in which there exists idle space before additional ones are available because the next task requirements cannot be met by free processors. Therefore, it is necessary to improve the resource utilization and guarantee the overall system performance. Different backfilling strategies are proposed to reduce resource fragmentation by permitting tasks to run out of order as long as they do not delay certain tasks. 44 Common backfilling rearranges the waiting queue based on the task EROT. In this section, a waiting queue based on the priority is put forward with no more than one profile of every task in a linked list, and the profile mainly owns three attributes, including the estimation start time (EST), estimation finish time (EFT), and estimation allocated cluster (EAC). The improved PB algorithm is presented in the following. First, an empty linked list is initialized for possible profile storage. Second, considering their orders of all tasks in the waiting queue, the first future time instant is found in the linked list when enough resources are available in some clusters, and EST, EFT, EAC, and the linked list are updated. Third, a new task becomes ready and is put into the waiting queue at each time, and the scheduler will recreate the profile and update the estimation information of tasks. Finally, the scheduler allocates tasks in the ascending order of their EST instead of their priorities.
Experimental results and analysis
According to CSG data characteristics, a simulation platform is built based on Red Hat Linux 6.4 operating system. The computer configuration is shown in Table 3, and the number of this computer is 20. All modules are implemented in C++ programming. Therefore, the simulation can be easily extended and ported to other research platforms on various embedded devices. First, proposed algorithms are internally evaluated and compared themselves by means of the operation time and memory consumption indexes. The operation time is the total elapsed time for a workflow application from workflow submission to completion, including the waiting time, and the memory consumption means the memory requirements of the workflow operation. Then, in combination with the HYBD and online workflow management (OWM), 42 the average operation time and scheduling length ratio (SLR) indexes are introduced to show their performance. The SLR is the ratio of a workflow operation time over its best possible scheduling length, which is designed to evaluate the performance of scheduling algorithms without the workflow size variation and defined as the value of dividing the operation time and critical path length (CPL) of a workflow. In addition, the ratio of shortest operation time among all workflows is cited, which is designed for comparing different algorithms and measures the percentage of workflows that achieve the shortest operation time for each evaluated algorithm. 43
Computer configuration.
Evaluation of proposed algorithms
In order to evaluate algorithms in section “Operation of a workflow composed of subflows,” we must combine them with algorithms in section “Scheduling algorithms for ETL workflows” to show the overall performance. Owing to the feasibility and practicality of the MCM algorithm, we put up the minimum-cost and minimum-memory with shortest-subflow-first algorithm (MCMS), the minimum-cost and minimum-memory with priority-backfilling algorithm (MCMP), and the minimum-cost and minimum-memory with shortest-subflow-first and priority-backfilling algorithm (MCMSP). As before, the operation time and memory consumption are chosen as the evaluation index, and the data in Table 2 are used, which are effective and convincing, as shown in Figures 8 and 9.

Average operation time of the MCM, MCMS, MCMP, and MCMSP algorithms.

Average memory consumption (#row packs) of the MCM, MCMS, MCMP, and MCMSP algorithms.
In combination with subflows, operation time savings go up to average 60% from Figure 8 compared with results in Figure 5; especially, the operation efficiency shows better when the amount of the input size becomes bigger. From Figure 9, there is no quantitative relationship between the input size and average memory consumption. The MCMSP algorithm is the best among proposed algorithms, and the MCM with subflow algorithms are better than only MCM algorithm in general. The MCM algorithm is integrated with the SSF and PB algorithms, respectively, which shows the similar simulation results and improves the overall system performance. In a word, the superior performance attributes to proper subflow scheduling algorithms, and we will further research on speeding up workflow division and switching algorithms. In order to further clarify advantages of the MCMSP algorithm, we present an example of three subflows to be scheduled on three clusters of different computing speeds and different numbers of processors. Figure 10 shows that structures of three subflows and the necessary information of each task are described in Table 4, where the numbers after the decimal point are the number of processors, and Table 5 shows the task ranking results based on equations (1) and (2).

Three example subflows.
Task execution time on different clusters.
Task ranking results.
As shown in Figure 11, the width of each part is the number of processors, and the length of each part means the required operation time. Advantages of the MCMSP algorithm are shown in shadows compared with the HYBD algorithm, which mainly includes the SSF and PB algorithm merits. In addition, the operation time and memory consumption with the MCMSP algorithm is clearly released, but in which the CPU utilization keeps high value and cooling conditions must be furnished.

Comparison of the HYBD and MCMSP algorithms.
Comparison of different algorithms
In order to evaluate the effectiveness of proposed methods, we compare them with previous common algorithms HYBD and OWM in literatures. The mean interarrival time is changed to investigate their influences on the performance of proposed algorithms. 45
From Figures 12 and 13, the MCMSP algorithm is the best among proposed algorithms in terms of the average operation time and SLR. The average operation time of HYBD algorithm even exceeds 1600 ms, which is more than the others. The MCMS and MCMP algorithms show similar system performance, which only changes some operation stages in the algorithm. In a word, the average operation time decreases with the mean interarrival time. However, in combination with the ratio of the shortest operation time among workflows, the MCMSP algorithm achieves 100% in most cases and only obtains 85% with a singular value when the mean interarrival time is 500. In addition, during the experiments, the number of backfilling activities is calculated and we find that the PB algorithm usually happens when the system is more crowded with shorter interarrival time, but more occurrences of priority backfilling do not mean superior performance improvement because earlier operations of some tasks do not always release their time if the start times of these tasks on the critical path keep unvaried.

Comparison of average operation time with different algorithms.

Comparison of average SLR with different algorithms.
Robust performance
The operation time plays a great effect on scheduling algorithms, but which usually cannot be known in advance because of uncertainties existing in some applications. Thus, inaccurate operation time of workflows needs to be evaluated, which can be chosen randomly in the following range when the mean interarrival time is set 100 s 42
where the unit task (UT) is the EROT of the unit task. For instance, the chosen EROT is randomly from 1 to 500 when the uncertainty is chosen as 200 and UT is 100. Figures 14 and 15 further show the robust performance with the system uncertainties.

Average operation time with uncertainties in different algorithms.

Average SLR with uncertainties in different algorithms.
From the above figures, the MCMSP algorithm is superior to the others in terms of the average operation time and SLR indexes when the uncertainty varies from 100% to 500%. With the growth of uncertainties, the average operation time and SLR are both increased effectively. The smaller the average operation time and SLR are, the better the system performance shows. In a word, the proposed algorithms can improve the system performance, and the system robustness also can be guaranteed.
Conclusion
In this article, a new framework of a big data platform in CSG is built, of which the integration and fusion play a key role in the system performance. The RR, MC, and MM algorithms are proposed to improve the workflow operation time and memory consumption. The MCM algorithm is developed based on above algorithms, and meanwhile, the workflow is divided into many subflows. The SSF and PB algorithms mainly focus on the waiting queue schedulers of subflows, which are integrated into the MCM algorithm. Experiments are carried out in terms of algorithms themselves which are compared with each other and system robust performance, which prove that the proposed MCMSP algorithm is the best among all algorithms and improves the system performance greatly.
In the future, we may focus on other parts of the PSM, like task prioritizing, rearrangement and allocation. In addition, different styles of much more data will be used to evaluate the effectiveness and stability of the proposed algorithms.
