
Abstract

Applying big data techniques to power systems contributes to the sustainable development and robust operation of China Southern Power Grid; a new framework for the China Southern Power Grid big data platform is therefore needed. Apart from key technologies such as data analysis, data processing, and data visualization, data integration and fusion in the data warehouse play an important role in high-quality data analysis and mining. To minimize operation time and memory consumption, several scheduling strategies for extract–transform–load workflows are proposed, including the round-robin algorithm, the minimum-cost algorithm, the minimum-memory algorithm, and a mixture of the minimum-cost and minimum-memory algorithms. In combination with these algorithms, a workflow is divided into many subflows, which are scheduled with the shortest-subflow-first and priority-backfilling algorithms to further improve parallel computation. Three combined algorithms for scheduling subflows are then established: minimum-cost and minimum-memory with shortest-subflow-first, minimum-cost and minimum-memory with priority-backfilling, and minimum-cost and minimum-memory with both shortest-subflow-first and priority-backfilling. Finally, in view of the characteristics of China Southern Power Grid big data, several performance indexes are used to evaluate these algorithms. The experimental results show that the minimum-cost and minimum-memory with shortest-subflow-first and priority-backfilling algorithm is superior to the hybrid prioritization algorithm based on the rank level of each task (hybrid), online workflow management, minimum-cost and minimum-memory with shortest-subflow-first, and minimum-cost and minimum-memory with priority-backfilling, and that system robustness is also significantly improved.

Keywords

China Southern Power Grid, extract–transform–load, minimum-cost and minimum-memory, shortest-subflow-first, priority-backfilling

Introduction

As the global energy problem becomes more serious, countries around the world are carrying out research on the smart grid.^1–3 The ultimate goal of the smart grid is to build a comprehensive power system covering the whole production process, including power generation, transmission, transformation, distribution, dispatch, and consumption, which can be seen as a panoramic real-time system. Secure, self-healing, green, strong, and reliable smart grid operation depends on the acquisition, transmission, and storage of panoramic real-time data, as well as on the rapid analysis of huge amounts of accumulated multi-source data.⁴ Big data in the smart grid usually show five key characteristics, namely volume, velocity, variety, veracity, and value; big data with these characteristics have also been widely applied in intelligent transportation,⁵ business,⁶ medicine,⁷ building,⁸ and so on. To achieve an integrated, subject-oriented collection of big data in support of decision making, the term data warehouse (DW) was proposed by Inmon⁹ in the 1990s. Generally, the components of a DW are organized in a multi-layer architecture, as shown in Figure 1. The focus here is the integration of data sources through extract–transform–load (ETL), which is responsible for extracting data from huge heterogeneous data sources, transforming these data, and loading them into the DW.

Therefore, the appropriate design and maintenance of ETL processes are key factors in the success of DW projects.^10,11 To optimize a project, designers must delve into the inherent complexity and technical details of this environment; they mainly focus on designing a workflow that extracts data from the sources, cleans any inconsistencies, transforms the data into the target format, and finally loads them into the target data store.¹² Even though many design strategies have been proposed for ETL processes, the available guidance is often insufficient. To simplify their work, designers make subjective decisions, which lead to serious DW loading performance problems, or they focus only on optimizing the ETL model.^13–15 Because existing ETL processes involve many activities organized as a workflow, it is necessary to take a fresh look at workflow schedulers.^16,17 Recent literature focuses on task scheduling strategies that aim to achieve the maximum possible parallelism among tasks at the level of a workflow while minimizing system overheads and resource wastage.^12–17

Figure 1.

The multi-layer architecture of data warehouse.

ETL processes include many processing workflows, among which certain constraint relationships exist. How to schedule these processing workflows efficiently is a key problem in the implementation of ETL and plays a vital role in improving the development efficiency and resource utilization rate of the DW. Workflow matching and scheduling can be considered a non-deterministic polynomial-time (NP) hard problem,¹⁸ so an optimal solution generally cannot be computed efficiently. Conventional algorithms focus on scheduling a single workflow modeled as a directed acyclic graph (DAG), which differs from the multi-workflow scheduling of ETL in the DW. In addition, these methods decrease the operation time by reducing the number of operations and changing their order, and they scarcely address the allocation and scheduling of ETL activities. A scheduling framework has been put forward for the DW; although static scheduling, dynamic scheduling, and same-layer division are carried out, an accurate scheduling model and an overall algorithm description are left out.¹⁹ A greedy algorithm has also been applied to optimal workflow scheduling, but it is limited to a single workflow and cannot guarantee performance in the multi-workflow case.²⁰ A derivation mode of the primary table has been proposed to optimize the ETL process, together with a pipeline optimization method for ETL operations; it rests on the premise that all ETL activities are constrained serially and therefore lacks generality to some extent.²¹ A possible physical implementation of an ETL workflow has been proposed that takes a logical-level description and an appropriate cost model as inputs, but it neglects the details of workflow operation.²² To search for alternative physical implementations with lower cost, this algorithm was extended by intentionally introducing sorting activities into the workflow, but comparative experiments covering workflow styles and the algorithm itself are not shown.²³ These drawbacks further motivate the improvement of ETL workflow schedulers, and efficient ETL operation that minimizes operation time and memory consumption has become a research topic.

This article is organized as follows. Section “Background” describes the background needed to introduce our application platform, data characteristics, and ETL schedulers for China Southern Power Grid (CSG) big data. The problem is then formulated on the basis of this background in section “Problem formulation.” Section “Scheduling algorithms for ETL workflows” presents scheduling algorithms for a workflow, including the round-robin (RR) algorithm, the minimum-cost (MC) algorithm, the minimum-memory (MM) algorithm, and a mixture of the MC and MM algorithms (MCM). For the subflows of a workflow, two further algorithms are integrated with the above algorithms in section “Operation of a workflow composed of subflows.” Finally, experiments compare the proposed algorithms with each other and with other algorithms and evaluate their robustness.

Background

Application framework of CSG big data

Beyond its electrical aspects, the smart grid, which mainly comprises information technology, computer technology, communication technology, and transmission and distribution power infrastructure, has become an interesting research topic for data scientists.² New big data platform applications are being studied to achieve steady availability of production control, operation management, status measurement, risk assessment, social economic situation analysis and prediction, and so on.²⁴ An application framework of CSG big data is shown in Figure 2. The big data mainly come from the energy management system (EMS), distribution management system (DMS), automatic measurement system (AMS), marketing management system (MMS), customer service system (CCS), geographic information system (GIS), weather prediction system (WPS), and social economy data (SED). These huge and complex data need to be preprocessed before being transmitted to the core of the smart grid big data platform; this preprocessing is called data integration and fusion and plays a crucial role in the ETL process. In the core of the smart grid big data platform, data analysis, processing, and management involve many aspects based on specific algorithms. Platform control mainly provides monitoring, scheduling, management, and backup and restore.

Figure 2.

The application framework of CSG big data.

ETL

ETL process

During the ETL process, valuable big data are extracted from online analytical processing databases, transformed to match the DW schema, and finally loaded into the DW, as shown in Figure 3.^14,25 Because data sources change, the DW is updated periodically, and the ETL process is not a one-time event. Therefore, the ETL process must be designed for easy modification, and ETL operations must be scheduled properly.²⁶

Figure 3.

The framework of integration and fusion.

The extraction step is responsible for extracting data from the data sources. Each data source has its own characteristics, which considerably increases the difficulty of extraction, so different extraction methods are needed; these are introduced in section “Big data analysis.”²⁷ The transformation step cleans and conforms the incoming data to obtain data that are correct, complete, consistent, and unambiguous. This process includes data selection, separation, union, conversion, and collection, and it defines the granularity of fact tables, dimension tables, the DW schema, derived facts, slowly changing dimensions, and factless fact tables. All transformation rules and resulting schemas are described in the metadata repository. The final step loads the data into the target multidimensional structure: the extracted and transformed data are written into the dimensional structures actually accessed by end users and application systems. The loading step includes loading both dimension tables and fact tables.

Big data analysis

Big data are worthless in a vacuum; their potential value is unlocked only when they are leveraged to drive decisions. To improve decision efficiency, these complex data need to be converted into meaningful insights within limited time and storage. In the following, we describe the analytical techniques for CSG big data according to the category of the source data, namely structured, semi-structured, and unstructured data.²⁸ Structured data in CSG mainly include EMS, DMS, AMS, MMS, and CCS information, equipment information, and inventory data, which are stored in dedicated relational databases such as Oracle, MySQL, Postgres, and Teradata. Sqoop handles the data exchange between these relational databases and the Hadoop Distributed File System (HDFS).²⁹ Semi-structured data usually cannot be extracted directly but have a certain structure of their own, including the typical object exchange model (OEM), common knowledge bases, Word documents, PDF files, email, and so on.³⁰ Flume NG is a distributed, reliable, and available tool that is widely applied to the collection, aggregation, and transmission of mass data. Thanks to its flexible structure, extraction efficiency is improved by running many agents. Unstructured data refer to irregular data that cannot be described in a two-dimensional logical table, such as monitoring video, 95598 service audio, and GIS figures. Here, Kettle can handle data with different structures by means of transformation and job scripts. The former performs the basic data conversion, and the latter controls the overall workflow.³¹ To improve real-time workflow efficiency, the Apache frameworks Storm, Spark, and Samza have been proposed to deal with complex specific requirements.

ETL workflows

ETL processes are composed of many complex ETL workflows, which are responsible for maintaining the DW. Therefore, workflows must be optimized and scheduled properly. Previous work has studied the parallel scheduling of updates and queries in real-time warehouses and workflow scheduling in data stream management systems.^32–34 Regrettably, none of these approaches is specifically designed for the internals of ETL workflows. In this article, workflows covering the entire ETL process from the sources to the warehouse are studied for the offline, batch case.

To analyze the internal mechanism, the Aurora stream manager further divides the scheduling operations into more streams,³⁵ with the objective of minimizing criteria including the operation time, latency, and memory. In another approach, a query is operated in a data stream system so as to minimize the required memory;³⁶ the key step of the scheduling is to select an operation path that removes the largest amount of data as soon as possible. Some algorithms have also been proposed to improve the system operation time.³⁷ ETL workflows can be designed in a stream processing style to optimize operation time and memory consumption. However, ETL workflows include many complex processes, and given the specific nature of offline ETL, the objective is to minimize the overall operation time rather than merely to deliver tuples to the end users as soon as possible. In addition, a workflow is divided into many subflows.

Problem formulation

A DAG models the activities and recordsets that make up the overall layout of an ETL workflow. The term activity means any software module that processes incoming data. The term recordset refers to any data store that obeys a schema (such as relational tables and record files). Recordsets and activities are logical abstractions of physical entities. The logical-level workflow is then refined at the physical level, and a combination of executable scripts that perform the ETL workflow is designed. Finally, each activity of the workflow is physically implemented using various algorithms and evaluated in terms of operation time or memory consumption.

Workflow scheduler

For clarity, we formally model an ETL workflow as a DAG $G(V, E)$, where $V = V_{A} \cup V_{R}$, $V_{A}$ is the set of activities of the graph, and $V_{R}$ is the set of recordsets. To describe the status of nodes, the sets of candidates and finishers are introduced, so that $V = V_{CAN} \cup V_{FIN}$. An edge $(v, w) \in E$ is a provider relationship denoting that $w$ receives data from the node $v$ for further processing; $v$ is then called a producer of $w$, and $w$ a consumer of $v$. A generic ETL activity is described in Figure 4.

Figure 4.

An ETL activity structure.

In Figure 4, $μ(v)$ and $σ_{v}$ are the consumption rate and selectivity of the node $v$, respectively; $Q(v)$ and $O(v)$ are the sets of all input and output queues of the node $v$, respectively; $queue_{t}^{i}$ is the size of the $i$th input queue of an activity $v$ at time $t$; and $queue_{t}(v)$ is the sum of all input queue sizes of an activity $v$ at time $t$. We take $T$ to be an infinite countable set of time points, which can be divided into intervals $T = T_{1} \cup T_{2} \cup \cdots$ with $T_{i} = [T_{i}.start, T_{i}.last]$ and $T_{i}.last = T_{i+1}.start - 1$. The consumption rate is the amount of memory freed per time unit by the activity of a node in the workflow and is defined as follows

$μ (v) = \frac{Q (v) - O (v)}{T_{v}}$ (1)

where $T_{v}$ is the time interval of the activity of the node v. The selectivity shows how selective the node v is as follows

$σ_{v} = \frac{O (v)}{Q (v)}$ (2)
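As a small worked illustration of equations (1) and (2), the following Python sketch reads $Q(v)$ and $O(v)$ as the number of tuples that the node $v$ takes from its input queues and writes to its output queues during an interval of length $T_{v}$. That numeric reading, the function names, and the example figures are assumptions made for this illustration only.

```python
def consumption_rate(tuples_in: float, tuples_out: float, interval: float) -> float:
    """Equation (1): tuples removed from memory per time unit."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return (tuples_in - tuples_out) / interval


def selectivity(tuples_in: float, tuples_out: float) -> float:
    """Equation (2): fraction of the input tuples that the node passes on."""
    if tuples_in <= 0:
        raise ValueError("tuples_in must be positive")
    return tuples_out / tuples_in


# Hypothetical filter activity: reads 1000 tuples, emits 250, in 5 time units.
mu = consumption_rate(1000, 250, 5)   # 150.0
sigma = selectivity(1000, 250)        # 0.25
```

A highly selective activity (small $σ_{v}$) that runs quickly therefore has a large consumption rate.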

The memory size of a queue $q$ at time $t$ is denoted $size_{t}(q)$, and the maximum memory size that a queue $q$ may occupy at any time is $MaxM(q)$. In addition, the data volume of each source recordset node $v$ at time $t$ is denoted $vol_{t}(v)$. Given many ETL workflows, the scheduler must determine which activity should be operated next and how long the time slot of its operation is. The function $active(\cdot)$ returns the activity node to run next. To keep the queues effective, the status of all queues must be evaluated so that no data are lost from overloaded queues. The time $T_{i}.last$ marks the end of the current operation $active(T_{i})$. A new node is assigned when one of the following conditions is met. First, the time slot is exhausted, that is, $t = T_{i}.last$. Second, there are no more data in the input queues, that is, $queue_{t}(active(T_{i})) = 0$. Third, one of the consumers of the active activity has a full input queue. It must also be decided whether a node $v$ is moved to $V_{FIN}$. A node $v$ is moved to $V_{FIN}$ either when $v$ is an empty source recordset, or when $v = active(T_{i})$ and (1) all nodes feeding $v$ with data have exhausted their input and (2) the queues of $v$ have been emptied, which can be described as follows

$V_{FIN} = V_{FIN} \cup \{active(T_{i})\} \quad \text{if} \quad \begin{cases} vol_{t}(v) = 0 \\ \text{(a) } \forall w \in pro(v),\ w \in V_{FIN} \quad \text{(b) } queue_{T_{i}.last}(v) = 0 \end{cases}$ (3)
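The finish test in equation (3) can be sketched in Python as follows. The `Node` class, its field names, and the example nodes are hypothetical and introduced only for this illustration; the sketch assumes that a source recordset is finished when its remaining volume is zero and that any other node is finished when all of its providers are finished and its input queues are empty.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    is_source: bool = False
    volume: int = 0      # vol_t(v): data left in a source recordset
    queue: int = 0       # queue_t(v): total size of the input queues
    providers: list = field(default_factory=list)  # pro(v)


def is_finished(v: Node, finished: set) -> bool:
    """Return True if v may be moved to V_FIN according to equation (3)."""
    if v.is_source:
        return v.volume == 0
    all_providers_done = all(w.name in finished for w in v.providers)
    return all_providers_done and v.queue == 0


src = Node("orders", is_source=True, volume=0)
flt = Node("filter", queue=0, providers=[src])
join = Node("join", queue=3, providers=[src])

finished = set()
if is_finished(src, finished):
    finished.add(src.name)
filter_done = is_finished(flt, finished)   # providers done and queue empty
join_done = is_finished(join, finished)    # queue still holds data
```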

In combination with above definitions and analysis, the ultimate objective is to find optimal scheduling methods for a workflow, which is described as follows:

The scheduling method divides T into time intervals $T = T_{1} \cup T_{2} \cup \cdot \cdot \cdot \cup T_{last}$ .

$\forall t \in T, \forall v \in V, \forall q \in Q(v): size_{t}(q) \leq MaxM(q)$

One of the following objective functions is minimized: (1) $T_{last}$, the time interval in which the operation of the workflow ends, and (2) $\max_{t \in T} \sum_{v \in V} queue_{t}(v)$, the maximum total queue size over time.

In short, the problem is to find scheduling algorithms that minimize the operation time and the maximum memory consumption.
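To make the constraint and the two objectives concrete, the following Python sketch evaluates a recorded schedule. It assumes, purely for illustration, that the schedule is available as a list `trace` with one entry per time step mapping each queue name to $size_{t}(q)$, and that the number of steps stands in for $T_{last}$; the function name and the data are hypothetical.

```python
def evaluate_schedule(trace, max_mem):
    """trace[t][q] is size_t(q); max_mem[q] is MaxM(q).

    Returns (feasible, makespan, peak_memory).
    """
    # Constraint: no queue ever exceeds its memory bound.
    feasible = all(
        size <= max_mem[q] for step in trace for q, size in step.items()
    )
    makespan = len(trace)  # objective (1), stands in for T_last
    # Objective (2): largest total queue size observed at any time step.
    peak_memory = max(sum(step.values()) for step in trace)
    return feasible, makespan, peak_memory


trace = [{"q1": 4, "q2": 0}, {"q1": 2, "q2": 3}, {"q1": 0, "q2": 1}]
result = evaluate_schedule(trace, {"q1": 5, "q2": 5})
tight = evaluate_schedule(trace, {"q1": 3, "q2": 5})
```

Two schedulers can then be compared on the same workflow by comparing their makespan and peak memory, provided both traces are feasible.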

Workflow division

Because a large workflow is composed of many fragments, we can further divide it into appropriately connected subflows, which improves the parallel processing ability; the advantages are particularly clear in multiprocessor and multiserver systems. Within a subflow, intermediate results can be pipelined between its activities, so it is not necessary to store them in persistent storage. As a side effect, a shrunken version of the graph is generated in which the subflows become nodes. A division of the graph consists of a set of disjoint subflows $SF = \{F_{1}, F_{2}, \dots, F_{k}\}$ and the edges connecting them. A subflow requires its entire input to be available before it starts. Since subflows do not share edges, blocking operators form the boundaries between subflows. Thus, the subflow graph $G_{S}(V_{S}, E_{S})$ is derived from the original graph $G(V, E)$: the nodes of $V$ are replaced with subflow nodes, and the edges between subflows $F_{i}$ and $F_{j}$ are replaced with an edge $e_{S}(F_{i}, F_{j})$.

Once the subflow graph $G_{S}(V_{S}, E_{S})$ has been obtained, mutually independent subflows, which can be operated at the same time, need to be identified. A stratification of the subflow graph can be obtained recursively, and the subflow nodes are assigned to successive strata of operation as follows:

Stratum $S_{0}$ consists of the start nodes of the subflow graph, namely, the sources of the DW.

Stratum $S_{i + 1}$ consists of the nodes of $V_{S}$ that meet the following requirements:

There is an incoming edge from $S_{i}$.

There may be incoming edges from any stratum $S_{j}$ as long as $j < i + 1$.

There are no other incoming edges.

The improved sorting algorithm described in Pang et al.,³⁸ which differs from common topological sorting, can achieve the above stratification. Once the strata of the subflow graph have been recognized, each stratum can be activated in turn, and subflows in the same stratum can operate independently with different scheduling strategies. On the basis of the above analysis, ETL workflows are divided into many subflows. To decrease the total operation time and memory consumption, corresponding algorithms are proposed for the ETL workflows and for the subflows, respectively.
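The stratification rule can be illustrated with a short Python sketch. It is a plain longest-path layering built on a topological traversal, not the improved sorting algorithm of Pang et al.; the function name, the subflow labels, and the edges are hypothetical. A node receives stratum 0 if it has no incoming edge, and otherwise one more than the highest stratum among its predecessors, which satisfies the three requirements listed above.

```python
from collections import defaultdict


def stratify(nodes, edges):
    """Map each subflow node to its stratum index."""
    preds = defaultdict(list)
    succs = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
        indegree[v] += 1

    # Stratum 0: nodes without incoming edges (the sources).
    stratum = {n: 0 for n in nodes if indegree[n] == 0}
    ready = list(stratum)
    while ready:
        u = ready.pop()
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:  # every predecessor has been placed
                stratum[v] = 1 + max(stratum[p] for p in preds[v])
                ready.append(v)
    if len(stratum) != len(nodes):
        raise ValueError("subflow graph contains a cycle")
    return stratum


nodes = ["F1", "F2", "F3", "F4", "F5"]
edges = [("F1", "F3"), ("F2", "F3"), ("F3", "F4"), ("F1", "F4"), ("F2", "F5")]
layers = stratify(nodes, edges)
```

In this example F3 and F5 share stratum 1 and could be operated at the same time, while F4 has to wait for F3.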

Scheduling algorithms for ETL workflows

Existing scheduling algorithms for malleable parallel jobs have been classified into three categories,³⁹ namely the list algorithm, the longest processing time algorithm, and the optimizing the middle algorithm, all of which are based on certain criteria. Here, three generic algorithms are introduced, namely, the RR, MC, and MM algorithms, which follow the different criteria listed in Table 1. Table 1 shows which activity is favored each time the scheduler is called and how long the selected activity continues to operate before a new scheduling decision is made.

Table 1.

Different criteria of three algorithms.

Algorithm	Pick next	Reschedule when
RR	Operator ID	Input queue is exhausted
MC	Maximum size of input queue	Input queue is exhausted
MM	Maximum consumption rate	Time slot

RR: round-robin; MC: minimum-cost; MM: minimum-memory.

RR

The RR algorithm treats all activities fairly: it assigns time slices to the activities in an order based on the unique identifier of every activity. The pseudo-code is as follows

$\begin{array}{l} \text{Input: } V_{CAN} \text{ contains activities} \\ \text{Output: next activity } RR\_next \\ \text{Begin} \\ \quad \text{Return } V_{CAN}.pop \\ \text{End} \end{array}$
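A minimal Python sketch of the RR selection is given below. It assumes that the candidates are kept in a double-ended queue ordered by identifier and that a selected activity is appended again at the tail so that it receives another turn later; the re-append step and the activity names are assumptions of this sketch, since the pseudo-code above only pops the next candidate.

```python
from collections import deque


def rr_next(candidates: deque):
    """Return the activity at the head of the queue and move it to the tail."""
    activity = candidates.popleft()
    candidates.append(activity)
    return activity


v_can = deque(sorted(["A3", "A1", "A2"]))    # ordered by identifier
order = [rr_next(v_can) for _ in range(4)]   # A1, A2, A3, A1
```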

MC

The MC algorithm is proposed to reduce the operation time of ETL workflows, so communication among the activities is kept to a minimum. In addition, all data of the selected activity should be ready for processing; the activity with the largest amount of input data is selected. Since there is no time slot, the selected activity processes its input data in succession until its input queue is exhausted. To simplify the problem, it is assumed that all activities read data from an external source that is available for the operation. The pseudo-code is as follows

$\begin{array}{l} \text{Input: } V_{CAN} \text{ contains activities} \\ \text{Output: next activity } MC\_next \\ \text{Begin} \\ \quad Maxinput = -1 \\ \quad \text{For } v \in V_{CAN} \text{ do} \\ \quad\quad \text{If } Maxinput < v_{Q} \text{ then} \\ \quad\quad\quad MC\_next = v \\ \quad\quad\quad Maxinput = v_{Q} \\ \quad\quad \text{End If} \\ \quad \text{End For} \\ \quad \text{Return } MC\_next \\ \text{End} \end{array}$
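The MC selection can be sketched in Python as a single maximum search. The sketch assumes that the candidates are given as a dictionary from an activity name to its total input queue size $v_{Q}$; the names and sizes are hypothetical.

```python
def mc_next(candidates):
    """Return the candidate with the largest input queue, or None if empty.

    candidates maps an activity name to its total input queue size.
    """
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


choice = mc_next({"extract": 120, "filter": 480, "load": 60})
```

As in the pseudo-code, which uses a strict comparison, the first activity encountered wins when several activities have the same input size.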

MM

The MM algorithm schedules ETL workflows so as to improve memory usage. In general, the MM algorithm favors the activity that consumes the largest amount of data. In practice, the amount of data that an activity consumes is the amount of data that the activity removes from memory, either by rejecting them or by writing them into a file, within a specific portion of time. The smaller the selectivity and the larger the processing rate and input size of an activity, the better a candidate it is for scheduling. The consumption rate in equation (1) serves as the index, and the MM algorithm selects the activity with the largest $μ(v)$ at every scheduling step. Note that when the ETL workflow starts and no activity has yet been operated, the activity with the largest input size is selected. The pseudo-code is as follows

$\begin{array}{l} \text{Input: } V_{CAN} \text{ contains activities} \\ \text{Output: next activity } MM\_next \\ \text{Begin} \\ \quad Maxinput = -1 \\ \quad μ = -\infty \\ \quad \text{For } v \in V_{CAN} \text{ do} \\ \quad\quad \text{If } μ < v_{μ} \text{ then} \\ \quad\quad\quad MM\_next = v \\ \quad\quad\quad μ = v_{μ} \\ \quad\quad \text{End If} \\ \quad\quad \text{If } Maxinput < v_{Q} \text{ then} \\ \quad\quad\quad MC\_next = v \\ \quad\quad\quad Maxinput = v_{Q} \\ \quad\quad \text{End If} \\ \quad \text{End For} \\ \quad \text{If } μ \leq 0 \text{ then} \\ \quad\quad MM\_next = MC\_next \\ \quad \text{End If} \\ \quad \text{Return } MM\_next \\ \text{End} \end{array}$
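A Python sketch of the MM selection, including the fallback to the largest input size when no positive consumption rate is available, is given below. The candidate representation as a dictionary from an activity name to a pair (consumption rate, input size) and the example values are assumptions made for this illustration.

```python
def mm_next(candidates):
    """Return the candidate with the largest consumption rate.

    candidates maps an activity name to (consumption_rate, input_size).
    If no candidate has a positive rate, fall back to the MC rule and
    return the candidate with the largest input size.
    """
    if not candidates:
        return None
    by_rate = max(candidates, key=lambda v: candidates[v][0])
    if candidates[by_rate][0] > 0:
        return by_rate
    return max(candidates, key=lambda v: candidates[v][1])


warm = mm_next({"sort": (5.0, 300), "filter": (40.0, 100)})
cold = mm_next({"sort": (0.0, 300), "filter": (0.0, 100)})
```

With known rates the filter is preferred because it frees memory fastest; at start-up, when all rates are zero, the activity with the largest input is chosen instead.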

MCM

To compare the performance of the above algorithms, we take typical CSG data as an example: DMS data, 95558 files with structure analysis, and GIS data, collected from District 1 (D1), District 2 (D2), and District 3 (D3). The daily amounts of these data are shown in Table 2.

Table 2.

Local data collection and statistics.

Data type	Data source	District	Daily amount of data (GB)
Structured data	3-min section file of DMS	D1	0.70
		D2	0.66
		D3	0.55
		Subtotal	1.91
Semi-structured data	95558 file with structure analysis	D1	1.28
		D2	1.03
		D3	0.65
		Subtotal	2.96
Unstructured data	GIS single line drawing	D1	1.57
		D2	1.38
		D3	0.86
		Subtotal	3.81
Total			8.68
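The subtotals and the total in Table 2 can be checked directly from the per-district figures. The short Python sketch below does this with exact decimal arithmetic; the dictionary keys are labels chosen for this illustration.

```python
from decimal import Decimal

# Daily amount of data (GB) for D1, D2, and D3, copied from Table 2.
daily_gb = {
    "structured": ["0.70", "0.66", "0.55"],
    "semi-structured": ["1.28", "1.03", "0.65"],
    "unstructured": ["1.57", "1.38", "0.86"],
}

subtotals = {kind: sum(map(Decimal, values)) for kind, values in daily_gb.items()}
total = sum(subtotals.values())
```

The computed subtotals are 1.91, 2.96, and 3.81 GB, and the total is 8.68 GB, in agreement with the table.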

According to the data characteristics, Sqoop, Flume NG, and Kettle are used to handle the different types of data. In addition, the RR, MC, and MM algorithms are applied, and their average operation time and memory consumption are shown in Figures 5 and 6.

Figure 5.

Average operation time of the RR, MC, and MM algorithms.

Figure 6.

Average memory consumption (#row packs) of the RR, MC, and MM algorithms.

The experimental results clearly show that the MC algorithm is faster than the RR algorithm and that the MM algorithm is the slowest of the three algorithms in terms of average operation time, as shown in Figure 5. However, the MM algorithm is better than the RR and MC algorithms in terms of average memory consumption, as shown in Figure 6. In practice, therefore, the average operation time and the memory consumption must be balanced. When the aim is to improve operation efficiency, a higher memory consumption can be tolerated to some extent given the development of mass storage devices, and parallel processing strategies can be introduced. Consequently, workflows are further divided into subflows so that parallel operation speeds up the workflows at the expense of memory consumption; this is the MCM algorithm. In it, the MC algorithm is used only for simple workflows, and the MM algorithm is used for complex workflows involving memory-consuming and blocking operations. The experimental results show that the MCM algorithm is superior to either single algorithm, especially when the workflow is complicated and contains many tasks.

Operation of a workflow composed of subflows

The DAG is a well-known workflow model that reflects not only the data dependencies but also the control dependencies between workflows,⁴⁰ and the same holds for subflows. However, few papers automate the identification of subflows that can be operated in parallel; most rely on manual operation based on the designers’ experience.⁴¹ In particular, the operation time must be reduced to meet the real-time requirements of the system within limited memory. In this section, a parallel scheduling mechanism (PSM) is presented, as shown in Figure 7.

Figure 7.

Structure of a parallel scheduling mechanism.

As shown in Figure 7, the first step prioritizes the subflows and then determines which subflows have become ready, mainly on the basis of subflow finish and arrival events. The following steps are then carried out serially and attempt to schedule the subflows in the waiting queue according to the availability of system resources. The PSM adopts simple workflow scheduling (SWS) in the subflow prioritizing stage,⁴² which simply puts each ready subflow into the waiting queue, and provides two algorithms, shortest-subflow-first (SSF) and priority-based backfilling (PB).

SSF

The waiting queue scheduling stage in the PSM usually adopts the approach used in the hybrid (HYBD) algorithm,⁴² which calculates the rank value of each task according to the definition in the Heterogeneous Earliest Finish Time (HEFT) algorithm.⁴³ The rank of a task $t_{i}$ is the length of the critical path from $t_{i}$ to the exit task and is defined as

$rank(t_{i}) = w_{i} + \max_{t_{j} \in succ(t_{i})} (c_{i, j} + rank(t_{j}))$ (4)

where $succ (t_{i})$ is the set of immediate successors of task $t_{i}$ , $c_{i, j}$ is the average communication cost of edge $(i, j)$ among all possible pairs of clusters, and $w_{i}$ is the average computation cost of task $t_{i}$ on all possible clusters. The computation of a rank starts from the exit task and traverses up along the task graph recursively. Thus, the rank is called upward rank, and the rank of the exit task $t_{exit}$ is as follows

$rank (t_{exit}) = w_{exit}$ (5)
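Equations (4) and (5) translate directly into a memoized recursion. The Python sketch below is an illustration under assumed inputs: the dictionaries `w`, `c`, and `succ` and the four-task graph are hypothetical, and the costs are taken as already averaged over the clusters.

```python
from functools import lru_cache


def upward_ranks(w, c, succ):
    """Compute the upward rank of every task.

    w[t] is the average computation cost of task t, c[(t, s)] the average
    communication cost of edge (t, s), and succ[t] the immediate successors.
    """

    @lru_cache(maxsize=None)
    def rank(t):
        if not succ.get(t):
            return w[t]  # equation (5): exit task
        # equation (4): own cost plus the most expensive path to the exit
        return w[t] + max(c[(t, s)] + rank(s) for s in succ[t])

    return {t: rank(t) for t in w}


w = {"t1": 3, "t2": 2, "t3": 4, "t4": 1}
c = {("t1", "t2"): 1, ("t1", "t3"): 2, ("t2", "t4"): 1, ("t3", "t4"): 1}
succ = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"]}
ranks = upward_ranks(w, c, succ)
by_priority = sorted(ranks, key=ranks.get, reverse=True)
```

Here the exit task t4 has rank 1, t2 and t3 have ranks 4 and 6, and the entry task t1 has rank 11, so sorting by descending rank yields t1, t3, t2, t4.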

In general, the scheduler sorts tasks in descending order of rank value when all tasks in the waiting queue come from the same subflow. If there are multiple subflows in the queue, tasks are sorted in ascending order of rank value. The strategy in Yu and Shi⁴² is thus similar to the shortest-job-first (SJF) method, but it cannot deal efficiently with a subflow containing many tasks, because low-rank tasks from different subflows become interleaved. Therefore, an algorithm named SSF is introduced to enhance SJF in the waiting queue scheduling stage and decrease the average turnaround time of all subflows. In SSF, the scheduler calculates the estimated remaining operation time (EROT) of each subflow whenever a new subflow arrives. Tasks in the waiting queue are first sorted in ascending order of the EROT of the subflows they belong to, and tasks from the same subflow are then sorted in descending order of rank value. Each time a task becomes ready, it is simply put into the appropriate position among the tasks of the same subflow in the waiting queue according to its rank value. The task prioritizing algorithm of SSF is given in the pseudo-code below; it applies the QuickSort function with a customized compare function to the waiting queue. The compare function first checks whether the two tasks belong to the same subflow. If so, the task with the higher rank value has the higher priority. Otherwise, the task whose subflow has the shorter EROT has the higher priority. The algorithm for calculating the EROT (Cal_EROT) in SSF first collects the ready tasks of the subflow and then finds the ready task with the highest rank. The task is then mapped to the resource that yields the minimal EROT, and the algorithm checks whether any descendants of the task have become ready; this is repeated until all tasks have been mapped. The algorithm is described in detail as follows

$\begin{array}{l}
Q_{s}: \text{Waiting queue.} \\
V_{s}: \text{Subflow.} \\
C: \text{Profile of clusters.} \\
t_{i}, t_{j}: \text{A task in the waiting queue.} \\
AF: \text{An aging factor for adjusting workflow priority.} \\
STP: \text{A specified time period for activating the aging mechanism.} \\
TRS: \text{The set of tasks ready for execution.} \\
TNS: \text{The set of tasks not yet ready.} \\
ERT: \text{The estimated remaining time for the subflow's operation.} \\
\text{FFJ\_Prioritizing}(Q_{s}) \\
1 \quad \text{Begin} \\
2 \quad \quad \text{QuickSort}(Q_{s}, \text{Compare}); \\
3 \quad \text{End} \\
\text{Compare}(t_{i}, t_{j}) \\
1 \quad \text{Begin} \\
2 \quad \quad s_{i} = \text{the subflow } t_{i} \text{ belongs to}; \\
3 \quad \quad s_{j} = \text{the subflow } t_{j} \text{ belongs to}; \\
4 \quad \quad d_{i} = \text{Cal\_EROT}(s_{i}, C) - (\text{current time} - \text{submission time of } s_{i}) \cdot AF / STP; \\
5 \quad \quad d_{j} = \text{Cal\_EROT}(s_{j}, C) - (\text{current time} - \text{submission time of } s_{j}) \cdot AF / STP; \\
6 \quad \quad r_{i} = \text{rank of } t_{i}; \\
7 \quad \quad r_{j} = \text{rank of } t_{j}; \\
8 \quad \quad \text{If } s_{i} = s_{j} \text{ then} \\
9 \quad \quad \quad \text{Return } (r_{i} > r_{j}) \\
10 \quad \quad \text{else} \\
11 \quad \quad \quad \text{Return } (d_{i} < d_{j}) \\
12 \quad \quad \text{End If} \\
13 \quad \text{End} \\
\text{Cal\_EROT}(V_{s}, C) \\
1 \quad \text{Begin} \\
2 \quad \quad TRS = \emptyset; \\
3 \quad \quad TNS = \emptyset; \\
4 \quad \quad ERT = 0; \\
5 \quad \quad \text{For each task } t_{i} \in V_{s} \text{ that has not been allocated for operation do} \\
6 \quad \quad \quad \text{If all ancestors of } t_{i} \text{ have been allocated for operation then} \\
7 \quad \quad \quad \quad TRS = TRS \cup \{t_{i}\}; \\
8 \quad \quad \quad \text{else} \\
9 \quad \quad \quad \quad TNS = TNS \cup \{t_{i}\}; \\
10 \quad \quad \quad \text{End If} \\
11 \quad \quad \text{End for} \\
12 \quad \quad \text{While } TRS \neq \emptyset \text{ do} \\
13 \quad \quad \quad \text{Select } t_{i} \in TRS \text{ where } t_{i} \text{ has the highest rank value}; \\
14 \quad \quad \quad \text{MapToBestResource}(t_{i}, C); \\
15 \quad \quad \quad \text{Update estimated finish time of } t_{i}; \\
16 \quad \quad \quad TRS = TRS \setminus \{t_{i}\}; \\
17 \quad \quad \quad \text{If estimated finish time of } t_{i} > ERT \text{ then} \\
18 \quad \quad \quad \quad ERT = \text{estimated finish time of } t_{i}; \\
19 \quad \quad \quad \text{End If} \\
20 \quad \quad \quad \text{For each descendant } t_{j} \text{ of } t_{i} \text{ do} \\
21 \quad \quad \quad \quad \text{If none of its ancestors is in } TNS \text{ then} \\
22 \quad \quad \quad \quad \quad TNS = TNS \setminus \{t_{j}\}; \\
23 \quad \quad \quad \quad \quad TRS = TRS \cup \{t_{j}\}; \\
24 \quad \quad \quad \quad \text{End If} \\
25 \quad \quad \quad \text{End for} \\
26 \quad \quad \text{End while} \\
27 \quad \quad \text{Return } ERT \\
28 \quad \text{End}
\end{array}$
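The SSF listing above can be approximated by the following minimal Python sketch. It is not the authors' C++ implementation. `cal_erot` is a simplified stand-in for Cal_EROT that assumes each cluster runs one task at a time and ignores communication costs and processor counts. The edges of subflow A, the EROT of subflow B, and the values of AF and STP are hypothetical; only the execution times and ranks are taken from Tables 4 and 5.

```python
from functools import cmp_to_key


def cal_erot(tasks, preds, rank, exec_time):
    """Simplified Cal_EROT: estimated remaining time of one subflow.

    tasks     -- unallocated tasks of the subflow
    preds     -- dict: task -> set of unallocated predecessors
    rank      -- dict: task -> rank value
    exec_time -- dict: task -> {cluster: execution time}
    Assumption: one task per cluster at a time, no communication cost.
    """
    free_at = {cl: 0.0 for times in exec_time.values() for cl in times}
    finish, pending, ert = {}, set(tasks), 0.0
    while pending:
        # Ready tasks: every predecessor has already been mapped.
        ready = [t for t in sorted(pending)
                 if all(p in finish for p in preds.get(t, ()))]
        t = max(ready, key=lambda x: rank[x])  # highest rank first
        after = max((finish[p] for p in preds.get(t, ())), default=0.0)
        # MapToBestResource: the cluster giving the earliest finish time.
        best = min(sorted(free_at),
                   key=lambda cl: max(free_at[cl], after) + exec_time[t][cl])
        finish[t] = max(free_at[best], after) + exec_time[t][best]
        free_at[best] = finish[t]
        pending.remove(t)
        ert = max(ert, finish[t])
    return ert


def make_compare(subflow_of, rank, erot, submitted, now, af, stp):
    """Build the SSF compare function (negative result: first task goes first)."""
    def aged(s):
        # EROT minus the aging term, as in lines 4 and 5 of Compare.
        return erot[s] - (now - submitted[s]) * af / stp

    def compare(ti, tj):
        si, sj = subflow_of[ti], subflow_of[tj]
        if si == sj:
            return (rank[ti] < rank[tj]) - (rank[ti] > rank[tj])  # higher rank first
        return (aged(si) > aged(sj)) - (aged(si) < aged(sj))      # shorter EROT first
    return compare


exec_time = {"A1": {"C1": 10, "C2": 15, "C3": 5}, "A2": {"C1": 15, "C2": 5, "C3": 10},
             "A3": {"C1": 5, "C2": 15, "C3": 25}, "A4": {"C1": 15, "C2": 15, "C3": 15}}
rank = {"A1": 75, "A2": 40, "A3": 50, "A4": 15, "B1": 75, "B2": 45}
preds_a = {"A2": {"A1"}, "A3": {"A1"}, "A4": {"A2", "A3"}}  # assumed edges
erot = {"A": cal_erot(["A1", "A2", "A3", "A4"], preds_a, rank, exec_time),
        "B": 40.0}  # EROT of subflow B is a made-up value
subflow_of = {"A2": "A", "A3": "A", "B1": "B", "B2": "B"}
compare = make_compare(subflow_of, rank, erot, {"A": 0, "B": 0},
                       now=0, af=1.0, stp=10.0)
queue = sorted(["B1", "A2", "A3", "B2"], key=cmp_to_key(compare))
print(erot["A"], queue)  # 25.0 ['A3', 'A2', 'B1', 'B2']
```

Python's `sorted` stands in for QuickSort here; the ordering is determined by the compare function, not by the sorting algorithm.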

PB

The algorithms proposed in Yu and Shi⁴² can reduce system CPU utilization: processors sit idle when the free processors cannot meet the requirements of the next task, and they remain idle until additional processors become available. Therefore, it is necessary to improve resource utilization while guaranteeing overall system performance. Different backfilling strategies have been proposed to reduce resource fragmentation by permitting tasks to run out of order as long as they do not delay certain tasks.⁴⁴ Common backfilling rearranges the waiting queue based on the task EROT. In this section, a priority-based waiting queue is put forward in which each task has at most one profile in a linked list, and each profile has three attributes: the estimated start time (EST), the estimated finish time (EFT), and the estimated allocated cluster (EAC). The improved PB algorithm proceeds as follows. First, an empty linked list is initialized for profile storage. Second, for each task in the waiting queue, taken in queue order, the first future time instant at which enough resources are available in some cluster is found in the linked list, and the EST, EFT, EAC, and the linked list are updated. Third, each time a new task becomes ready and is put into the waiting queue, the scheduler recreates the profiles and updates the estimation information of the tasks. Finally, the scheduler allocates tasks in ascending order of their EST instead of their priorities.
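The profile construction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain Python list replaces the linked list, task dependencies are ignored, and ties between clusters are broken by the earlier finish time. The function names and the three-task example are hypothetical.

```python
def usage_at(reservations, t):
    """Processors reserved at time t; reservations are (start, end, procs)."""
    return sum(p for (s, e, p) in reservations if s <= t < e)


def earliest_start(reservations, capacity, procs, duration):
    """First time >= 0 at which `procs` processors stay free for `duration`."""
    if procs > capacity:
        return None
    # Free capacity can only increase when a reservation ends.
    for s in sorted({0.0} | {e for (_, e, _) in reservations}):
        # Usage can only rise at reservation starts, so check s and those.
        points = [s] + [rs for (rs, _, _) in reservations if s < rs < s + duration]
        if all(usage_at(reservations, t) + procs <= capacity for t in points):
            return s
    return None


def build_profiles(queue, need, exec_time, capacity):
    """Give each task, taken in priority order, one profile (EST, EFT, EAC)."""
    reserved = {cl: [] for cl in capacity}
    profiles = {}
    for t in queue:
        options = []
        for cl in sorted(capacity):
            est = earliest_start(reserved[cl], capacity[cl], need[t], exec_time[t][cl])
            if est is not None:
                options.append((est, est + exec_time[t][cl], cl))
        est, eft, eac = min(options)  # earliest start, then earliest finish
        reserved[eac].append((est, eft, need[t]))
        profiles[t] = (est, eft, eac)
    return profiles


def dispatch_order(profiles):
    """Allocate tasks in ascending order of EST rather than priority."""
    return sorted(profiles, key=lambda t: profiles[t][0])


# Hypothetical example: one cluster with 8 processors, queue in priority order.
capacity = {"C1": 8}
need = {"T1": 6, "T2": 6, "T3": 2}
exec_time = {"T1": {"C1": 10}, "T2": {"C1": 10}, "T3": {"C1": 5}}
profiles = build_profiles(["T1", "T2", "T3"], need, exec_time, capacity)
print(dispatch_order(profiles))  # ['T1', 'T3', 'T2']
```

In this example T3 has the lowest priority but fits into the two processors left idle beside T1, so it is backfilled ahead of T2 without delaying T2's reserved start.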

Experimental results and analysis

According to CSG data characteristics, a simulation platform is built on the Red Hat Linux 6.4 operating system. The computer configuration is shown in Table 3, and 20 such computers are used. All modules are implemented in C++, so the simulation can be easily extended and ported to other research platforms on various embedded devices. First, the proposed algorithms are evaluated and compared with one another in terms of the operation time and memory consumption indexes. The operation time is the total elapsed time of a workflow application from submission to completion, including the waiting time, and the memory consumption is the memory required by the workflow operation. Then, for comparison with the HYBD and online workflow management (OWM) algorithms,⁴² the average operation time and scheduling length ratio (SLR) indexes are introduced. The SLR is the ratio of a workflow's operation time to its best possible scheduling length; it evaluates the performance of scheduling algorithms independently of workflow size and is defined as the operation time divided by the critical path length (CPL) of the workflow. In addition, the ratio of shortest operation time among all workflows is used to compare different algorithms; it measures the percentage of workflows for which each evaluated algorithm achieves the shortest operation time.⁴³
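The two comparison indexes defined above can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the operation times are made-up numbers rather than experimental results, and counting ties for every algorithm that reaches the minimum is an assumption.

```python
def slr(operation_time, cpl):
    """Scheduling length ratio: operation time divided by critical path length."""
    if cpl <= 0:
        raise ValueError("critical path length must be positive")
    return operation_time / cpl


def shortest_time_ratio(times):
    """Fraction of workflows on which each algorithm attains the shortest time.

    times -- dict: algorithm -> list of operation times, one per workflow,
             with all lists in the same workflow order. Ties count for
             every algorithm that reaches the minimum.
    """
    algorithms = list(times)
    n = len(times[algorithms[0]])
    wins = dict.fromkeys(algorithms, 0)
    for k in range(n):
        best = min(times[a][k] for a in algorithms)
        for a in algorithms:
            if times[a][k] == best:
                wins[a] += 1
    return {a: wins[a] / n for a in algorithms}


# Hypothetical operation times (ms) of two algorithms on three workflows.
times = {"HYBD": [120, 90, 200], "MCMSP": [100, 90, 150]}
ratios = shortest_time_ratio(times)
print(slr(150, 100), ratios["MCMSP"])  # 1.5 1.0
```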

Table 3.

Computer configuration.

Name	Model and parameters
CPU	Intel Xeon E5-2692, 16 Core Processor
RAM	32 GB
HDD	300 GB*4, SAS 15 K

Evaluation of proposed algorithms

In order to evaluate the algorithms in section “Operation of a workflow composed of subflows,” we must combine them with the algorithms in section “Scheduling algorithms for ETL workflows” to show the overall performance. Owing to the feasibility and practicality of the MCM algorithm, we construct the minimum-cost and minimum-memory with shortest-subflow-first algorithm (MCMS), the minimum-cost and minimum-memory with priority-backfilling algorithm (MCMP), and the minimum-cost and minimum-memory with shortest-subflow-first and priority-backfilling algorithm (MCMSP). As before, the operation time and memory consumption are chosen as the evaluation indexes, and the data in Table 2 are used. The results are shown in Figures 8 and 9.

Figure 8.

Average operation time of the MCM, MCMS, MCMP, and MCMSP algorithms.

Figure 9.

Average memory consumption (#row packs) of the MCM, MCMS, MCMP, and MCMSP algorithms.

With subflows, the operation time savings in Figure 8 reach about 60% on average compared with the results in Figure 5, and the improvement is more pronounced as the input size grows. Figure 9 shows no clear quantitative relationship between the input size and the average memory consumption. The MCMSP algorithm is the best among the proposed algorithms, and MCM combined with the subflow algorithms generally outperforms MCM alone. Integrating MCM with the SSF and PB algorithms separately (MCMS and MCMP) yields similar simulation results, and both improve the overall system performance. In short, the superior performance is attributable to proper subflow scheduling algorithms, and we will further study how to speed up the workflow division and switching algorithms. In order to further clarify the advantages of the MCMSP algorithm, we present an example of three subflows to be scheduled on three clusters with different computing speeds and different numbers of processors. Figure 10 shows the structures of the three subflows, and the necessary information for each task is given in Table 4, where the number after the decimal point in each task label is the number of processors required. Table 5 shows the task ranking results based on equations (1) and (2).

Figure 10.

Three example subflows.

Table 4.

Task execution time on different clusters.

	A1.4	A2.5	A3.5	A4.6	B1.7	B2.5	B3.7	B4.6	B5.5	B6.6	C1.5	C2.7	C3.4	C4.7	C5.4	C6.9	C7.8
C1	10	15	5	15	15	10	5	20	5	5	15	20	15	15	25	15	5
C2	15	5	15	15	15	15	5	10	15	5	15	5	20	5	10	20	5
C3	5	10	25	15	30	5	5	15	10	5	15	5	10	10	10	25	5

Table 5.

Task ranking results.

	A1	A2	A3	A4	B1	B2	B3	B4	B5	B6	C1	C2	C3	C4	C5	C6	C7
Rank	75	40	50	15	75	45	50	35	25	5	105	80	55	35	25	20	5
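As a quick consistency check between Tables 4 and 5, the sketch below averages each task's execution time over the three clusters and lists the tasks whose Table 5 rank equals that average, which is what equation (5) predicts for exit tasks. The cluster keys are renamed to avoid clashing with task names C1 to C7, and whether the listed tasks are in fact the exit tasks of Figure 10 is an assumption that cannot be confirmed from the tables alone.

```python
tasks = ["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4", "B5", "B6",
         "C1", "C2", "C3", "C4", "C5", "C6", "C7"]

# Execution times from Table 4, one row per cluster.
table4 = {
    "cluster C1": [10, 15, 5, 15, 15, 10, 5, 20, 5, 5, 15, 20, 15, 15, 25, 15, 5],
    "cluster C2": [15, 5, 15, 15, 15, 15, 5, 10, 15, 5, 15, 5, 20, 5, 10, 20, 5],
    "cluster C3": [5, 10, 25, 15, 30, 5, 5, 15, 10, 5, 15, 5, 10, 10, 10, 25, 5],
}

# Ranks from Table 5.
table5 = dict(zip(tasks, [75, 40, 50, 15, 75, 45, 50, 35, 25, 5,
                          105, 80, 55, 35, 25, 20, 5]))

# Sum of execution times per task; the average w_i is this sum divided by 3.
totals = {t: sum(row[i] for row in table4.values()) for i, t in enumerate(tasks)}

# rank == w_i  <=>  3 * rank == total (integer comparison avoids rounding).
rank_equals_w = [t for t in tasks if 3 * table5[t] == totals[t]]
print(rank_equals_w)  # ['A4', 'B6', 'C6', 'C7']

# Every rank is at least w_i, as the upward-rank recursion requires.
assert all(3 * table5[t] >= totals[t] for t in tasks)
```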

As shown in Figure 11, the width of each block is the number of processors, and the length of each block is the required operation time. The shaded areas show the advantages of the MCMSP algorithm over the HYBD algorithm, which stem mainly from the SSF and PB algorithms. In addition, the operation time and memory consumption are clearly reduced with the MCMSP algorithm; however, CPU utilization remains high, so adequate cooling must be provided.

Figure 11.

Comparison of the HYBD and MCMSP algorithms.

Comparison of different algorithms

In order to evaluate the effectiveness of the proposed methods, we compare them with the common HYBD and OWM algorithms from the literature. The mean interarrival time is varied to investigate its influence on the performance of the algorithms.⁴⁵

From Figures 12 and 13, the MCMSP algorithm is the best among the compared algorithms in terms of the average operation time and SLR. The average operation time of the HYBD algorithm even exceeds 1600 ms, which is more than that of the others. The MCMS and MCMP algorithms show similar system performance, since each changes only some operation stages of the algorithm. Overall, the average operation time decreases as the mean interarrival time increases. In terms of the ratio of the shortest operation time among workflows, the MCMSP algorithm achieves 100% in most cases and drops to 85% only at a single point, when the mean interarrival time is 500. In addition, the number of backfilling activities was counted during the experiments. Priority backfilling occurs more often when the system is more crowded, that is, with shorter interarrival times. However, more occurrences of priority backfilling do not necessarily mean a larger performance improvement, because running some tasks earlier does not shorten the operation time if the start times of the tasks on the critical path remain unchanged.

Figure 12.

Comparison of average operation time with different algorithms.

Figure 13.

Comparison of average SLR with different algorithms.

Robust performance

The operation time has a great effect on scheduling algorithms, but it usually cannot be known in advance because of uncertainties in some applications. Thus, the algorithms need to be evaluated with inaccurate operation time estimates of workflows, which are chosen randomly from the following range when the mean interarrival time is set to 100 s⁴²

$Range = [1, UT \cdot (1 + 2 \cdot Uncertainty)]$ (6)

where UT is the EROT of the unit task. For instance, the EROT is chosen randomly from 1 to 500 when the uncertainty is 200% and UT is 100. Figures 14 and 15 further show the robust performance under system uncertainties.
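Equation (6) can be turned into a small sampling helper. In this sketch the uncertainty is passed as a fraction (2.0 for 200%), and the uniform distribution is an assumption, since the text only states that the value is chosen randomly. The function names are hypothetical.

```python
import random


def erot_range(ut, uncertainty):
    """Bounds of equation (6): [1, UT * (1 + 2 * Uncertainty)]."""
    return 1.0, ut * (1.0 + 2.0 * uncertainty)


def sample_erot(ut, uncertainty, rng=random):
    """Draw an inaccurate EROT from the range (uniform draw is an assumption)."""
    low, high = erot_range(ut, uncertainty)
    return rng.uniform(low, high)


print(erot_range(100, 2.0))  # (1.0, 500.0)
sample = sample_erot(100, 2.0, random.Random(0))  # seeded for repeatability
```

With UT = 100 and an uncertainty of 200%, the upper bound is 100 · (1 + 2 · 2) = 500, matching the example in the text.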

Figure 14.

Average operation time with uncertainties in different algorithms.

Figure 15.

Average SLR with uncertainties in different algorithms.

From the above figures, the MCMSP algorithm is superior to the others in terms of the average operation time and SLR indexes when the uncertainty varies from 100% to 500%. As the uncertainty grows, the average operation time and SLR both increase. The smaller the average operation time and SLR are, the better the system performance is. In short, the proposed algorithms improve the system performance, and the system robustness is also guaranteed.

Conclusion

In this article, a new framework of a big data platform in CSG is built, in which data integration and fusion play a key role in the system performance. The RR, MC, and MM algorithms are proposed to improve the workflow operation time and memory consumption. The MCM algorithm is developed based on these algorithms, and the workflow is divided into many subflows. The SSF and PB algorithms focus mainly on the waiting-queue scheduling of subflows and are integrated into the MCM algorithm. Experiments compare the proposed algorithms with one another and with existing algorithms and evaluate the robust performance of the system; the results show that the proposed MCMSP algorithm is the best among all evaluated algorithms and greatly improves the system performance.

In the future, we may focus on other parts of the PSM, such as task prioritizing, rearrangement, and allocation. In addition, much more data of different types will be used to evaluate the effectiveness and stability of the proposed algorithms.

Footnotes

Academic Editor: Teen-Hang Meen

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work of this paper is partially supported by the China Southern Power Grid key project (K-KY2014-035).

References

1. Fang, Misra, Xue GL. Smart grid – the new and improved power grid: a survey. IEEE Commun Surv Tutor 2012; 14: 944–980.

2. Diamantoulakis, Kapinas, Karagiannidis GK. Big data analytics for dynamic energy management in smart grids. Big Data Res 2015; 2: 94–101.

3. Shyam, Bharathi Ganesh, Sachin Kumar. Apache spark a big data analytics platform for smart grid. Proced Technol 2015; 21: 171–178.

4. Zhang, Tang, Zha. Application of advanced power electronics in smart grid. Proc CSEE 2010; 30: 1–7.

5. Wang, Zhou. Soft computing in big data intelligent transportation systems. Appl Soft Comput 2016; 38: 1099–1108.

6. Longo, Giacovelli, Bochicchio MA. Fact-centered ETL: a proposal for speeding business analytics up. Proced Technol 2014; 16: 471–480.

7. Teixeira, Annibal, Felipe. A similarity-based data warehousing environment for medical images. Comput Biol Med 2015; 66: 190–208.

8. Kang, Hong CH. A study on software architecture for effective BIM/GIS-based facility management data integration. Automat Constr 2015; 54: 25–38.

9. Inmon WH. Building the data warehouse. Wellesley, MA: QED Press; New York: John Wiley & Sons, 1992.

10. Solomon. Ensuring a successful data warehouse initiative. Inform Syst Manage 2005; 22: 26–36.

11. March, Hevner. Integrated decision support systems: a data warehousing perspective. Decis Support Syst 2007; 43: 1031–1043.

12. Vassiliadis, Simitsis, Terrovitis. Blueprints and measures for ETL workflows. In: Proceedings of the 24th international conference on conceptual modeling, Klagenfurt, 24–28 October 2005. Berlin: Springer.

13. Romero, Mazón, Trujillo. Quality of data warehouses. In: Liu, Tamer Özsu (eds) Encyclopedia of database systems. New York: Springer, 2009, pp.2230–2235.

14. Ali El-Sappagh, Ahmed Hendawi, El Bastawissy. A proposed model for data warehouse ETL processes. J King Saud Univ: Comput Inf Sci 2011; 23: 91–104.

15. Bergamaschi, Guerra, Orsini. A semantic approach to ETL technologies. Data Knowl Eng 2011; 70: 717–731.

16. Naghibzadeh. Modeling and scheduling hybrid workflows of tasks and task interaction graphs on the cloud. Future Gener Comp Sy 2016; 65: 33–45.

17. Sahni, Vidyarthi DP. Workflow-and-platform aware task clustering for scientific workflow execution in cloud environment. Future Gener Comp Sy 2016; 64: 61–74.

18. Kasahara, Narita. Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE T Comput 1984; 33: 1023–1029.

19. Shi, Bao, Liu. On task scheduling strategy in data warehouse system. Control Decis 2005; 20: 109–112.

20. Wang, Chen. Research on task scheduling method based on the greedy algorithm in ETL. Microelectron Comput 2009; 26: 130–133.

21. Han, Dong YS. Optimization of ETL execution by pipelining method. Mini-Micro Syst 2005; 26: 134–138.

22. Vasiliki, Panos, Alkis. Deciding the physical implementation of ETL workflows. In: International Workshop on Dolap, Lisbon, 9 November 2007, pp.49–56.

23. Santos, Belo. Modeling ETL data quality enforcement tasks using relational algebra operators. Proced Technol 2013; 9: 442–450.

24. Bush, Goel, Simard. IEEE vision for smart grid communications: 2030 and beyond roadmap. New York: IEEE Standard Association, 2013, pp.1–19.

25. Ta’a, Abdullah MS. Goal-ontology approach for modeling and designing ETL processes. Proced Comput Sci 2011; 3: 942–948.

26. Simitsis, Vassiliadis, Sellis TK. Optimizing ETL processes in data warehouses. In: Proceedings of the 21st international conference on data engineering (ICDE), Tokyo, Japan, 5–8 April 2005, pp.564–575. Washington, DC: IEEE Computer Society.

27. Bergamaschi, Sartori, Guerra. Extracting relevant attribute values for improved search. IEEE Internet Comput 2007; 11: 26–35.

28. Gandomi, Haider. Beyond the hype: big data concepts, methods, and analytics. Int J Inform Manage 2015; 35: 137–144.

29. Kim, Chung, Choi. Cost-based join processing scheme in a hybrid RDBMS and hive system. In: Proceedings of the international conference on big data and smart computing, Bangkok, Thailand, 15–17 January 2014. New York: IEEE.

30. Skoutas, Simitsis. Ontology-based conceptual design of ETL processes for both structured and semi-structured data. Int J Semant Web Inf 2007; 3: 1–24.

31. Fan, Han, Liu. Challenges of big data analysis. Natl Sci Rev 2014; 1: 293–314.

32. Ewen. Iterative parallel data processing with stratosphere: an inside look. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD), New York, 23–27 June 2013. New York: ACM.

33. Rabl, Jacobsen. Big data generation. In: Proceedings of the workshop on big data benchmarking, San Diego, CA, 8–9 May 2012. New York: Springer.

34. Murray DG. Naiad: a timely dataflow system. In: Proceedings of the 24th ACM symposium on operating system principles, Farmington, PA, 3–6 November 2013. New York: ACM.

35. Carney, Cetintemel, Rasin. Operator scheduling in a data stream manager. In: Proceedings of the 29th international conference on very large data bases (VLDB), Berlin, 12–13 September 2003, pp.838–849. New York: ACM.

36. Babcock, Babu, Datar. Chain: operator scheduling for memory minimization in data stream systems. In: Proceedings of the ACM international conference on management of data (SIGMOD), San Diego, CA, 9–12 June 2003, pp.253–264. New York: ACM.

37. Urhan, Franklin. Dynamic pipeline scheduling for improving interactive query performance. In: Proceedings of the 27th international conference on very large data bases (VLDB), Rome, 11–14 September 2001, pp.501–510. New York: ACM.

38. Pang, Wang, Cheng. Topological sorts on DAGs. Inform Process Lett 2015; 115: 298–301.

39. Fan, Zhang, Wang. An effective approximation algorithm for the malleable parallel task scheduling problem. J Parallel Distr Com 2012; 72: 693–704.

40. Epema DHJ, Naghibzadeh, Abrishami. Deadline-constrained workflow scheduling algorithms for Infrastructure as a Service Clouds. Future Gener Comp Sy 2013; 29: 158–169.

41. Dayal, Castellanos, Simitsis. Data integration flows for business intelligence. In: Proceedings of the 12th international conference on extending database technology: advances in database technology (EDBT), Saint-Petersburg, Russia, 23–26 March 2009, pp.1–11. New York: ACM.

42. Shi. A planner-guided scheduling strategy for multiple workflow applications. In: Proceedings of the international conference on parallel processing – workshops (ICPP-W), Portland, OR, 8–12 September 2008, pp.1–8. New York: IEEE.

43. Topcuoglu, Hariri. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE T Parall Distr 2002; 13: 260–274.

44. Mualem, Feitelson DG. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE T Parall Distr 2001; 12: 529–543.

45. Sinnen. Task scheduling for parallel systems. Hoboken, NJ: Wiley Interscience, 2007.


Research and realization of improved extract–transform–load scheduler in China Southern Power Grid
