
Abstract

The Internet of Things (IoT) operates solely on local interactions among its components, which include various devices with communication capabilities. Because the IoT is a fully distributed computing network, it is important to mitigate any negative effects of faults in its components and to provide sustainable services. This paper focuses on fault management of IoT services. In particular, it proposes a fault management scheme for the self-organizing software platform (SoSp), a platform on which IoT services connected to various IT devices are deployed. The proposed fault management scheme enables SoSp to provide situation-aware IoT services without loss of data and state.

1. Introduction

As a fully distributed computing network, the Internet of Things (IoT) operates solely on local interactions among its components. There is no central server to monitor its components, which include mobile devices, or to trace all the states of the local interactions among the components, because IoT interactions involve numerous ubiquitous devices and services. However, IoT services are expected to be reliable and continuously available, even when faults arise [1]. As more services provided in fixed data networks are adapted to the IoT in response to people's preference for mobile and ubiquitous services, expectations for QoS in the IoT are increasing [2]. Therefore, it is important to mitigate the negative effects of faults in IoT components and to provide sustainable IoT services in spite of any faults that arise [3]. In distributed computing domains, real-time applications are at greater risk of data loss and service interruptions due to faults in the domain.

Many IT devices associated with well-being are now being launched to satisfy the modern desire to manage one's health in everyday life. Marketing is aimed primarily at the elderly in an aging society, who often have devices at home such as sphygmomanometers, pill reminders, and biosignal measuring equipment such as thermometers and electrocardiogram monitors. The demand for health care services goes beyond visits to the family doctor and extends to prompt handling of life-threatening emergencies and periodic checking of one's health status. Current medical services are inconvenient for patients such as the disabled and the aged because they must visit hospitals for treatment despite having difficulty moving around. Thus, the ability to rapidly handle emergencies and perform routine health checks is essential.

The self-organizing software platform (SoSp) is an IoT platform that provides IoT services for health care and life convenience. The health care services deployed on the SoSp network, which are connected to well-being IT devices, quickly notify family members or doctors of an emergency when a person's condition becomes critical. Further, doctors can check the health status of their patients outside the hospital by remotely controlling the relevant devices via the SoSp network.

Printing services and resource-finding services based on user location are just two examples of the convenience provided by IoT services [4].

This paper focuses on failure handling in the SoSp domain, in which various IoT services are deployed. Because failure of the SoSp infrastructure can cause IoT services to stop and their critical states to be missed, fault monitoring and fault recovery in the SoSp are very important. Through fault management, the SoSp can provide sustainable services without loss of data and state.

The remainder of this paper is organized as follows: Section 2 discusses previous works related to fault management and IoT. Section 3 presents a brief review of the SoSp and its requirements for fault management of IoT services. The proposed fault management scheme applied to the SoSp is presented and analyzed in Section 4. The contribution is discussed in Section 5. Finally, concluding remarks and plans for future work are outlined in Section 6.

2. Related Work

Fault tolerance in multihop wireless networks is currently being studied [5, 6]. Research is also being conducted on localized geometrical structures that utilize the rich geometric properties of wireless networks [6–8] and on dynamic cluster techniques [9–13]. Further, a solution has been proposed to the problem of message loss and excessive delays during lookups in structured overlays [14]. Much of this research focuses on large-scale P2P networks and multihop wireless networks. In contrast, research on self-organizing networks that allow one-hop wireless communication is limited.

Research to address loss of coherence in the IoT, which arises from desynchronization between objects and can lead to errors and failures, is also actively underway. In one such research effort, an overlay of logical checkpoints at the application layer is used to define links between the coherent states of a set of objects and to trigger resynchronization messages [15]. However, although synchronization is very important in the IoT domain, the focus of this paper is on managing failures so as to provide continuous IoT services using small primary service groups, which are subsets of the IoT service groups in the SoSp domain.

Su et al. [1] proposed a fault tolerance mechanism for the WuKong middleware that uses a strip to store a list of duplicated services, with each service peer maintaining a consistent view of the duplicated services in the strip. However, the WuKong middleware is designed for sensor networks, and the WuKong master manages all WuDevices and their services. Further, the proposed fault tolerance mechanism considers only failure of the WuDevices, not failure of the master or the gateway.

In contrast, the scheme proposed in this paper copes jointly with failures of resources, self-organizing software platform routers (SoSpRs), and mobile devices.

3. Self-Organizing Software Platform for IoT Services

Only local interactions between nodes are defined in the SoSp domain. The states of the interactions or services in SoSp are neither traced nor managed, as in other distributed computing domains. As a result, SoSp is highly scalable and flexible. However, its fault detection and fault recovery capabilities are deficient. The SoSp platform can eventually detect faults in its components and recover by utilizing its self-organization and self-adaptation characteristics. However, it takes some time to search for replacements for the faulty components and resume the interrupted services. Consequently, the states of these services are lost during the ensuing time gap. Real-time services in particular can lose data and their service states.

Data is delivered via a pull/push publish/subscribe pattern [16] in the SoSp domain. Because the SoSp network is an IoT network, its nodes exchange requests and replies and deliver them to their destinations; here, a node is a device with communication capabilities. There are two types of core devices in SoSp: mobile SoSpRs and fixed SoSpRs (SoSpR is an abbreviation for SoSp-Router). As an identifiable node, a SoSpR can communicate with other SoSpRs and provide many services. Each fixed SoSpR is attached to the ceiling or a wall of a unit space and communicates with the devices in that unit space. It communicates with other fixed SoSpRs via the wired network, whereas it interacts with mobile SoSpRs, which are installed on mobile devices such as smartphones, by one-hop wireless communication. A mobile SoSpR is composed of an eight-bit MCU, 4 KB of SRAM, and a coin battery. A fixed SoSpR comprises an ARM Cortex-A8 MCU with 512 MB of SDRAM, an 8 GB SD card, and an IEEE 802.15.4 transceiver. A node is considered to have failed when it cannot communicate or when it can communicate but cannot provide the service it is serving.

A typical health care IoT service scenario is as follows. A person wearing a PAAR watch and a WBAN-Hub [17] reaches a state of emergency, which is transmitted to the medical team or the family doctor preassigned to the ER button on the PAAR watch. Assuming that the doctor requests the electrocardiography (ECG) signal of the person via the WBAN-Hub, the ECG signal is displayed in real time on a smartphone, tablet, or PC/TV through the streaming service installed in the SoSp network.

The IoT printing service scenario proceeds as follows. When a user with a communication-capable mobile device requests a printing service from a nearby printer, any printer in the vicinity can serve the request. However, during printing, the working printer, the SoSpR related to the printing service, or the mobile device that delivered the printing request may become faulty.

Figure 1 shows various IoT services in the SoSp domain: health management services S1 with R1, S4 with R4, S5 with R5, and S6 with R6, as well as life convenience services S2 with R2 and S3 with R3. The figure shows several services in the SoSp network with nodes, including mobile and fixed SoSpRs, and the resources related to the services. The services provided include real-time biosignal streaming and display, printing at a nearby printer, monitoring of patients or the disabled, and emergency notification and handling. The resources and the mobile nodes can move during the services, and their connections to the fixed SoSpRs can change.

Figure 1

IoT services in the SoSp domain.

Figure 2 shows the architecture of a fixed SoSpR. In a unit space, the fixed SoSpR communicates with mobile devices, robots, and other devices that request or provide services. A user can easily request any indoor location-based service (LBS) [18, 19] from the physical resources with a mobile device (such as a watch, smartphone, or smart pad) using its wireless communication function (e.g., WiFi, Bluetooth). The mobile devices or robots always communicate with the SoSpR in the unit space using the LIDx protocol [20, 21], which provides real-time localization and the ability to transfer asynchronous messages among numerous mobile devices and robots. The neighbor list is a list of neighboring fixed SoSpRs, which are represented as SRs in Figure 1. The binding list shows the services deployed in the SoSpR [18].

Figure 2

SoSpR SW architecture.

In the SoSp domain, rapid detection of faults and fault tolerant services without loss of data and state are very important. Because real-time services need service continuity, SoSp should support them by overcoming the faults. The fault manager in Figure 2 monitors and traces the states of the service being served and the entities related to the service. With fault recovery based on the information gained from monitoring, SoSp can provide seamless services and prevent loss of data and state.

4. Fault Monitoring and Fault Recovery on SoSp Domain

4.1. Definition

Services in the SoSp domain are defined by service agents. A service agent S is defined as a tuple S = (name, self-introduction, M, N, C, H), where name is the unique name of the service, self-introduction is a self-descriptive string for the service, M is a set of addresses of mobile nodes that subscribe to the service, N is a set of addresses of stationary nodes over which the service is distributed, C is a set of service contexts, and H is a service handler (name of the compiled service code) [22]. A context is a pair C = (T, V), where T is a nonempty set of terminologies and V is a nonempty set of values of types such as “bool,” “int,” “float,” “string,” and “date.” For example, a voting service can be represented as

$S = (\text{“voting”}, \text{“check status”}, [100, 101, 102], [\text{“R501”}, \text{“R502”}], [\{\text{“period”}, 30\}], \text{“voting.beam”}).$ (1)
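The tuple definition above maps naturally onto a small record type. The following is a minimal Python sketch, not the SoSp implementation: the names `ServiceAgent`, `Context`, and the field names are assumptions chosen to mirror the symbols in the definition, and each context is modeled as a (terminology, value) pair as in Equation (1).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple, Union

# A context entry pairs a terminology with a value, e.g. ("period", 30).
ContextValue = Union[bool, int, float, str, date]
Context = Tuple[str, ContextValue]


@dataclass
class ServiceAgent:
    """Hypothetical mirror of S = (name, self-introduction, M, N, C, H)."""
    name: str                                                   # unique service name
    self_introduction: str                                      # self-descriptive string
    mobile_nodes: List[int] = field(default_factory=list)       # M: subscribers
    stationary_nodes: List[str] = field(default_factory=list)   # N: hosting nodes
    contexts: List[Context] = field(default_factory=list)       # C: service contexts
    handler: str = ""                                           # H: compiled service code


# The voting service of Equation (1).
voting = ServiceAgent(
    name="voting",
    self_introduction="check status",
    mobile_nodes=[100, 101, 102],
    stationary_nodes=["R501", "R502"],
    contexts=[("period", 30)],
    handler="voting.beam",
)
```

Keeping `contexts` as a separate field reflects the point made in the text that C may be managed individually while the other fields are synchronized.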

S should be synchronized among the service agents in the same service cluster, except for the context C; depending on the requirements of the service, C can be managed individually. Changes to the list of mobile nodes that subscribe to S and to the list of fixed SoSpRs that deploy S are synchronized among the fixed SoSpRs on which S is installed. The service area of S can expand and contract dynamically according to the locations of the mobile nodes subscribing to S, as shown in Figure 3 [23].

Figure 3

Service group configuration in the SoSp domain.

Instances of the same service form a service group. In Figure 3, the SRs (which represent SoSpRs) in a service area form a service group S. A primary service group $S_M$, a subset of a service group, is constructed to guard against local failures by selecting several SRs in the service group based on the (mobile) devices that deliver the users' requests, as shown in Figure 4.

Figure 4

Primary service group for fault tolerant SoSp domain.

In Figure 5, each SR has its LIDx_List, which is a list of the mobile nodes that have registered with and bound to it. The state manager in Figure 2 manages all the states of the services in the primary groups to which its SR belongs. For example, the state of S3 related to M3, $S_T[S3_{M3}]$, is defined as follows:

$S_T[S3_{M3}] = (\text{“request”}, \text{“status”}, \text{“T”}, [M3], [\text{“SR10”}]).$ (2)

$S_T[S3_{M3}]$ shows that the “request” from M3 is served by the service deployed at SR10. If “request” is composed of several small tasks, r1, r2, and r3, “status” shows which task has been in service since time T. When a task finishes, “status” and “T” are updated. $S_T[S3_{M3}]$ is known to all the SRs in the primary group $S3_{M3}$ through synchronization until $S3_{M3}$ is disbanded. If a failure occurs, the SRs in $S3_{M3}$ check “status” and “T” and resume the stopped service within the primary service group.
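As a rough illustration of how a primary group member might use the shared state to resume a stopped service, the sketch below models $S_T$ as a record and derives the remaining tasks from “status”. The names `ServiceState`, `finish_task`, and `tasks_to_resume`, and the assumption that “status” holds the identifier of the task currently being served, are illustrative and are not taken from the SoSp implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceState:
    """Hypothetical mirror of S_T = (request, status, T, [mobile], [SR])."""
    request: List[str]   # ordered tasks that make up the request
    status: str          # task currently being served (assumed encoding)
    since: float         # T: time at which `status` started
    mobile: str          # requesting mobile node, e.g. "M3"
    server: str          # SR currently serving the request, e.g. "SR10"

    def finish_task(self, now: float) -> bool:
        """Advance to the next task; return False when the request is done."""
        i = self.request.index(self.status)
        if i + 1 == len(self.request):
            return False
        self.status, self.since = self.request[i + 1], now
        return True


def tasks_to_resume(state: ServiceState) -> List[str]:
    """Tasks a substitute SR would still have to serve after a failure."""
    return state.request[state.request.index(state.status):]


state = ServiceState(["r1", "r2", "r3"], "r1", 0.0, "M3", "SR10")
state.finish_task(now=1.5)       # r1 is done, r2 is now being served
print(tasks_to_resume(state))    # ['r2', 'r3']
```

In this model the interrupted task is served again from its start; whether a real service can restart a task partway through depends on the service and is outside the scope of the sketch.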

Figure 5

Fault monitoring and fault recovery scheme in the SoSp domain.

4.2. Assumptions

Figure 4 illustrates the faulty states of several IoT services in the SoSp network and the primary service group for ensuring sustainable services in the SoSp domain.

This paper assumes the following.

(i) Mobile nodes and resources are not considered to move fast and frequently enough to connect and disconnect to very many fixed SoSpRs in unit time [24].

(ii) Only fail-stop failures are considered; malicious failures are not considered.

4.3. Fault Monitoring and Fault Recovery by Primary Service Group

Figure 5 shows the proposed fault monitoring and fault recovery scheme. In the proposed scheme, a primary service group is set up with several fixed SoSpRs that provide the same service. In Figure 4, M3 connects to SR9, the fixed SoSpR closest to M3, and requests service S3 to print to a nearby printer. SR10 serves S3 because S3 has printer R3 installed. To provide fault tolerance for SR10, SR9, and R3, SR10 sets up a primary service group, $S3_{M3}$, with SR9 and SR6; SR9 and SR6 are chosen because they are sufficiently close to SR10 that S3 can recover from any failure. The members of $S3_{M3}$ share the service status and monitor any failure in $S3_{M3}$. The number of primary service group members varies according to the characteristics of the service. For example, $S3_{M3}$ has three members. SR6 can substitute for SR10 if R3 fails because SR6 can provide a print service using printer R7. However, if R4, R5, or R6 fails, there is no substitute for the resource. Therefore, S4, S5, and S6 consider only the failure of an SR; for example, $S4_{M1}$ is composed of SR11 and SR12.

Figure 5 illustrates the fault recovery process for a failure of SR9, R3, SR10, or M3. In $S3_{M3}$, SR10 provides S3 to M3 and, at the same time, shares the service status of $S3_{M3}$ with SR9 and SR6. Further, the members of $S3_{M3}$ monitor for any failure in the group.
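The substitution step described above can be sketched as follows. This is a simplified model under stated assumptions, not the scheme's actual protocol: each member SR advertises the set of services it can provide with its own resources, a resource failure is treated as the serving SR no longer being able to provide the service, and the first healthy member (in list order) that can provide the service takes over. The names `PrimaryGroup`, `report_failure`, and the `capabilities` map are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PrimaryGroup:
    service: str                        # e.g. "S3"
    server: str                         # SR currently serving, e.g. "SR10"
    members: List[str]                  # all SRs in the primary group
    capabilities: Dict[str, Set[str]]   # SR -> services it can provide
    failed: Set[str] = field(default_factory=set)

    def report_failure(self, node: str) -> Optional[str]:
        """Record a fail-stop failure; return the new server if one is needed."""
        self.failed.add(node)
        if node != self.server:
            return None                 # the serving SR is unaffected
        for sr in self.members:
            can_serve = self.service in self.capabilities.get(sr, set())
            if sr not in self.failed and can_serve:
                self.server = sr        # the substitute takes over
                return sr
        raise RuntimeError(f"no substitute available for {self.service}")


# S3_M3 from the example: SR10 serves S3 with printer R3; SR6 can print via R7.
# SR9 is assumed here to have no printer of its own.
group = PrimaryGroup(
    service="S3",
    server="SR10",
    members=["SR10", "SR9", "SR6"],
    capabilities={"SR10": {"S3"}, "SR6": {"S3"}},
)
print(group.report_failure("SR10"))   # SR6
```

A group for a service whose resource has no substitute, such as S4 with R4, would simply list the same capability for every member SR, so only SR failures could be masked.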

4.4. Analysis

The proposed scheme is expected to be efficient and simple to implement.

The proposed scheme focuses on how to cope with faults in IoT services. Services are made fault tolerant by replacing faulty things, as soon as possible, with normal ones nearby that have equivalent capabilities. The implementation will concentrate on how to find substitutes quickly and to continue providing services transparently in IoT environments.

Because the primary service group manages the states of IoT services in a simple way, the proposed scheme does not require a large amount of memory or computing resources. Because IoT services such as printing in a unit space have a short interval between a request and its response, little memory is needed to keep the printing states for handling a fault. After the request is served, the saved states are deleted because they are no longer needed. The primary group can be set up when the IoT service is deployed and then activated or deactivated according to the service requests; there is no need to set up and disband a primary service group at every request/response.

SoSpRs, on which the IoT services are deployed, have enough memory to save the service states temporarily and enough processing capacity to execute the proposed scheme across nearby unit spaces.

On the other hand, communication overhead may occur in checking the state of IoT services and in synchronizing the state information among the members of a primary service group. The overhead of checking the current status of the IoT services may be proportional to the number of states. If only a few states are managed, the overhead would be negligible. When the number of service states increases, the state-check messages can be piggybacked on messages such as the LIDx beacon and its response [18]. The communication overhead of the synchronization process is borne by the SoSpRs. The synchronization messages can also be absorbed into the normal communication between SoSpRs, which communicate with each other frequently to configure the distribution of IoT devices in their unit spaces.
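To make the piggybacking idea concrete, the sketch below attaches pending state-check entries to an outgoing beacon instead of sending them as separate messages. The message layout, the `MAX_PIGGYBACK` limit, and the function name `build_beacon` are assumptions for illustration only; they do not describe the actual LIDx beacon format.

```python
from collections import deque
from typing import Deque, Dict, List

MAX_PIGGYBACK = 4  # assumed per-beacon budget for state-check entries


def build_beacon(sr_id: str, pending: Deque[Dict]) -> Dict:
    """Build a beacon carrying up to MAX_PIGGYBACK pending state checks."""
    carried: List[Dict] = []
    while pending and len(carried) < MAX_PIGGYBACK:
        carried.append(pending.popleft())
    return {"type": "beacon", "from": sr_id, "state_checks": carried}


# Six pending state checks are spread over two consecutive beacons.
pending = deque({"group": "S3_M3", "task": f"r{i}"} for i in range(1, 7))
first = build_beacon("SR10", pending)
second = build_beacon("SR10", pending)
print(len(first["state_checks"]), len(second["state_checks"]))  # 4 2
```

Under this model no extra messages are sent; the cost of a growing number of states shows up as larger beacons and as a delay of a few beacon periods before every state has been checked.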

5. Contributions

(i) A fault monitoring and fault recovery scheme for SoSp is proposed that prevents abnormal service termination and service discontinuity due to the failure of SoSpRs, mobile devices, and various resources. On its own, SoSp, a distributed IoT computing platform, cannot recognize fault states quickly or react quickly enough to minimize loss of data or service state. The proposed fault-tolerant scheme for the SoSp domain provides tolerance suited to the type of fault and ensures seamless service so as to enhance the QoS of SoSp services.

(ii) Primary service groups are temporarily constructed based on the local interactions between mobile entities and SoSpR entities. The primary service group provides fault tolerance optimized for SoSp and does not affect the scalability or flexibility of SoSp. The grouping saves time and energy in providing continuous IoT services in spite of faults [25].

(iii) The service reliability enhanced by the proposed fault tolerance scheme shows that SoSp can be an effective infrastructure for IoT services such as health care services for the aged, patients, and the disabled, as well as life convenience services.

(iv) The primary service group adapts to state changes of the group members or the service. When the primary service group is no longer needed, it is disbanded and the information that it maintained is deleted. Because only the fault recovery information is kept in the group, the group does not require a large amount of memory.

6. Conclusion and Future Work

This paper focused on the service discontinuity and abnormal service termination that may occur when the SoSp infrastructure on which IoT services are installed fails. To address this, a fault monitoring and recovery scheme was proposed that ensures rapid recognition of fault status and quick reaction to minimize loss of data and service states. In the proposed scheme, a primary service group monitors all the entities related to the locally served service that can fail. The primary service group adapts to state changes of the group members or the service. Because only the information for fault recovery is kept in the group temporarily, the group does not require a large amount of memory.

In the future, the proposed fault-tolerant structure will be implemented in SoSp. In addition, QoS enhancement of the services connected to fast-moving entities in SoSp will be studied with a view to ensuring more reliable services in the SoSp domain.

Footnotes

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was financially supported by the Ministry of Education (MOE) and National Research Foundation of Korea (NRF) through the Human Resource Training Project for Regional Innovation (no. 2013H1B8A2032298).

References

1. Su P. H., Shih C.-S., Hsu J. Y.-J., Lin K.-J., Wang Y.-C. Decentralized fault tolerance mechanism for intelligent IoT/M2M middleware. Proceedings of the IEEE World Forum on Internet of Things (WF-IoT ’14), March 2014, pp. 45–50. doi:10.1109/wf-iot.2014.6803115

2. Peng. A secure network for mobile wireless service. Journal of Information Processing Systems, 2013, 9(2), 247–258. doi:10.3745/JIPS.2013.9.2.247

3. Malkawi M. I. The art of software systems development: reliability, availability, maintainability, performance (RAMP). Human-Centric Computing and Information Sciences, 2013, 3, article 22. doi:10.1186/2192-1962-3-22

4. Cho, Choi. Personal mobile album/diary application development. Journal of Convergence, 2014, 5(1), 32–37.

5. Guerraoui, Handurukande S. B., Huguenin, Kermarrec A.-M., le Fessant, Riviere. Gosskip, an efficient, fault-tolerant and self organizing overlay using gossip-based construction and skip-lists principles. Proceedings of the IEEE International Conference on Peer-to-Peer Computing, 2006, Cambridge, UK, pp. 12–22.

6. Wang, Cao, Dahlberg T. A., Shi. Self-organizing fault-tolerant topology control in large-scale three-dimensional wireless networks. ACM Transactions on Autonomous and Adaptive Systems, 2009, 4(3), article 19. doi:10.1145/1552297.1552302

7. Halpern J. Y., Bahl, Wang Y.-M., Wattenhofer. A cone-based distributed topology-control algorithm for wireless multi-hop networks. IEEE/ACM Transactions on Networking, 2005, 13(1), 147–159. doi:10.1109/tnet.2004.842229

8. Wattenhofer, Bahl, Wang Y.-M. Distributed topology control for power efficient operation in multihop wireless ad hoc networks. Proceedings of the 20th Annual Joint Conference of the IEEE Computer and Communications Societies (InfoCom ’01), April 2001, pp. 1388–1397.

9. Das, Bharghavan. Routing in ad-hoc networks using minimum connected dominating sets. Proceedings of the IEEE International Conference on Communications (ICC ’97), June 1997, pp. 376–380.

10. Stojmenovic, Seddigh, Zunic. Dominating sets and neighbor elimination-based broadcasting algorithms in wireless networks. IEEE Transactions on Parallel and Distributed Systems, 2002, 13(1), 14–25. doi:10.1109/71.980024

11. Alzoubi, X.-Y., Wang, Wan P.-J., Frieder. Geometric spanners for wireless ad hoc networks. IEEE Transactions on Parallel and Distributed Systems, 2003, 14(4), 408–421. doi:10.1109/TPDS.2003.1195412

12. Bao, Garcia-Luna-Aceves J. J. Topology management in ad hoc networks. Proceedings of the 4th ACM International Symposium on Mobile Ad Hoc Networking and Computing (MobiHoc ’03), June 2003, pp. 129–140.

13. Wang, X.-Y. Efficient distributed low-cost backbone formation for wireless networks. IEEE Transactions on Parallel and Distributed Systems, 2006, 17(7), 681–693. doi:10.1109/TPDS.2006.86

14. Galuba, Aberer, Despotovic, Kellerer. Self-organized fault-tolerant routing in peer-to-peer overlays. Proceedings of the 3rd IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO ’09), September 2009, pp. 30–39. doi:10.1109/saso.2009.14

15. Cherrier, Ghamri-Doudane Y. M., Lohier, Roussel. Fault-recovery and coherence in internet of things choreographies. Proceedings of the IEEE World Forum on Internet of Things (WF-IoT ’14), March 2014, IEEE, pp. 532–537. doi:10.1109/wf-iot.2014.6803224

16. Ibrahim, Mohammad, Alagar. Publishing and discovering context-dependent services. Human-Centric Computing and Information Sciences, 2013, 3(1), article 1. doi:10.1186/2192-1962-3-1

17. Kang H.-Y., Jeong S.-Y., Ahn C.-S., Park Y.-J., Kang S.-J. Self-organizing middleware platform based on overlay network for real-time transmission of mobile patients vital signal stream. The Journal of Korea Information and Communications Society, 2013, 38(7), 630–642.

18. Jeong S. Y., H. G., Kang S. J. Remote service discovery and binding architecture for soft real-time QoS in indoor location-based service. Journal of Systems Architecture, 2014, 60(9), 741–756. doi:10.1016/j.sysarc.2014.01.008

19. Kim, Chang. A grid-based cloaking area creation scheme for continuous LBS queries in distributed systems. Journal of Convergence, 2013, 4(1), 23–30.

20. Lee D.-K., Kim T.-H., Jeong S.-Y., Kang S.-J. A three-tier middleware architecture supporting bidirectional location tracking of numerous mobile nodes under legacy WSN environment. Journal of Systems Architecture, 2011, 57(8), 735–748. doi:10.1016/j.sysarc.2011.05.004

21. Gohar, Koh S. J. A network-based handover scheme in HIP-based mobile networks. Journal of Information Processing Systems, 2013, 9(4), 651–659. doi:10.3745/jips.2013.9.4.651

22. Kim T. H., H. G., Jeong S. Y., Kang S. J. A middleware architecture for dynamic reconfiguration of agent collaboration spaces in indoor location-aware applications. International Journal of Distributed Sensor Networks, 2014, vol. 2014, article ID 782928, 18 pages. doi:10.1155/2014/782928

23. Hong, Chang. A new k-NN query processing algorithm based on multicasting-based cell expansion in location-based services. Journal of Convergence, 2013, 4(2).

24. Lim, Lee. A simulation model of object movement for evaluating the communication load in networked virtual environments. Journal of Information Processing Systems, 2013, 9(3), 489–498. doi:10.3745/JIPS.2013.9.3.489

25. Sinha, Lobiyal D. K. Performance evaluation of data aggregation for cluster-based wireless sensor network. Human-Centric Computing and Information Sciences, 2013, 3, article 13.

Design of a Situation Aware Service for Internet of Things
