Abstract
1. Introduction
As a fully distributed computing network, the Internet of Things (IoT) operates solely on local interactions among its components. Because IoT interactions involve numerous ubiquitous devices and services, there is no central server to monitor its components, which include mobile devices, or to trace all the states of the local interactions among them. Nevertheless, IoT services are expected to be reliable and continuously available, even when faults arise [1]. As more services from fixed data networks are adapted to the IoT in response to people's preference for mobile and ubiquitous services, expectations for quality of service (QoS) in the IoT are increasing [2]. Therefore, it is important to mitigate the negative effects of faults in IoT components and to provide sustainable IoT services in spite of any faults that arise [3]. In distributed computing domains, real-time applications are at greater risk of data loss and service interruptions due to faults in the domain.
Many IT devices associated with well-being are now being launched to satisfy the modern desire to manage health properly in everyday life. Marketing is aimed primarily at the aged in an aging society, who often have devices at home such as sphygmomanometers, pill reminders, and biosignal measuring equipment such as thermometers and electrocardiogram monitors. The demand for health care services goes beyond visits to the family doctor: it extends to life-saving handling of emergencies and periodic checking of one's health status. Current medical services are inconvenient for patients such as the disabled and the aged because they must visit hospitals for treatment despite their difficulty in moving around. Thus, the ability to handle both emergencies and ordinary health checks rapidly is essential.
The self-organizing software platform (SoSp) is an IoT platform that provides IoT services for health care and life convenience. The health care services deployed on the SoSp network, which are connected to well-being IT devices, quickly notify family members or doctors of an emergency when a person's condition is in danger of becoming critical. Further, doctors can check the health status of their patients outside the hospital by remotely controlling the relevant devices via the SoSp network.
Printing services and resource-finding services based on user location are just two examples of the convenience provided by IoT services [4].
This paper focuses on failure handling in the SoSp domain, in which various IoT services are deployed. Because failure of the SoSp infrastructure can cause IoT services to stop and their critical states to be missed, fault monitoring and fault recovery in the SoSp are very important. Through fault management, the SoSp can provide sustainable services without loss of data and state.
The remainder of this paper is organized as follows: Section 2 discusses previous works related to fault management and IoT. Section 3 presents a brief review of the SoSp and its requirements for fault management of IoT services. The proposed fault management scheme applied to the SoSp is presented and analyzed in Section 4. The contribution is discussed in Section 5. Finally, concluding remarks and plans for future work are outlined in Section 6.
2. Related Work
Fault tolerance in multihop wireless networks is currently being studied [5, 6]. Research is also being conducted on localized geometrical structures that utilize the rich geometric properties of wireless networks [6–8] and on dynamic cluster techniques [9–13]. Further, a solution has been proposed to the problem of message loss and excessive delays during lookups in structured overlays [14]. Much of this research focuses on large-scale P2P networks and multihop wireless networks. In contrast, research on self-organizing networks that allow one-hop wireless communication is limited.
Research to address loss of coherence in the IoT, which arises from desynchronization between objects and can lead to errors and failures, is also actively underway. In one such effort, an overlay of logical checkpoints at the application layer is used to define links between the coherent states of a set of objects and to trigger resynchronization messages [15]. Although synchronization is very important in the IoT domain, the focus of this paper is on managing failures so as to provide continuous IoT services using small primary service groups, which are subsets of the IoT service groups in the SoSp domain.
Su et al. [1] proposed a fault tolerance mechanism for the WuKong middleware that uses a strip to store a list of duplicated services, with each service peer maintaining a consistent view of the duplicated services in the strip. However, the WuKong middleware is designed for sensor networks, and the WuKong master manages all WuDevices and their services. Further, the proposed fault tolerance mechanism considers only failure of the WuDevices, not failure of the master or the gateway.
The scheme proposed in this paper copes with failures in resources, self-organizing software platform Routers (SoSpRs), and mobile devices together.
3. Self-Organizing Software Platform for IoT Services
Only local interactions between nodes are defined in the SoSp domain. The states of the interactions or services in SoSp are neither traced nor managed, as in other distributed computing domains. As a result, SoSp is highly scalable and flexible. However, its fault detection and fault recovery capabilities are deficient. The SoSp platform can eventually detect faults in its components and recover by utilizing its self-organization and self-adaptation characteristics. However, it takes some time to search for replacements for the faulty components and to resume the services that were interrupted. Consequently, the states of these services are lost during the ensuing time gap. Real-time services in particular can lose data and their service states.
Data is delivered via a pull/push publish/subscribe pattern [16] in the SoSp domain. Because the SoSp network is an IoT network, its nodes exchange requests and replies and forward them to their destinations; here, a node is a device with communication capabilities. There are two types of core devices in SoSp: the mobile SoSpR and the fixed SoSpR (SoSpR is an abbreviation for SoSp-Router). As an identifiable node, a SoSpR can communicate with other SoSpRs and provide many services. Each fixed SoSpR is attached to the ceiling or a wall of a unit space and communicates with the devices in that unit space. It communicates with other fixed SoSpRs via the wired network, whereas it interacts with mobile SoSpRs, which are installed on mobile devices such as smartphones, by one-hop wireless communication. A mobile SoSpR is composed of an eight-bit MCU, 4 KB of SRAM, and a coin battery. A fixed SoSpR comprises an Arm Cortex-A8 MCU with 512 MB of SDRAM, an 8 GB SD card, and an IEEE 802.15.4 transceiver. A node failure is a state in which the node cannot communicate, or in which it can communicate but cannot provide the service it is serving.
A typical health care IoT service scenario is as follows. A person wearing a PAAR watch and a WBAN-Hub [17] reaches a state of emergency, which is transmitted to the medical team or the family doctor preassigned to the ER button on the PAAR watch. If the doctor requests the person's electrocardiography (ECG) signal via the WBAN-Hub, the ECG signal is displayed in real time on a smartphone, tablet, or PC/TV through the streaming service installed in the SoSp network.
The IoT printing service scenario proceeds as follows. When a user with a mobile device that has communication capabilities requests a printing service from a nearby printer, any printer in the vicinity can serve the request. However, during printing, the working printer, the SoSpR related to the printing service, or the mobile device that delivered the printing request may become faulty.
Figure 1 shows various IoT services in the SoSp domain: health management services S1 with R1, S4 with R4, S5 with R5, and S6 with R6, as well as life convenience services S2 with R2 and S3 with R3. The figure shows several services in the SoSp network together with the nodes, including mobile and fixed SoSpRs, and the resources related to the services. The services provided include real-time biosignal streaming and display, printing at a nearby printer, monitoring of patients or the disabled, and emergency notification and handling. The resources and the mobile nodes can move while the services are being provided, and their connections to the fixed SoSpRs can change accordingly.
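The two-part definition of node failure above can be made concrete with a small sketch. The class and field names below are hypothetical and are not part of the SoSp implementation; the sketch only assumes that a monitor can observe whether a node answers on the network and whether it is still serving.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    COMM_FAILURE = "cannot communicate"
    SERVICE_FAILURE = "can communicate but cannot serve"


@dataclass
class NodeStatus:
    # Hypothetical observations made by a monitor; not SoSp API fields.
    reachable: bool  # does the node answer on the network?
    serving: bool    # is it still providing the service it was serving?


def classify(status: NodeStatus) -> NodeState:
    """Map an observed status onto the failure definition in the text."""
    if not status.reachable:
        return NodeState.COMM_FAILURE
    if not status.serving:
        return NodeState.SERVICE_FAILURE
    return NodeState.HEALTHY


def is_failed(status: NodeStatus) -> bool:
    """A node counts as failed in either of the two failure states."""
    return classify(status) is not NodeState.HEALTHY
```

Keeping the two failure cases distinct may be useful because a reachable node that cannot serve can still report its own state, whereas an unreachable node cannot.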

Figure 1: IoT services in the SoSp domain.
Figure 2 shows the architecture of a fixed SoSpR. In a unit space, the fixed SoSpR communicates with mobile devices, robots, and other devices that request or provide services. A user can easily request any indoor location-based service (LBS) [18, 19] from the physical resources with a mobile device (such as a watch, smartphone, or smart pad) using its wireless communication function (e.g., WiFi, Bluetooth). The mobile devices and robots always communicate with the SoSpR in the unit space using the LIDx protocol [20, 21], which provides real-time localization and the ability to transfer asynchronous messages among numerous mobile devices and robots. The neighbor list is a list of neighboring fixed SoSpRs, which are represented as SRs in Figure 1. The binding list shows the services deployed in the SoSpR [18].

Figure 2: SoSpR SW architecture.
In the SoSp domain, rapid detection of faults and fault tolerant services without loss of data and state are very important. Because real-time services need service continuity, SoSp should support them by overcoming the faults. The fault manager in Figure 2 monitors and traces the states of the service being served and the entities related to the service. With fault recovery based on the information gained from monitoring, SoSp can provide seamless services and prevent loss of data and state.
4. Fault Monitoring and Fault Recovery on SoSp Domain
4.1. Definition
Services in the SoSp domain are defined by service agents. A service agent S is defined as a tuple S = (name, self-introduction, M, N, C, H), where name is the unique name of the service, self-introduction is a self-descriptive string for the service, M is a set of addresses of mobile nodes that subscribe to the service, N is a set of addresses of stationary nodes over which the service is distributed, C is a set of service contexts, and H is a service handler (name of the compiled service code) [22]. A context is a pair C = (T, V), where T is a nonempty set of terminologies and V is a nonempty set of values such as “bool,” “int,” “float,” “string,” and “date.” Consequently, a voting service can be represented as
S should be synchronized among the service agents in the same service cluster, except for the context C; depending on the requirements of the service, context C can be managed individually. Changes in the list of mobile nodes that subscribe to S and in the list of fixed SoSpRs that deploy S are synchronized between the fixed SoSpRs on which S is installed. The service area of S can expand and contract dynamically according to the locations of the mobile nodes subscribing to S, as shown in Figure 3 [23].

Figure 3: Service group configuration in the SoSp domain.
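As a rough illustration of the service agent tuple (name, self-introduction, M, N, C, H) and of the rule that everything except the context C is synchronized, the following sketch may help. It assumes that contexts can be held as a mapping from terminology to value and that C is managed individually; the method name `sync_from` and all addresses and values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ServiceAgent:
    name: str                                           # unique service name
    self_introduction: str                              # self-descriptive string
    M: Set[str] = field(default_factory=set)            # subscribed mobile nodes
    N: Set[str] = field(default_factory=set)            # fixed SoSpRs deploying S
    C: Dict[str, object] = field(default_factory=dict)  # contexts (assumed mapping)
    H: str = ""                                         # name of compiled service code

    def sync_from(self, peer: "ServiceAgent") -> None:
        """Copy the shared fields from a peer; leave the local context C as is."""
        if peer.name != self.name:
            raise ValueError("agents belong to different services")
        self.self_introduction = peer.self_introduction
        self.M = set(peer.M)
        self.N = set(peer.N)
        self.H = peer.H


# Illustrative values only; addresses and handler name are made up.
local = ServiceAgent("printing", "prints at a nearby printer",
                     N={"SR_a"}, C={"queued_jobs": 1}, H="print_handler")
peer = ServiceAgent("printing", "prints at a nearby printer",
                    M={"M_x"}, N={"SR_a", "SR_b"}, C={"queued_jobs": 0},
                    H="print_handler")
local.sync_from(peer)
```

After the call, `local` shares the subscriber and deployment lists of `peer` but keeps its own context, which matches the case where C is managed individually.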
Instances of the same service form a service group. In Figure 3, the SRs, which represent the SoSpRs in a service area, form service group S. As a subset of a service group, a primary service group

Figure 4: Primary service group for fault tolerant SoSp domain.
In Figure 5, each SR has its LIDx_List, which is a list of the mobile nodes that register with and bind to it. The state manager in Figure 2 manages all the states of the services in the primary groups to which its SR belongs. For example, the state of S3 related to M3,

Figure 5: Fault monitoring and fault recovery scheme in the SoSp domain.
4.2. Assumptions
Figure 4 illustrates the faulty states of several IoT services in the SoSp network and the primary service group for ensuring sustainable services in the SoSp domain.
This paper assumes the following.
Mobile nodes and resources do not move so fast or so frequently that they connect to and disconnect from a large number of fixed SoSpRs per unit time [24].
Only fail-stop failures are considered; malicious failures are not.
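Under the fail-stop assumption, a failed entity simply falls silent, so silence for longer than some bound can be treated as a sign of failure. The sketch below illustrates this with a timeout-based detector. It is an assumption for illustration only: the class, its method names, and the timeout value are hypothetical and do not describe the detection mechanism of SoSp.

```python
class FailStopDetector:
    """Suspect an entity once nothing has been heard from it for `timeout` seconds."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # entity name -> time of last message

    def heard_from(self, entity: str, now: float) -> None:
        # Any message from the entity counts as evidence that it is alive.
        self.last_seen[entity] = now

    def suspected(self, now: float) -> list:
        # Entities that have been silent for longer than the timeout.
        return sorted(e for e, t in self.last_seen.items()
                      if now - t > self.timeout)


# Illustrative timeline; entity names and times are made up.
detector = FailStopDetector(timeout=3.0)
detector.heard_from("SR_a", now=0.0)
detector.heard_from("SR_b", now=2.0)
silent = detector.suspected(now=4.0)
```

Here `SR_a` has been silent for 4 seconds and is suspected, while `SR_b` has been silent for only 2 seconds. A timeout can only raise a suspicion: a slow but healthy node may be suspected wrongly, so the bound has to be chosen with the expected message interval in mind.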
4.3. Fault Monitoring and Fault Recovery by Primary Service Group
Figure 5 shows the proposed fault monitoring and fault recovery scheme. In the proposed scheme, a primary service group is set up with several fixed SoSpRs that provide the same service. In Figure 4, M3 connects to SR9, the fixed SoSpR closest to M3, and requests service S3 to print at a nearby printer. SR10 serves S3 because S3 has printer R3 installed. To provide fault tolerance for SR10, SR9, and R3, SR10 sets up a primary service group,
Figure 5 illustrates the fault recovery process for any failure of SR9, R3, SR10, and M3. In
4.4. Analysis
The expected strengths of the proposed scheme are its efficiency and its simple implementation.
The proposed scheme focuses on how to cope with faults in IoT services. The services become fault tolerant by replacing faulty entities, as soon as possible, with healthy nearby entities that have equivalent capabilities. The implementation will concentrate on finding substitutes quickly and continuing to provide the services transparently in IoT environments.
With simple state management of IoT services by the primary service group, the proposed scheme does not require a large amount of memory or computing resources. Because IoT services such as printing in a unit space have a short interval between a request and its response, little memory is needed to keep the printing states required to handle a faulty situation. After the request is served, the saved states are deleted because they are no longer needed. The primary group can be set up when the IoT service is deployed and then activated or deactivated according to service requests, so there is no need to set up and disband a primary service group for every request/response.
The SoSpRs on which the IoT services are deployed have enough memory to save the service states temporarily and enough processing capacity to execute the proposed scheme for nearby unit spaces.
On the other hand, communication overhead may arise from checking the state of an IoT service and from synchronizing state information between the members of a primary service group. The overhead of checking the current status of the IoT services may be proportional to the number of states. If the states to be managed are few, the overhead is negligible. When the number of service states increases, the state-check messages can be piggybacked on existing messages such as the LIDx beacon and its response [18]. The communication overhead of the synchronization process is borne by the SoSpRs. The synchronization messages can also be absorbed into the normal communication between SoSpRs, which communicate with each other frequently to configure the distribution of IoT devices in their unit spaces.
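The substitution idea can be sketched as choosing the nearest healthy entity with the same capability as the failed one. The sketch below is a minimal illustration under stated assumptions: the `Resource` fields, the use of a numeric distance, and the function name are all hypothetical, and the text does not specify how SoSp ranks candidates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resource:
    name: str
    capability: str   # e.g. "print"
    distance: float   # hypothetical distance from the requester
    healthy: bool = True


def find_substitute(failed: Resource,
                    candidates: List[Resource]) -> Optional[Resource]:
    """Return the nearest healthy candidate with an equivalent capability."""
    usable = [c for c in candidates
              if c.healthy
              and c.capability == failed.capability
              and c.name != failed.name]
    # min() with default=None returns None when no candidate qualifies.
    return min(usable, key=lambda c: c.distance, default=None)


# Illustrative data; names and distances are made up.
failed_printer = Resource("R_a", "print", 2.0, healthy=False)
nearby = [
    failed_printer,
    Resource("R_b", "print", 9.0),
    Resource("R_c", "print", 5.0),
    Resource("R_d", "display", 1.0),
]
substitute = find_substitute(failed_printer, nearby)
```

In this example `R_c` is chosen: `R_d` is closer but offers a different capability, and `R_b` is farther away.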
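The state lifecycle described above, in which recovery state is kept only while a request is outstanding and the group is activated and deactivated rather than rebuilt, might look like the following sketch. The class, its methods, and the member and request names are hypothetical and are not taken from the SoSp implementation.

```python
class PrimaryServiceGroup:
    """Set up once at deployment; holds recovery state only for open requests."""

    def __init__(self, service: str, members):
        self.service = service
        self.members = list(members)  # fixed at deployment in this sketch
        self.active = False
        self.states = {}              # request id -> recovery information

    def on_request(self, request_id: str, recovery_info: dict) -> None:
        # Activate the group and remember what is needed to resume the request.
        self.active = True
        self.states[request_id] = recovery_info

    def on_response(self, request_id: str) -> None:
        # The request has been served, so its saved state is no longer needed.
        self.states.pop(request_id, None)
        if not self.states:
            self.active = False       # deactivate, but keep the membership


# Illustrative usage; member names and recovery fields are made up.
group = PrimaryServiceGroup("printing", ["SR_a", "SR_b"])
group.on_request("job-1", {"pages_done": 0, "requester": "M_x"})
active_during_job = group.active
group.on_response("job-1")
```

Memory use in this sketch is bounded by the number of outstanding requests, and the membership survives deactivation, so nothing has to be rebuilt for the next request.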
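Piggybacking can be illustrated by attaching pending state checks to a periodic beacon that is sent anyway, so that no separate state-check message is needed. The sketch below assumes a made-up dictionary format and a made-up per-beacon limit; it does not reflect the actual LIDx beacon layout, which is not described here.

```python
def build_beacon(node_id: str, pending_checks: list, max_piggyback: int = 4):
    """Attach up to `max_piggyback` state checks to a beacon.

    Returns the beacon and the checks that did not fit and must wait
    for the next beacon. The message format is hypothetical.
    """
    attached = pending_checks[:max_piggyback]
    leftover = pending_checks[max_piggyback:]
    beacon = {"type": "beacon", "node": node_id, "state_checks": attached}
    return beacon, leftover


# Five pending state checks with room for four per beacon.
beacon, leftover = build_beacon("M_x", ["s1", "s2", "s3", "s4", "s5"])
```

In this sketch the first four checks travel with the current beacon and the fifth waits for the next one, so the number of extra messages stays at zero at the cost of some added delay when many states are pending.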
5. Contributions
A fault monitoring and fault recovery scheme for SoSp is proposed that prevents abnormal service termination and service discontinuity due to the failure of SoSpRs, mobile devices, and various resources. By itself, SoSp, a distributed IoT computing platform, cannot recognize fault states quickly or react quickly enough to minimize the loss of data or service state. The proposed fault tolerant scheme for the SoSp domain provides tolerance suited to the type of fault and ensures seamless service so as to enhance the QoS of SoSp services. Primary service groups are temporarily constructed based on the local interactions between mobile entities and SoSpR entities. The service group provides fault tolerance optimized for SoSp and does not affect the scalability or the flexibility of SoSp. The grouping saves time and energy in providing continuous IoT services in spite of faults [25]. The service reliability gained through the proposed fault tolerance scheme indicates that SoSp can be an effective infrastructure for IoT services such as health care services for the aged, patients, and the disabled, as well as life convenience services. The primary service group adapts to state changes of the group members or the service. When the primary service group is no longer needed, it is disbanded and the information it maintained is deleted. Because only the fault recovery information is kept in the group, the group does not require a large amount of memory.
6. Conclusion and Future Work
This paper focused on the service discontinuity and abnormal service termination that may occur due to failure of the SoSp infrastructure on which IoT services are installed. To address this problem, a fault monitoring and recovery scheme that ensures rapid recognition of fault status and quick reaction to minimize loss of data and service states was proposed. In the proposed scheme, a primary service group monitors all the entities that are related to a locally served service and that can fail. The primary service group adapts to state changes of the group members or the service. Because only the information for fault recovery is kept in the group temporarily, the group does not require a large amount of memory.
In the future, the proposed fault tolerant structure will be implemented in SoSp. In addition, QoS enhancement for services connected to fast-moving entities in SoSp will be studied with a view to ensuring more reliable services in the SoSp domain.
