Abstract
In distributed object systems, application-level fault tolerance is often attained through appropriate object replication policies. These policies aim at increasing the exhibited service availability by masking potential faults that do not recur after recovery. Existing middleware support infrastructures allow customizing object replication properties. However, since fault tolerance has a significant impact on the perceived service performance, there is a need for a suitable quantitative design technique that allows comparing different replication policies by trading off the incurred overhead cost against the achieved fault-tolerance effectiveness. We are also interested in taking different concerns into account in a combined manner (e.g., fault tolerance combined with load balancing and multithreading). This paper presents experimental evidence for the most important performance tradeoffs revealed in a simulation-based study. We considered different cases of object request loss behavior for the faulty objects, as well as a number of request-retry strategies. The experiments were conducted at two different application workload levels for varied fault-detection settings. We also provide results for the combined effects of the studied replication policies with two specific load-balancing strategies. The presented results constitute a valuable experience report on performance tuning of object replication policies for application-level fault tolerance.
