Abstract
Introduction
Public transportation networks are consistently indicated as a key player in ensuring sustainable, affordable, and high-quality mobility in urban areas. 1 In reality, much still has to be done to ensure a high level of system performance when providing safe and comfortable transport services to customers. A high level of system performance comes from carefully planned operations and a high availability of the assets during operations (reliability of operations, disturbances, and disruptions) as well as outside operations (workforce for maintenance and repairs, availability of extra vehicles to run the planned services). Keeping a high level of service encompasses these three aspects, which conflict with one another and have a direct impact on capital-intensive issues. 2
This article presents a case study to support decisions by a public transport operator, Haagsche Tramweg Maatschappij (HTM), which manages and operates tram and light rail systems in the region of The Hague, The Netherlands. The main decision to be tackled is to balance maintenance operations (cost and workforce required) versus system performance, expressed in terms of reliability and availability. The maintenance actions of HTM follow a pre-determined
We focus on the braking system of light rail rolling stock (Alstom Citadis); more details are given in Appendix 1. For light rail rolling stock, it is currently the task of the manufacturer to determine a maintenance policy beforehand, based on reliability, availability, maintainability, and safety (RAMS) specifications and on the degradation, failures, and repairs expected in the warranty period. Maintenance tasks and intervals have never been evaluated, and it is suspected (based on experience and gut feeling, not on quantitative indicators) that they are very conservative, resulting in a high workshop load and high maintenance costs. The goal of this article is to report on a test case on dedicated
The main contribution of this article is the application of a comprehensive approach for determining a preventive maintenance (PM) policy to a relevant test case, and its comparison against the current state of practice. To the best of our knowledge, no similar test case bridging data-driven failure modelling and data-driven maintenance modelling has yet been reported in the literature, despite its very high practical interest. The proposed approach goes through the following systematic steps:
Based on available historical data of past repairs and maintenance actions, optimized maintenance policies can be determined, which allow consistent savings compared to current approaches while keeping the operational performance higher than the current state of practice. We schematize the combination of data-driven system modelling, expert knowledge, and company strategies and objectives in Figure 1. We remark that, due to the confidentiality of the data, we can only report aggregated and relative improvements with respect to the company objectives.

Figure 1. Combination of approaches identified to determine the optimal maintenance policy.
The article is organized as follows. Section ‘Literature review on PM’ reports on the existing literature. Section ‘Modelling failure functions’ describes the possibilities given by data-driven approaches for modelling failures and definition of maintenance actions. Section ‘Modelling PM’ determines the maintenance model, and proposes an optimization scheme targeting a set of key performance indicators, which is evaluated in section ‘Results and discussion’. Section ‘Conclusion and recommendations’ concludes the research, with directions for future research. This article is based on Kraijema. 4
Literature review on PM
Applications of PM to rolling stock systems have often analysed and targeted a single subsystem rather than taking a holistic perspective of the overall system (air conditioning system 5 and door systems 6 ). We also point the reader to Giacco 7 for a longer discussion of maintenance issues in rolling stock. In general, PM aims to avoid system failure and to keep the system available and working reliably for longer periods. PM actions are proactive in nature and are performed before the system fails. There are various models of the effects of PM actions on the overall performance of the system; this section gives a short overview of the possible models. For systems where a condition-based monitoring system cannot be easily set up, most research efforts are currently directed towards better quantifying the intervals for PM and modelling the impact of different maintenance levels and maintenance operations. We call a PM policy the set of maintenance actions that are to be performed on specific components or subsystems of a technical system, together with the maintenance intervals for these actions.8,9
There are three types of models that can be used to describe the relation between the failure behaviour of a system and components (SC) and the applied maintenance policy:
All these modelling approaches use the failure distribution function to predict the SC's reliability. The distinctive difference is found in the way the assumed failure behaviour continues after PM actions are performed. PM actions influence the failure times of the SC. PM actions are described in the literature in three typical classifications: perfect, minimal, and imperfect. 10 Perfect PM actions restore the system to an optimal state, that is, the reliability of the system is increased to the 'As Good As New' (AGAN) level. Minimal PM actions restore the system to a state comparable to the state just before the maintenance actions were performed, that is, the reliability is increased by a minimal amount. This is referred to as 'As Bad As Old' (ABAO). Most PM actions performed in real life are neither perfect nor minimal. These in-between actions are often referred to as imperfect PM. In general, imperfect maintenance models can be grouped into the following groups: age reduction models, 11 hazard rate reduction models, 12 combined age-hazard reduction models, 13 and others. 14 A detailed overview of different PM policies can be found in Pham and Wang 15 and Wang. 16
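To make the three classifications concrete, the following minimal sketch uses a simple age reduction model. It is an illustration under stated assumptions, not the model of any of the cited works: the improvement factor `improvement`, the function names, and the Weibull parameter values are all hypothetical. An improvement of 1 corresponds to perfect PM (AGAN), 0 to minimal PM (ABAO), and anything in between to imperfect PM.

```python
def weibull_hazard(age, shape, scale):
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1).

    Assumes age >= 0 and, for age == 0, shape >= 1.
    """
    return (shape / scale) * (age / scale) ** (shape - 1)


def effective_age_after_pm(age, improvement):
    """Age reduction: PM removes a fraction `improvement` of the accumulated age.

    improvement = 1.0 -> As Good As New, improvement = 0.0 -> As Bad As Old.
    """
    if not 0.0 <= improvement <= 1.0:
        raise ValueError("improvement must lie in [0, 1]")
    return (1.0 - improvement) * age


# Hypothetical values, chosen only for illustration (age measured in km).
shape, scale = 2.0, 100_000.0
age_before_pm = 50_000.0

for label, improvement in [("perfect (AGAN)", 1.0),
                           ("imperfect", 0.6),
                           ("minimal (ABAO)", 0.0)]:
    age_after = effective_age_after_pm(age_before_pm, improvement)
    hazard = weibull_hazard(age_after, shape, scale)
    print(f"{label:15s} effective age = {age_after:9.1f} km, hazard = {hazard:.2e} per km")
```

With an increasing hazard rate (shape above 1), the hazard right after PM decreases as the improvement factor grows, which is the behaviour the three classes are meant to capture.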
Cheng et al. 17 proposed a linear PM model that optimizes the PM intervals between preventive replacements by minimizing the cost while maintaining a certain minimum level of reliability. The system's reliability is derived from a Weibull failure rate distribution. An improvement factor
Coria et al. 20 proposed a maintenance model with a more general relation between PM interval and failure rate. Weibull parameters can thus be estimated from real-life failure time data of a system that has been maintained from day one. Imperfect PM actions are performed at fixed intervals
Modelling failure functions
Available data of recorded failures and identification of subsystem
We start from a database of about 2200 failures, repairs, and maintenance actions spanning the 5 years between 1 January 2010 and 31 December 2014. In this period, a uniform maintenance policy was used for the entire fleet of light rail rolling stock. Burn-in of the vehicles is neglected, as the vehicles started operations in 2006. Instead, a period of 2 weeks (estimated with the help of the maintenance engineer (HTM, personal communications, 2015)) is considered after each maintenance action to avoid considering burn-in or imperfect repairs. We filtered the dataset so as to avoid censoring, by restricting it to a set of records where maintenance actions were in any case performed at fixed intervals and were numerous for all vehicles considered. Relaxing this assumption only requires different methods for estimating the failure rate. 21
Among the subsystems, the braking system has been selected as the most relevant one, being responsible for more than a third of failures, costs, and downtime (see Appendix 1). This has a direct impact on the maintenance policy of the rolling stock systems. The braking system is functionally divided into the following four subsystems: brake control, hydraulics, magnetic track, and electro-dynamic (ED) braking. The latter is excluded from the study, as no failure has ever been recorded for it.
Failure data for the remaining three subsystems are available from different sources: vehicle diagnostics system, driver input, work from inspection, and work order data. Each of these data sources was used and cross-checked to obtain detailed failure records. The information provided on the corrective work orders is used to derive the distance to failure for each of the components in the braking system. The mean distance between failures (MDBF) of components in the braking system (the equivalent for transport units of the mean time between failures (MTBF), with the distance covered replacing the time elapsed) can be derived more accurately based on the position of the failed component. When these data are not reported in the computerized maintenance management system (CMMS), the information might be given by the mechanic as a remark on the repair work order.
For each of the subsystems, a standard Weibull distribution was used to characterize the failure rate; the scale and shape factors of the Weibull distributions are determined from recorded failures to model their failure behaviour. Because most failure repairs are imperfect or not effective at all, those distributions cannot be fitted right away.
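As an illustration of how scale and shape factors can be obtained from recorded distances between failures, the sketch below fits a two-parameter Weibull distribution by maximum likelihood. It is a minimal sketch and not the estimation procedure used in the study: it assumes complete (uncensored), strictly positive observations, it ignores the effect of imperfect repairs mentioned above, the function name is hypothetical, and the data are synthetic.

```python
import math
import random


def fit_weibull_mle(distances):
    """Return (shape, scale) of the maximum-likelihood Weibull fit.

    Assumes complete data with at least two distinct, strictly positive values.
    """
    if len(distances) < 2 or min(distances) <= 0:
        raise ValueError("need at least two strictly positive observations")
    top = max(distances)
    x = [d / top for d in distances]  # rescale to (0, 1] to avoid overflow
    mean_log = sum(math.log(v) for v in x) / len(x)

    def score(k):
        # Profile-likelihood equation for the shape k; increasing in k,
        # and its root is the maximum-likelihood estimate.
        powers = [v ** k for v in x]
        weighted = sum(p * math.log(v) for p, v in zip(powers, x))
        return weighted / sum(powers) - 1.0 / k - mean_log

    lo, hi = 1e-3, 100.0  # bracketing interval for the shape (assumption)
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(v ** shape for v in x) / len(x)) ** (1.0 / shape)
    return shape, scale


# Synthetic distances between failures (km), NOT data from the study.
random.seed(0)
sample = [random.weibullvariate(20_000.0, 1.0) for _ in range(2000)]
shape_hat, scale_hat = fit_weibull_mle(sample)
print(f"estimated shape = {shape_hat:.3f}, estimated scale = {scale_hat:.0f} km")
```

An estimated shape close to 1 indicates an approximately constant failure rate, while values above 1 indicate wear-out behaviour.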
Failure rate modelling
The failure rate distribution is modelled by means of a Weibull distribution, where travelled distance is used instead of time. 22 The distance between failures of the components in the braking system is also influenced by the current PM policy. Formally, the failure probability density function is defined as
where
The data give sufficient evidence that the failure rates of the three subsystems are almost constant.
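For reference, the standard two-parameter Weibull model in terms of travelled distance can be written as follows. The symbols are an assumption introduced here for illustration (shape $\beta$, scale $\eta$, distance $x$) and may differ from the notation of the original equation.

```latex
% Standard two-parameter Weibull model over travelled distance x >= 0.
% Notation (beta: shape, eta: scale) is assumed for this illustration.
\begin{align}
  f(x) &= \frac{\beta}{\eta}\left(\frac{x}{\eta}\right)^{\beta-1}
          \exp\!\left[-\left(\frac{x}{\eta}\right)^{\beta}\right], \\
  R(x) &= \exp\!\left[-\left(\frac{x}{\eta}\right)^{\beta}\right], \\
  h(x) &= \frac{f(x)}{R(x)}
        = \frac{\beta}{\eta}\left(\frac{x}{\eta}\right)^{\beta-1}.
\end{align}
% For beta = 1 the hazard rate reduces to the constant h(x) = 1/eta.
```

In this notation, an almost constant failure rate corresponds to a fitted shape factor close to 1, for which $h(x) \approx 1/\eta$.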
Reliability related to maintenance actions
Three PM actions considered in this model are:
The effects that these PM actions have on the reliability of the SC are defined by two improvement factors
with
where
The improvement factors can be defined as a function of some
where
Tsai et al. 3 optimize the system for availability, which also involves determining the relation between reliability and the maintenance actions performed. We here briefly introduce the key relations between those three concepts. The system-level reliability is defined using the Advisory Group on the Reliability of Electronic Equipment (AGREE) method,26,27 assuming that the general system can be decomposed into a series of independent SCs, in this case the four braking subsystems. This leads to the expression of the reliability of the system over time, where
The system-level availability in stage
where MUT is the mean uptime of the system defined as
And MDT is the mean downtime of the system defined as
where
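The two system-level relations discussed in this section can be sketched as follows, under the textbook assumptions of independent subsystems in series and of availability as the ratio of mean uptime to total time. The function names and numerical values are hypothetical, and the way MUT and MDT are built from maintenance and repair times in the model is not reproduced here.

```python
import math


def series_reliability(subsystem_reliabilities):
    """Reliability of independent subsystems in series: the product of the parts."""
    values = list(subsystem_reliabilities)
    if any(not 0.0 <= r <= 1.0 for r in values):
        raise ValueError("reliabilities must lie in [0, 1]")
    return math.prod(values)


def availability(mean_uptime, mean_downtime):
    """Steady-state availability A = MUT / (MUT + MDT)."""
    if mean_uptime < 0 or mean_downtime < 0 or mean_uptime + mean_downtime == 0:
        raise ValueError("uptime and downtime must be non-negative and not both zero")
    return mean_uptime / (mean_uptime + mean_downtime)


# Hypothetical values for three subsystems and one maintenance stage.
r_system = series_reliability([0.97, 0.95, 0.99])
a_system = availability(mean_uptime=950.0, mean_downtime=50.0)
print(f"system reliability = {r_system:.4f}, availability = {a_system:.3f}")
```

Because the system is modelled as a series, the system-level reliability can never exceed that of its least reliable subsystem.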
Modelling PM
Input parameters
The impact of maintenance actions on the failure behaviour is described by multiplying, for each failure mechanism, the failure probability by the improvement probability of the corresponding PM action. We use the available failure records to estimate the failure probabilities per failure mechanism. The improvement probabilities of the PM actions are instead estimated using expert opinion. However, before estimating the improvement probabilities of PM actions, it is necessary to define the content of the PM actions. Table 1 gives an overview of the maintenance tasks that are assumed to be performed when a PM action is applied to the subsystem, and Table 2 gives an overview of the parameters related to each subsystem and failure cause.
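The product of failure probability and improvement probability per failure mechanism can be illustrated as follows. This is a sketch under an explicit assumption: the per-mechanism products are summed into a single improvement factor for the PM action, which is one common aggregation but is not confirmed by the text above. The cause names and all numbers are hypothetical and are not the values of Table 2.

```python
def improvement_factor(failure_probability, improvement_probability):
    """Sum over failure causes of P(cause) * P(PM action removes that cause).

    The summation is an assumed aggregation rule, used here for illustration.
    """
    if set(failure_probability) != set(improvement_probability):
        raise ValueError("both inputs must cover the same failure causes")
    return sum(failure_probability[cause] * improvement_probability[cause]
               for cause in failure_probability)


# Hypothetical failure causes of one subsystem and one PM action.
failure_probability = {"leakage": 0.5, "wear": 0.3, "sensor fault": 0.2}
improvement_probability = {"leakage": 0.8, "wear": 0.5, "sensor fault": 0.0}

factor = improvement_factor(failure_probability, improvement_probability)
print(f"improvement factor of the PM action = {factor:.2f}")
```

A PM action that does not address a dominant failure cause therefore contributes little, no matter how effective it is on the rarer causes.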
Table 1. Identification of subsystems and related PM actions.
PM: preventive maintenance.
Table 2. Parameter values for subsystems, PM actions, and failure probabilities (FP).
PM: preventive maintenance; FP: failure probabilities.
For each subsystem, we identified up to four common failure causes, which together account for the majority of the failures. We report in Table 2 the failure probability (as recorded in the CMMS) per cause and subsystem, and the estimated improvement factors associated with the PM actions, determined with the help of maintenance experts from HTM. Here,
PM model
We can finally determine the link between maintenance intervals and performance indicators: total costs, availability, and reliability. The total
The components of equation (11) that are related to costs are derived from the CMMS, using the internal hourly rates and spare parts costs.
Based on the expression of
The optimal interval
The
where
The values of PM times
Tsai et al. 3 define the minimum reliability at the system level, whereas we define the minimum reliability at subsystem level, as a function of the risk associated with the failure of the subsystem. The risk of failure is calculated by relating the probability of occurrence to the impact on corporate objectives. Combining the calculated risk values with the minimum system reliability, which is set by company policy to 0.85, the minimum reliability of each subsystem is calculated. The AGREE method is used, as presented in equation (7). The probabilities that a system failure is due to a specific subsystem can also be derived from the CMMS: they are 0.42, 0.44, and 0.14 for the brake control system, the hydraulic brake system, and the magnetic track brake, respectively.
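To illustrate how a system-level minimum can be translated into subsystem-level minima, the sketch below uses a simple proportional allocation for a series system: each subsystem receives a share of the allowed unreliability according to its share of system failures. This is a swapped-in simplification and not the AGREE expression of equation (7); it also ignores the impact on corporate objectives. Only the value 0.85 and the three failure shares are taken from the text; the function name and the allocation rule are assumptions.

```python
import math


def apportion_min_reliability(system_min_reliability, failure_shares):
    """Allocate R_i = R_sys ** w_i, with the weights w_i normalized to sum to 1.

    The product of the allocated values then equals R_sys for a series system.
    """
    total = sum(failure_shares.values())
    if total <= 0:
        raise ValueError("failure shares must sum to a positive number")
    return {name: system_min_reliability ** (share / total)
            for name, share in failure_shares.items()}


failure_shares = {
    "brake control": 0.42,
    "hydraulic brake": 0.44,
    "magnetic track brake": 0.14,
}
minimums = apportion_min_reliability(0.85, failure_shares)
for name, value in minimums.items():
    print(f"{name:22s} minimum reliability = {value:.3f}")
print(f"product check = {math.prod(minimums.values()):.3f}")
```

Under this rule, subsystems that cause a larger share of failures are allowed a lower individual minimum, while the product of the three minima still meets the system-level requirement.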
Maintenance interval optimization
The optimization of the maintenance policy relates to choosing the interval of maintenance between maintenance stages
The algorithm is shown in Figure 2. The possible time intervals for maintenance (time_interval) are scanned in sequence within a range {min_time_interval; max_time_interval}. Given a time interval, the timing of the PM actions is fixed. To determine the actions chosen, for all stages

Figure 2. Pseudocode of the optimization approach.
The reliability, availability, and costs are then assessed at system level. The time_interval which leads to the minimum costs, maximum availability, or maximum reliability is selected, depending on the objective. As final output, the optimization model returns the sequence of maintenance actions as well as the evolution of the performance indicators over time. The model is implemented in MATLAB R2014b and reports the optimal maintenance interval together with the sequence of maintenance actions. Depending on the number of intervals evaluated, the entire optimization takes between 5 and 45 min of computation time on a standard computer.
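The outer loop of such an exhaustive scan can be sketched as follows. This is a simplified illustration in Python rather than the MATLAB implementation: the selection of PM actions per stage is omitted, and `toy_evaluate` is a purely hypothetical stand-in for the system-level assessment, with made-up cost, availability, and reliability curves. Intervals are expressed relative to a benchmark interval of 1.0.

```python
import math


def scan_intervals(candidates, evaluate):
    """Evaluate every candidate interval and pick the best one per objective."""
    results = {t: evaluate(t) for t in candidates}
    best = {
        "min_cost": min(results, key=lambda t: results[t]["cost"]),
        "max_availability": max(results, key=lambda t: results[t]["availability"]),
        "max_reliability": max(results, key=lambda t: results[t]["reliability"]),
    }
    return best, results


def toy_evaluate(interval):
    """Hypothetical indicators: NOT the model described in this article."""
    return {
        # Frequent visits are costly; long intervals raise corrective costs.
        "cost": 1.0 / interval + 0.5 * interval,
        # Downtime from frequent visits versus downtime from failures.
        "availability": 1.0 - 0.02 / interval - 0.01 * interval,
        # Reliability decays as the interval between PM actions grows.
        "reliability": math.exp(-0.3 * interval),
    }


# Candidate intervals from 30% to 180% of the benchmark, in steps of 10%.
candidates = [i / 10 for i in range(3, 19)]
best, table = scan_intervals(candidates, toy_evaluate)
for objective, interval in best.items():
    print(f"{objective:17s} -> interval = {interval:.1f} x benchmark")
```

Since every candidate is evaluated independently, the computation time grows linearly with the number of intervals scanned, which is consistent with an exhaustive approach being acceptable for a single system but potentially costly for larger ones.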
Model validation and sensitivity analysis
To validate the PM model of section ‘PM model’, a benchmark PM model was created by setting the system-level maintenance interval
Table 3. Relative difference in availability, PM cost, and CM cost between the calculated and recorded values.
PM: preventive maintenance.
To determine the impact of changes, uncertainties, and error in estimation in the input parameters on the output of the model, we have performed a sensitivity analysis. Precisely, we evaluated the influence of the improvement factors (
Table 4. Parameters and variations considered in the sensitivity analysis.
PM: preventive maintenance.
The sensitivity analyses for availability and total maintenance costs are reported as tornado charts in Figure 3. The reference value of output parameters is given by the vertical axis. The blue/red bars represent the output parameter value for the low/high input parameter values, respectively.

Figure 3. Model sensitivity: availability (left) and maintenance costs (right), in percentage of base case.
Looking at the results of sensitivity analysis for availability, it can be observed that
Results and discussion
Results
We determine quantitatively the best maintenance intervals and their impact on the company objectives; the results correspond to general intuition. Figure 4 shows that the maintenance interval has a significant impact on the PM costs. The costs and intervals are reported relative to the benchmark policy currently implemented. Looking only at PM costs, the minimum is found for a maintenance interval which is about 90% longer. Even though a reduction of the maintenance interval significantly increases the reliability of the system, it also increases the PM cost (and reduces the availability, due to the continuous maintenance visits). This relationship is rather regular. A similar but much stronger cost increase is found for long maintenance intervals. After a certain threshold, PM costs jump significantly because high-level PM actions are required to restore the system to the desired reliability levels, and high-level PM actions yield higher maintenance costs. The erratic behaviour for very long maintenance intervals is due to the interaction of failures and the extensive repairs needed when maintenance is performed.

Figure 4. Total maintenance cost as a function of the maintenance interval, relative to the benchmark.
We now discuss the relationship between maintenance intervals and the performance indicators. Figure 5 shows the relation between maintenance intervals and reliability. The maintenance interval is reported relative to the benchmark, while the reliability is reported in absolute number. Two curves are plotted: the one reporting the mean reliability given a maintenance interval (

Figure 5. System-level reliability as a function of the maintenance interval.
It is evident from the figure how the
Figure 6 reports on direct connections between reliability, availability, and costs. Allowing the subsystem reliability to drop significantly below the minimum reliability

Figure 6. Relative maintenance cost (top) and relative availability (bottom) as a function of the minimum reliability for maintenance intervals between 30% and 180% of the benchmark interval.
The results in the previous sections show that the total maintenance cost and system availability are highly dependent upon the minimum reliability requirements of the system. Both the availability-driven and the cost-driven maintenance policies allow the reliability of the system to drop below the accepted minimum. Table 5 gives the values of the key parameters for the availability-, cost-, and reliability-driven maintenance policies, relative to the current situation.
Table 5. Overview of performance indicators of the optimized maintenance policies.
PM: preventive maintenance.
Discussion
Compared with the benchmark figures of the current policy, both the availability-driven and the cost-driven maintenance policies show a potential maintenance cost reduction of 30%. Moreover, they maximize the system availability without sacrificing the current mean and minimum reliability; the maintenance intervals are almost 100% longer. The reliability-driven maintenance policy shows a potential increase of 20% in the average system reliability and of 48% in the minimum system reliability, without compromising the availability of the system or raising the total maintenance cost significantly. Increasing the reliability is related to achieving higher customer satisfaction. Interesting directions for an integral assessment of maintenance in transport systems point towards the level of service, by quantifying uncertain travel time and delay in operations 28 or by evaluating the societal costs of delay as perceived by the different users.
A key feature of the system is the strong relation between a reduction in the time required for maintenance (related to workforce employed, and quality of maintenance actions) and an improved availability of the system. Quality of maintenance actions is very important: the cost increase due to reduction of
Conclusion and recommendations
This article describes a case study of an optimized maintenance policy for a light rail braking system, gaining insight from readily available work order data and synthesizing the approach from a few separate works in the reliability literature. We use a data-driven approach to determine the failure rates for a specific subsystem and integrate this into a maintenance model relating maintenance actions and improvements. We perform an exhaustive optimization to find the best maintenance policy based on reliability, availability, and maintenance cost. We found that, when focusing on availability and cost, the reliability of the system would drop below the accepted minimum, while allowing for substantial cost savings. The maintenance policy based on reliability improves reliability significantly without significantly increasing the maintenance cost, compared to the benchmark policy currently in place. In general, extending maintenance intervals needs to be done carefully because maintenance costs are discontinuous and have sudden jumps.
We recommend exploring further possibilities to optimize the maintenance intervals based on multi-component optimization, which could then expand beyond the braking system and encompass multiple systems with more complex economic/structural dependence.29,30 This would result in more complex expressions of the failure rates, and in expressions of reliability and availability that depend on more processes. Finally, optimizing the PM interval for such a situation would need to account for agreement between systems, the severity of the failure of the different systems, and the availability of different workforce for performing the required check/maintain operations. An exploratory step in this direction, towards multi-component systems, is given in Haans. 31 The maintenance interval could be further optimized by combinatorial optimization methods. 32 In our work, we used an exhaustive numerical optimization approach in which we investigated maintenance actions for specific maintenance intervals. The computation time is acceptable for the current setup, but more complex systems might require more sophisticated approaches. As the maintenance time has been found to be crucial for the availability of the system, the workshop capacity could be studied in more detail. The company showed strong interest in the theoretical work described here, which has been picked up by maintenance managers in their vision. A (gradual) path towards implementation of the PM policy is therefore a very interesting idea.
