Dealing adequately with technical uncertainties

Statistics, RAMS & Quality Management
Principles of MTBF calculations, basic assumptions and consequences
  What does MTBF mean
  How do MTBF calculations work
  MTBF calculations according to standards
  Cost of MTBF calculations

As said earlier, MTBF means Mean Time Between Failures, which is the average time between two consecutive failures of an item. MTBF values are usually given in hours.
MTBF and the so-called failure rate have a reciprocal relationship: MTBF = 1 / Failure Rate, and Failure Rate = 1 / MTBF.
While MTBF seems to be more intuitive, it is quite difficult to handle in calculations, because MTBF is not an additive metric. Therefore, the (additive) Failure Rate is the preferred metric used in MTBF calculations, since failure rates of piece parts simply add up to the failure rate of the assembly. 
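
As a short worked example of why failure rates add up while MTBFs do not: the symbols and the two part values below are illustrative assumptions, not data from a real product, and the sum assumes that every part failure counts as an assembly failure.

```latex
\lambda_{\mathrm{assembly}} = \sum_{i=1}^{n} \lambda_i ,
\qquad
\mathrm{MTBF}_{\mathrm{assembly}}
  = \frac{1}{\lambda_{\mathrm{assembly}}}
  = \frac{1}{\sum_{i=1}^{n} 1/\mathrm{MTBF}_i}

% Two parts with MTBF_1 = 100000 h and MTBF_2 = 50000 h:
\lambda_{\mathrm{assembly}}
  = \frac{1}{100\,000\ \mathrm{h}} + \frac{1}{50\,000\ \mathrm{h}}
  = 3 \times 10^{-5}\ \mathrm{h}^{-1},
\qquad
\mathrm{MTBF}_{\mathrm{assembly}} \approx 33\,333\ \mathrm{h}
```

The assembly MTBF is lower than that of either part, and it is clearly not the sum of the two part MTBFs (150,000 h).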

MTBF calculations are the cornerstone of almost all quantitative reliability and safety analyses. Without MTBF calculations, these analyses could not even exist. Sometimes, however, MTBF values are used as a mere sales pitch.

MTBF calculation of a system, in simple terms, means determining the failure rate of every single component and then adding all these failure rates up in order to obtain the system failure rate (= the reciprocal of the system MTBF).
MTBF calculation requires both component-specific parameters and global parameters. Component-specific parameters would be the resistance of a resistor, the viscosity of the grease of a gear, the number of switching actions per hour of a switch, and so on, while global parameters would be ambient temperature and environment type.
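
The roll-up described above can be sketched in a few lines of Python. This is only a minimal sketch: the `BomLine` structure, the part names and the failure rates are hypothetical placeholders, and the rates are expressed in FIT (failures per 10^9 hours), a common unit that is assumed here. In a real calculation, each rate would come from the failure rate model of the selected standard.

```python
from dataclasses import dataclass

FIT_PER_HOUR = 1e-9  # 1 FIT = 1 failure per 10^9 operating hours


@dataclass
class BomLine:
    part: str
    quantity: int
    fit_each: float  # hypothetical failure rate of one piece, in FIT


def system_failure_rate_fit(bom):
    """Sum the failure rates of all pieces (series assumption)."""
    return sum(line.quantity * line.fit_each for line in bom)


def system_mtbf_hours(bom):
    """System MTBF is the reciprocal of the system failure rate."""
    return 1.0 / (system_failure_rate_fit(bom) * FIT_PER_HOUR)


# Hypothetical bill of materials, for illustration only.
example_bom = [
    BomLine("resistor", 40, 1.0),
    BomLine("ceramic capacitor", 30, 2.0),
    BomLine("microcontroller", 1, 50.0),
    BomLine("connector", 4, 12.5),
]

print(f"System failure rate: {system_failure_rate_fit(example_bom):.1f} FIT")
print(f"System MTBF: {system_mtbf_hours(example_bom):,.0f} h")
```

With these made-up values the failure rates sum to 200 FIT, which corresponds to a system MTBF of about 5 million hours.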

While there are virtually no standards available for mechanical components, the reliability analyst can choose from at least six international standards for electronic components. Therefore, MTBF calculation almost always means the calculation of the MTBF of electronic systems.

Depending on the standard selected, MTBF results can differ considerably. Differences by a factor of 3 are quite usual at PCB level, and even a factor of 10 is not uncommon. The reason lies not only in the different approaches of the standards, but also in the uncertainty of the assumptions made in these approaches.

Despite the lack of standards for mechanical components, MTBF calculations are sometimes performed for mechanical equipment. These calculations, however, are even more uncertain than those for electronic equipment. They are very often based on rough estimations, comparisons with similar equipment, engineering judgment, parametric approaches, etc., or on the so-called NPRD-1995 catalog (Nonelectronic Parts Reliability Data), published by RiAC. While the successors NPRD-2011 and NPRD-2016 are obviously newer, their coverage (number of different component types addressed) is significantly lower than that of NPRD-1995.

Now let's focus on electronic equipment

Most MTBF calculations are performed using just bills of materials (BOMs) as they are exported from ERP systems. Such BOMs usually contain sufficient information to assess the required parameters. If not, the manufacturer part numbers (also contained in BOMs) can be used for internet research in order to obtain those parameters that cannot be derived directly from the BOMs.

The wording "using just bills of materials" means that electrical schematics aren't used at all for most MTBF calculations.
An important contributor to component failure rates is the relative electrical stress of components in comparison to their ratings. Electrical stress strongly depends on the electrical context, and therefore electrical schematics would be the preferred means for assessing component stress.
However, practical experience shows that using average electrical stress values for all components makes almost no difference in MTBF at system level in comparison to assessing every component individually.
Since the system failure rate is just the sum of all component failure rates, and provided that reasonable average stress levels are applied, the MTBF analyst doesn't even need to understand the electrical schematics in order to calculate a valid system MTBF. The analyst only needs to know the BOMs.
Apart from that, this approach (average stress for all components) is a significant time saver in MTBF calculation.

Safety analyses distinguish between dangerous failure modes and safe failure modes. Therefore, it would be important to know the exact failure rates of individual components, and as a consequence, it would be necessary to assess individual stress levels for every component.
However, even safety standards like ISO 13849 suggest that it can be acceptable to use average stress levels in safety analyses.

The fact that using average stress values instead of individual stress levels yields almost the same result at system level has several reasons:
  1. For some component types, failure rate models don't even ask for electrical stress.
  2. The depth of failure rate models strongly depends on the component type. Some models are quite detailed and use many parameters, while others are quite simplistic and use perhaps only one parameter. The simplistic models, which do not take electrical stress as a parameter, tend to yield higher failure rates, while the detailed models, which do, tend to yield lower failure rates.
  3. According to the so-called central limit theorem, the sum of many independent errors results in a relatively small total error.
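
The third reason can be illustrated with a small simulation. This is a sketch under stated assumptions, not a proof: every component is given the same true failure rate, and each estimate carries an independent, zero-mean error of up to ±50 % (the function name and all values are hypothetical). If the average stress level were systematically too high or too low, the errors would not cancel in this way.

```python
import random


def mean_relative_error(n_components, per_part_error=0.5, trials=200, seed=1):
    """Average relative error of the summed failure rate over many trials.

    Each component has a true failure rate of 1.0 (arbitrary unit); its
    estimate is off by an independent random factor within +/- per_part_error.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_sum = float(n_components)
        estimated_sum = sum(
            1.0 + rng.uniform(-per_part_error, per_part_error)
            for _ in range(n_components)
        )
        total += abs(estimated_sum - true_sum) / true_sum
    return total / trials


for n in (1, 10, 100, 1000):
    print(f"{n:5d} components: mean relative error of the sum = "
          f"{mean_relative_error(n):.4f}")
```

For comparable components, the relative error of the sum shrinks roughly with the square root of the number of components, so a board with hundreds of parts is far less sensitive to individual misjudgments than a single part is.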
While MTBF calculations seem to be straightforward, the theory behind these calculations is not.
It starts with the so-called bathtub curve, in particular the middle section of that curve.

Constant Failure Rate

The bathtub curve is an idealized sketch showing the failure rate of a product over time. The middle section of that curve has constant failure rate (and therefore constant MTBF) and represents the useful product life phase.
Constant failure rate is far more than just a simplification of some more complicated behavior. The mathematical term "constant failure rate" is equivalent to the term "random failures", and this in turn is the same as saying "this is a mature and perfectly designed product without any systematic failures".
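
In formulas, a constant failure rate corresponds to an exponentially distributed time to failure. The relations below are the standard textbook ones rather than anything specific to a particular standard; the symbols (failure rate, reliability function R(t), failure probability F(t)) are introduced here only for illustration.

```latex
R(t) = e^{-\lambda t},
\qquad
F(t) = 1 - R(t) = 1 - e^{-\lambda t},
\qquad
\mathrm{MTBF} = \int_{0}^{\infty} R(t)\,\mathrm{d}t = \frac{1}{\lambda}

% Probability of running failure-free up to t = MTBF:
R(\mathrm{MTBF}) = e^{-1} \approx 0.368
```

Under this model, only about 37 % of the units are expected to reach an operating time equal to the MTBF without a failure, so the MTBF should not be mistaken for a guaranteed failure-free period.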

Every systematic failure mode can, at least theoretically, be eliminated by design, but there is no means at all to address random failures.
Random failures can be compared with background noise: it is always present and cannot be avoided. Random failures are caused by acts of nature beyond any control.
Due to their random nature, individual random failures are generally unpredictable. In practice this means that there is no way to predict which unit will fail at what point in time.
However, what can be predicted is the number of failures during a period of time.
Summarizing the above:
It is generally predictable how many units will fail during a given period of time, but it is impossible to predict which units will fail and when they will fail.
On the one hand this is a restriction, but on the other hand it makes field data evaluation easy: the only thing that needs to be known is the number of units that failed within a period of time.
From a practical statistical viewpoint, the random failure approach needs only very few data points (at least 3) in order to yield a valid model.
These two circumstances, 1. no need to know which units failed when, and 2. only few data points needed, are what make it possible at all for many companies to evaluate their field data.
More sophisticated failure rate models would require both more data points (= more failures) and the exact knowledge of individual operating times.
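
A minimal sketch of such a field data evaluation is shown below. All numbers are hypothetical, and the estimate assumes that failed units are repaired or replaced promptly (or that failures are few), so that units in the field times period length is a fair approximation of the cumulative operating time.

```python
def estimate_failure_rate(failures, units_in_field, period_hours):
    """Point estimate of a constant failure rate from field counts only."""
    cumulative_hours = units_in_field * period_hours
    return failures / cumulative_hours


def expected_failures(rate_per_hour, units_in_field, period_hours):
    """Expected number of failures in a fleet over a given period."""
    return rate_per_hour * units_in_field * period_hours


# Hypothetical field data: 1000 units running for one year (8760 h), 12 failures.
rate = estimate_failure_rate(failures=12, units_in_field=1000, period_hours=8760)
print(f"Estimated failure rate: {rate:.3e} per hour")
print(f"Estimated MTBF: {1.0 / rate:,.0f} h")

# Forecast for a larger fleet of 2500 units over the following year.
print(f"Expected failures: {expected_failures(rate, 2500, 8760):.1f}")
```

Note that the sketch never asks which unit failed or when. It forecasts how many failures to expect, not which units will be affected.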

Consequences of constant failure rate
  1. If failures occur only randomly, preventive maintenance makes no sense at all because preventive maintenance addresses predictable failures.
  2. A further consequence of constant failure rate is that products don't get older; they are, so to speak, always as good as new. The random failure model not only means that future failures are unpredictable, it also means that there is no way to determine how long units have already been running without failure. In other words, in the random failure model there is no way to distinguish between older units and new units.
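
The second consequence is the so-called memoryless property of the exponential distribution and can be checked numerically. The sketch below assumes the exponential reliability function that goes with a constant failure rate; the failure rate, ages and mission time are arbitrary illustrative values.

```python
import math


def reliability(rate_per_hour, hours):
    """Probability of surviving 'hours' without failure at constant rate."""
    return math.exp(-rate_per_hour * hours)


def conditional_survival(rate_per_hour, age_hours, mission_hours):
    """Probability that a unit which has already survived 'age_hours'
    also survives a further 'mission_hours'."""
    return (reliability(rate_per_hour, age_hours + mission_hours)
            / reliability(rate_per_hour, age_hours))


rate = 1e-5  # hypothetical failure rate per hour
for age in (0, 10_000, 50_000):
    p = conditional_survival(rate, age, mission_hours=1_000)
    print(f"age {age:6d} h: survival over the next 1000 h = {p:.6f}")
```

The survival probability for the next 1000 hours comes out the same for a brand-new unit and for one that has already run 50,000 hours, which is also why replacing old units preventively gains nothing in this model.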

Serial Model

The so-called serial model is a further basic assumption on which MTBF calculations rely. It means that all components of a system are assumed to work in a series chain: if any component fails, the whole system is assumed to fail. This is of course very often not realistic, because not every component failure actually brings the whole system down.
The consequence of the series chain assumption is that MTBF calculations tend to be pessimistic.
A more realistic characterization of system behavior would require additional and deeper analysis methods like FMEA, Fault Tree, Markov, or Reliability Block Diagrams.
These methods are not used instead of MTBF calculations, but in addition to MTBF calculations. They cannot replace MTBF calculations because they are based upon component failure rates.
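
To illustrate why the series assumption is pessimistic, the sketch below compares two identical units treated as a series chain with the same two units treated as a redundant pair (the system works as long as at least one unit works, without repair). The failure rate and the mission time are hypothetical, and the redundant pair merely stands in for the kind of structure that a Reliability Block Diagram would capture.

```python
import math


def series_reliability(rates_per_hour, hours):
    """Series chain: the system survives only if every component survives."""
    return math.exp(-sum(rates_per_hour) * hours)


def redundant_pair_reliability(rate_per_hour, hours):
    """Two identical units in parallel: the system fails only if both fail."""
    unit_reliability = math.exp(-rate_per_hour * hours)
    return 1.0 - (1.0 - unit_reliability) ** 2


rate = 1e-5      # hypothetical failure rate of one unit, per hour
mission = 10_000  # hypothetical mission time in hours

print(f"Series assumption: {series_reliability([rate, rate], mission):.4f}")
print(f"Redundant pair:    {redundant_pair_reliability(rate, mission):.4f}")
```

Both functions take the same component failure rate as input, which reflects the point made above: the deeper methods refine the system structure but still depend on the failure rates delivered by the MTBF calculation.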

