Reliability 101

What is reliability?

A product or process is considered to be reliable if it functions properly when expected. More specifically, in the field of engineering, reliability is often described in the following terms:

Reliability is the probability that a system will operate successfully for a specified period of time, under specified conditions, when used for the manner and purpose for which it was intended.

Reliability engineering is a branch of engineering aimed at predicting, analyzing, and preventing or mitigating failures over time.

Why is reliability important?

Overall corporate success depends on the reliability of a company's products, processes, and services. Reliability is a top customer concern and is increasingly cited by customers as a major factor in purchasing decisions. What is significant about reliability is that it is industry independent: no matter what an organization offers to the marketplace, reliability plays a key role in determining its success.

Reliability plays a role in many aspects of business. A few examples of the impact of reliability on key corporate objectives are listed below:

  • Reputation. A company's reputation is very closely related to the reliability of its products. The more reliable a product is, the more likely the company is to have a favorable reputation.
  • Warranty costs. If a product fails to perform its function within the warranty period, its repair or replacement negatively affects profits. Additionally, if the failure is large in magnitude or causes injury or death, the company may become the focus of unwanted negative attention. Introducing reliability analyses in the early design stage is an important preventive step that ultimately leads to a more reliable product.
  • Future business. A concentrated effort towards improved reliability shows existing customers that the manufacturer is serious about its product and committed to customer satisfaction. This type of attitude has a positive impact on future business.
  • Competitive advantage. Many companies publish their predicted or field-measured reliability metrics to help gain an advantage over competitors who either do not publish their metrics or have poor reliability metrics.
  • Contract requirements. Many customers in today's market demand that their suppliers have an effective reliability program. These customers have learned the benefits of reliability analysis from experience.
  • Cost of ownership. Manufacturers may take reliability data and combine it with other cost information to illustrate the cost-effectiveness of their products. This life cycle cost analysis can show that although the initial cost of their product might be higher, the overall lifetime cost is lower than that of a competing product because their product requires fewer repairs or less maintenance.
  • Customer satisfaction. While a reliable product may not dramatically affect customer satisfaction in a positive manner, an unreliable product will severely affect customer satisfaction in a negative way. Thus, high reliability is ultimately a mandatory requirement for customer satisfaction.
  • Safety regulations. Manufacturers of highly critical devices must design their products to comply with current safety regulations. Failure to meet standards for protecting the health and safety of users and/or the general public can result in breach of contract and even loss of life.

Common Reliability Measurements

Because reliability is such a crucial element of business success, analysis techniques and methods have been developed over time to measure reliability and to help companies improve areas of weakness. The field of reliability engineering is devoted to the development of tools and techniques to help companies reach their reliability goals.

There are many metrics commonly used in reliability engineering to help assess reliability. Oftentimes, these metrics must be considered reference points. These metrics can help determine whether improvements have been made and if goals are being achieved. Additionally, metrics are important for establishing a baseline for reliability analysis.

The following table describes some of the most common reliability measures that can be obtained from reliability analyses. While there are many more reliability measurements available, these parameters are widely used.

Reliability Measurement Table

Reliability Measure | Description
Failure Rate | The expected rate at which failures occur, that is, the number of failures in a specified time period. Failure rate is typically expressed in failures per million or billion hours. For example, if your television has a failure rate of five failures per million hours, you can watch one million hour-long television shows and expect a failure during only five of them.
Mean Time Between Failures (MTBF) | The average number of operating hours between failures. MTBF is typically expressed in hours. As an example, if the MTBF of your television is 1000 hours, you can on average watch it for 1000 hours before it fails. This does not mean that your television will fail exactly in the 1000th hour of watching; 1000 hours is the average operating time between failures, and an individual failure may occur earlier or later. For many systems (those defined as "constant failure rate systems"), MTBF is the inverse of the failure rate. For example, if the failure rate is two failures per million hours, the MTBF is 1,000,000/2, or 500,000 hours.
Mean Time to Failure (MTTF) | The average time to failure for a system that is not repairable. For example, a satellite is typically not repairable, so MTTF is used as the metric for measuring its reliability. Once the failure occurs, the system cannot be repaired or replaced. This differs from a repairable system such as the television in the example above: if your television fails around the 1000th hour, you get it repaired, and it then operates for about another 1000 hours (on average) before it needs to be repaired again.
Reliability | The probability that an item will perform a required function without failure under stated conditions for a stated period of time. Reliability is significant because it takes time into account. It answers the question: "How likely is it that my system will remain operational over a certain period of time?" Because reliability is expressed as a probability, it is always a value between 0 and 1.
Availability | The probability that a system is operating at a particular point in time. It answers the question: "How likely is it that my system is operating X hours out of a Y hour time frame?" Availability differs from reliability because it factors repairs into the measurement. To determine availability, the time to perform a repair must be known. Because availability is expressed as a probability, it is a value between 0 and 1.
Mean Time to Repair (MTTR) | The average time to return a failed item to an operable state. MTTR is normally expressed in hours and indicates how long it takes to repair a system that is down due to a failure. Generally, MTTR does not include logistics time, such as the time required to receive a replacement part; it covers only the actual time it takes to correct the problem.
Unreliability | The complement of reliability. If the reliability is 0.9, the unreliability is 0.1 (1.0 - 0.9). It is a probability value between 0 and 1 that indicates the likelihood that a system will not operate continuously up to a specified point in time.
Unavailability | The complement of availability. It is a probability value between 0 and 1 that indicates the likelihood that a system is not operational at a particular point in time.
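
The relationships among several of these measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a constant failure rate (exponential time to failure) and the commonly used steady-state approximation availability = MTBF / (MTBF + MTTR). The function names and the 4-hour MTTR are hypothetical and chosen for the example; they do not come from any particular tool.

```python
import math

def mtbf_from_failure_rate(failures_per_million_hours: float) -> float:
    """MTBF in hours, assuming a constant failure rate (MTBF = 1 / rate)."""
    return 1_000_000 / failures_per_million_hours

def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of no failure during t_hours, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability approximation: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

mtbf = mtbf_from_failure_rate(2)   # two failures per million hours -> 500,000 hours
r = reliability(1000, mtbf)        # chance of surviving a 1000-hour period
a = availability(mtbf, 4.0)        # 4.0 hours is a hypothetical MTTR

print(f"MTBF           = {mtbf:,.0f} hours")
print(f"Reliability    = {r:.6f}")
print(f"Unreliability  = {1 - r:.6f}")
print(f"Availability   = {a:.6f}")
print(f"Unavailability = {1 - a:.6f}")
```

Unreliability and unavailability are obtained as complements, matching the definitions in the table. Systems whose failure rate changes over time need a different model than the exponential one assumed here.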

There are many sources of confusion surrounding the use of reliability measures, especially with regard to MTTF and MTBF.
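
One frequent misreading is that an item with an MTBF of 1000 hours will probably last 1000 hours. Under a constant failure rate assumption (an assumption of this sketch, not a property of every system), the probability of running failure-free for a period equal to the MTBF is e^-1, or roughly 37%. The function name below is hypothetical.

```python
import math

def survival_probability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of no failure by t_hours, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)

mtbf = 1000.0  # hours, the television example from the table

# Survival probability at several operating times, including t == MTBF.
for t in (100, 500, 1000, 2000):
    p = survival_probability(t, mtbf)
    print(f"P(no failure in first {t:>4} h) = {p:.3f}")
```

In other words, the MTBF is an average over many items or many failure cycles, not a guaranteed life for any single unit.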

What reliability analysis techniques are currently available?

Because the reliability field is broad, there are many different types of reliability analysis techniques available. The table below outlines some of the more widely used analyses for reliability assessment. For organizational purposes, these techniques have been classified into three categories: Reliability Evaluation, Risk Assessment, and Specific Use Tools. However, a technique is not limited to a single category (e.g. FTA can be used for both Risk Assessment and Reliability Evaluation).

Reliability Analysis Technique | Description
Reliability Evaluation
The techniques in this category are used to determine system reliability metrics.
Reliability Prediction | Used to predict the likely failure rates of components and/or systems. Predicted failure rates are computed using statistically developed models based on component and environmental parameters.
Reliability Block Diagram (RBD) | Used to assess reliability metrics of complex systems that employ redundancy and other methods to increase reliability. RBDs use sophisticated modelling techniques, including system simulations, to determine reliability.
Markov Analysis | A very broad analysis tool that can be used to model a wide array of complex, state-dependent systems. Markov analysis is used to calculate reliability results for varied systems, especially those with sequence dependencies.
Risk Assessment
The techniques in this category identify potential risk areas and assess techniques or actions that can be used to minimize risk factors. These techniques are used extensively in safety-related systems.
Failure Modes and Effects Analysis (FMEA) | Used to identify the effects of various failure modes on a system, determine how critical the effects of these failures are, and identify actions to eliminate or mitigate those failure modes.
Fault Tree Analysis (FTA) | Used to identify all possible causes (or combinations of causes) that result in a system failure.
Specific Use Tools
The techniques in this category are used for a particular type of reliability assessment.
Failure Reporting, Analysis and Corrective Action System (FRACAS) | A widely used process for recording field failures, analyzing those failures for root cause, identifying trends across a population of products, and tracking corrective and preventive actions intended to preclude recurrence of the root cause.
Weibull Analysis | A general-purpose statistical tool for assessing failure time distributions via optimization techniques and predictive analyses. Used for failure forecasting based on samples of actual field or life data.
Accelerated Life Testing (ALT) | Uses statistical methods to analyze accelerated life testing data, identify characteristic failure behavior, and predict product reliability.
Maintainability Prediction | Used to determine the Mean Time to Repair (MTTR) for various components to aid in the development of maintenance plans for the system.
Life Cycle Cost Analysis | Used to determine the total lifetime costs of the system, factoring in metrics such as failure rates and repair costs.
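
To give a feel for how a Reliability Block Diagram treats redundancy, the following sketch combines block reliabilities in series and in parallel. It assumes the blocks fail independently, which real systems may not satisfy, and it covers only these two basic arrangements rather than the simulation-based modelling mentioned in the table. The component names and reliability values are hypothetical.

```python
from math import prod

def series(reliabilities):
    """All blocks must work: the system reliability is the product."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one block must work: 1 minus the chance that all fail."""
    return 1 - prod(1 - r for r in reliabilities)

# Hypothetical reliabilities over the same mission time.
pump = 0.90
controller = 0.99

single_path = series([controller, pump])            # one pump, no redundancy
redundant_pumps = parallel([pump, pump])            # two pumps, either is enough
system = series([controller, redundant_pumps])      # controller feeding the pump pair

print(f"Without redundancy: {single_path:.4f}")
print(f"With redundancy:    {system:.4f}")
```

Adding the second pump raises the pump stage from 0.90 to 0.99 in this example, which is the kind of improvement an RBD is used to quantify.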

Oftentimes, the reliability analyst may employ several tools depending on requirements. For example, if you were required to forecast the rate at which the system will fail, you would have to perform a reliability prediction. If you were also required to identify all of the possible events that could result in a system failure, you would have to perform an FMEA.
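
In its simplest form, a reliability prediction for a system in which any component failure causes a system failure adds up the component failure rates. The sketch below assumes constant, independent failure rates; the component names and rates are hypothetical, and real prediction models also apply component and environmental factors that are omitted here.

```python
# Hypothetical component failure rates, in failures per million hours.
component_failure_rates = {
    "power supply": 2.0,
    "tuner": 0.5,
    "display panel": 1.5,
    "main board": 1.0,
}

# Series assumption: the system fails when any one component fails,
# so the system failure rate is the sum of the component rates.
system_rate = sum(component_failure_rates.values())

# With a constant failure rate, MTBF is the inverse of the failure rate.
system_mtbf_hours = 1_000_000 / system_rate

print(f"System failure rate: {system_rate} failures per million hours")
print(f"System MTBF:         {system_mtbf_hours:,.0f} hours")

# Rank components by contribution to highlight where improvement pays off most.
for name, rate in sorted(component_failure_rates.items(), key=lambda kv: -kv[1]):
    print(f"  {name:<14} {rate / system_rate:.0%} of system failure rate")
```

The ranking at the end illustrates how a prediction can point to the components that contribute most to system failures.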

Summary

The importance of reliability for business success is undeniable. To accurately track and measure reliability parameters, a wide array of techniques has been developed. Many of these analyses can be performed using software tools developed for the reliability practitioner. Windchill Quality Solutions is a reliability analysis toolset that can help you perform a number of these analysis types. For additional information about Windchill Quality Solutions software products, please visit crimsonquality.com/products.