Grid MTBF or Failure Rate

We have a grid of 20 computers where each piece of data is written to at least two computers. A failure, i.e. loss of data, occurs only when two servers fail at the same time. If each computer has an MTBF of 45k hours, how do I go about calculating the system MTBF, where a system failure only occurs when 2 or more nodes fail at the same time?

We can assume 0 time to repair, because within less than a minute the backup data becomes primary and the primary creates a backup across the remaining members of the grid.

Thanks, Gary

RE: Grid MTBF or Failure Rate

I would approach it with Monte Carlo simulation: model the grid in software, run a few million simulated lifetimes, and compute the statistics you need from the results.
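A minimal sketch of that Monte Carlo approach (my own illustration, not from this thread): with instant repair, each machine's failures form a memoryless process with rate 1/MTBF, so the grid's pooled failures arrive at rate N/MTBF, and data loss is declared when two failures on different machines land within the roughly 1-minute replication window the OP describes. The 1-minute window and trial count are assumptions for the sketch.

```python
import random

MTBF = 45_000.0      # hours, per machine (from the thread)
N = 20               # machines in the grid
WINDOW = 1.0 / 60.0  # assumed 1-minute replication window, in hours
TRIALS = 80          # kept small so plain Python finishes quickly

def time_to_data_loss(rng):
    """Simulate pooled failure arrivals (rate N/MTBF; instant repair keeps
    the process memoryless) until two failures on different machines land
    within WINDOW of each other, and return the elapsed time in hours."""
    t, last_t, last_m = 0.0, float("-inf"), -1
    while True:
        t += rng.expovariate(N / MTBF)  # next failure anywhere in the grid
        m = rng.randrange(N)            # which machine failed this time
        if m != last_m and t - last_t < WINDOW:
            return t                    # overlapping failures -> data loss
        last_t, last_m = t, m

rng = random.Random(1)
mean = sum(time_to_data_loss(rng) for _ in range(TRIALS)) / TRIALS
print(f"estimated system MTBF ~ {mean:.3g} hours (~{mean / 8766:,.0f} years)")
```

With these inputs the estimate should land on the order of a few hundred million hours, matching the back-of-envelope rate N·(N-1)·(1/MTBF)²·WINDOW.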

RE: Grid MTBF or Failure Rate

If you have 20 computers, how is it that you have a system failure if only two go down?  Are you referring to the two computers to which you have just written the last data stream?  In other words, do you really have 10 data streams and always write each stream to the same two computers, or is each data stream written to at least 2 computers chosen at random from the 20?  

"Pumping accounts for 20% of the world's energy used by electric motors and 25% to 50% of the total electrical energy usage in certain industrial facilities." - DOE statistic  (Note: Make that 99.99% for pipeline companies) http://virtualpipeline.spaces.live.com/

RE: Grid MTBF or Failure Rate

Data is federated across all 20 computers, with each piece of data stored on exactly 2 of them. So if I lose 2 computers at the same exact time (or within the replication window, which is a matter of seconds), I will have data loss. If I lose 1, its backup replicates to the other servers, and then losing another is no problem. Basically, think of 4M pieces of data federated across 20 computers; the federation scheme guarantees even distribution.

I'm really only concerned with losing 2 computers at the same exact time.

RE: Grid MTBF or Failure Rate

Some quick calculations here.

There are 20C2, or 190, different possible computer/computer pairings for a specific data set at a given time. Of those 190 pairings, your definition of failure involves only 1 of them. Based on that alone, I would say the reliability is officially "very high".

To put some more meat on this: if you assume computer failures follow an exponential distribution (not sure how accurate that is for computers), then the reliability of 1 computer is exp(-t/MTBF).
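As a quick numeric check of that formula (my own example, using the thread's 45k-hour MTBF and an arbitrary one-year mission time):

```python
import math

MTBF = 45_000.0  # hours, per machine (from the thread)
t = 8_766.0      # one year, in hours

# Exponential-model reliability: probability one machine survives to time t
R = math.exp(-t / MTBF)
print(f"P(one machine survives a year) = {R:.3f}")  # ~0.823
```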

Since the exponential distribution is continuous, the probability of the second computer failing at exactly the same instant approaches 0. Let's be conservative and assume it's 1/100th of the first probability; this implies a higher reliability for the second computer.

The two would operate roughly in parallel fashion, giving a 2-computer system reliability of 1 - (1 - exp(-t/MTBF))*(1 - exp(-t/MTBF)).

Apply this probability to the binomial distribution to see the probability of occurrence: with R_pair = 1 - (1 - exp(-t/MTBF))^2, the probability that exactly one of the 190 pairs fails is 20C2 * (1 - R_pair)^1 * (R_pair)^189. This quickly approaches 0.
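To put a rough number on the original question (my own back-of-envelope, taking the OP's clarification that every pair of machines holds some data, and assuming the ~1-minute replication window as the coincidence window Δ): the rate of two distinct machines failing within Δ of each other is approximately N·(N-1)·λ²·Δ, and the system MTBF is its reciprocal.

```python
MTBF = 45_000.0      # hours, per machine
N = 20               # machines in the grid
WINDOW = 1.0 / 60.0  # assumed 1-minute replication window, in hours
lam = 1.0 / MTBF     # per-machine failure rate, per hour

# Any of the N machines fails at rate N*lam; the chance that one of the
# other N-1 machines also fails inside the window is ~ (N-1)*lam*WINDOW.
loss_rate = N * (N - 1) * lam * lam * WINDOW   # data-loss events per hour
system_mtbf = 1.0 / loss_rate
print(f"system MTBF ~ {system_mtbf:.3g} hours (~{system_mtbf / 8766:,.0f} years)")
```

That works out to roughly 3.2e8 hours, on the order of 36,000 years, consistent with the "very high" conclusion above.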

I would never expect to see this failure mode occur in anyone's lifetime, based strictly on the computers' reliability performance. Best of luck!
