Steady state theory and hard drives

Originally Posted By: emg
Originally Posted By: d00df00d
Won't that just cause data corruption? Doesn't sound like something that'd lead to drive failure.


Same thing, from the user's viewpoint.

Sounds like this thread is from an admin's perspective, not a user's...
 
Originally Posted By: Subdued
We're not talking about a desktop "server" here

Even 14 years ago Dell's PERC was a pretty darn good RAID controller.

Replace one drive, wait for rebuild, replace another drive, etc. and there would be literally no downtime

Uh... no. I have a lot of experience with 14 year old PERC Controllers.
There is no "another" nor any "etc". RAID 5 (and there was no RAID 6 in old PERCs back then) can only rebuild from a single drive failure, no matter how many drives make up the array. Thus, no "another". When RAID 6 came along that number rose to two, so there is no "etc". Even 10 years ago not every PERC sold offered RAID 6.
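
If you want to put rough numbers on it, here's a quick Python sketch of the fault tolerance per level (my own toy example, not anything out of a PERC manual):

Code:
# Concurrent drive failures each RAID level can absorb, regardless of how
# many drives make up the array. Toy illustration, not tied to any controller.

FAULT_TOLERANCE = {
    "RAID0": 0,   # striping only, no redundancy
    "RAID1": 1,   # per mirrored pair
    "RAID5": 1,   # single parity
    "RAID6": 2,   # dual parity
}

def survives(level, failed_drives):
    """True if the array is still readable after this many drives fail at once."""
    return failed_drives <= FAULT_TOLERANCE[level]

print(survives("RAID5", 1))  # True  - one failure, rebuild possible
print(survives("RAID5", 2))  # False - the volume is gone
print(survives("RAID6", 2))  # True  - dual parity holds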

Originally Posted By: Subdued

So I'm a little confused why someone would take an image on a server just to proactively swap out some identical drives.


While not necessary, the ability to swap back in a fully working system is a HUGE timesaver. An O/S install takes an hour, installing the backup software takes a half hour, and restoring the backup catalog can take many hours, depending on the size of the volumes. Some reboots and some configuration later and then... you are ready to BEGIN restoring the original system.

Contrast that with: a) swap in the old HDD and boot, then b) run "restore".
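
Rough arithmetic on the two paths, with hours that are placeholders I picked for illustration (catalog restores in particular vary a lot by site):

Code:
# Time before the actual data restore can even begin, for each recovery path.
# All durations are illustrative assumptions, not measurements.

bare_metal_hours = {
    "O/S install": 1.0,
    "backup software install": 0.5,
    "restore backup catalog": 4.0,   # assumed; "many hours" in practice
    "reboots and configuration": 1.0,
}

image_swap_hours = {
    "swap in old HDD and boot": 0.25,
}

print("Bare metal: %.1f hours before the restore can start" % sum(bare_metal_hours.values()))
print("Image swap: %.2f hours before the restore can start" % sum(image_swap_hours.values()))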
Originally Posted By: Subdued

Granted I am assuming RAID, but IMO if you're not running some kind of RAID1/5/6 you don't really have a reliable server...

Agreed. If it doesn't have ECC and RAID it's not a Server.

I would have loved to run RAID 50 or another more redundant configuration, but older PERCs simply didn't support those modes, and even if they had, there simply weren't enough drive slots, not to mention we weren't working with 2TB drives back then. We needed all the space we could get. A typical server had 6 drive slots; we configured them as a 5-drive RAID 5 plus one hot spare. If you had 8 drive slots, then a mirrored boot pair, a 5-drive RAID 5, and one hot spare was a typical setup. Trying to run a 7-drive RAID 5 just drove up the chances of a simultaneous 2-drive failure.
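
The capacity math behind those layouts, with a drive size I picked just for illustration:

Code:
# Back-of-envelope usable capacity for the layouts described above.
# Drive size is an illustrative assumption, not what we actually ran.

DRIVE_GB = 146

def raid5_usable(drives, size_gb=DRIVE_GB):
    """RAID 5 gives up one drive's worth of space to parity: (n - 1) * size."""
    return (drives - 1) * size_gb

# 6-slot box: 5-drive RAID 5 + 1 hot spare
print(raid5_usable(5))   # 584 GB usable, spare standing by

# 8-slot box: mirrored boot pair + 5-drive RAID 5 + 1 hot spare
print(DRIVE_GB)          # 146 GB usable on the boot mirror
print(raid5_usable(5))   # 584 GB usable on the data array

# The tempting 7-drive RAID 5 buys more space, no spare, still only one-failure tolerant
print(raid5_usable(7))   # 876 GB usable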

And yes, simultaneous RAID drive failures occurred, 4 times over roughly 5 years for us, to my recollection. The first was with Maxtors, when the data center A/C failed during a holiday. After that it was buggy WD firmware, the TLER bug. Once WD fessed up and issued a firmware update, things got a lot more stable.
 
Originally Posted By: HangFire
Originally Posted By: Subdued
We're not talking about a desktop "server" here

Even 14 years ago Dell's PERC was a pretty darn good RAID controller.

Replace one drive, wait for rebuild, replace another drive, etc. and there would be literally no downtime

Uh... no. I have a lot of experience with 14 year old PERC Controllers.
There is no "another" nor any "etc". RAID 5 (and there was no RAID 6 in old PERCs back then) can only rebuild from a single drive failure, no matter how many drives make up the array. Thus, no "another". When RAID 6 came along that number rose to two, so there is no "etc". Even 10 years ago not every PERC sold offered RAID 6.


Are you telling me when a drive gets swapped out, and then rebuilds completely to restore back to a healthy array, no other drive can ever be swapped out and rebuilt again? Because that's the way you're presenting this, and I don't think that's accurate.

Once a rebuild has completely finished, another rebuild can certainly occur.

Unless those PERCs were actually that bad, but I don't remember that being the case.

I'm well aware of how RAID5 and RAID6 work.
 
Originally Posted By: Subdued
Are you telling me when a drive gets swapped out, and then rebuilds completely to restore back to a healthy array, no other drive can ever be swapped out and rebuilt again? Because that's the way you're presenting this, and I don't think that's accurate.


No, that's not what I'm presenting. I'm saying RAID 5 won't survive 2 drives failing simultaneously. RAID 6 won't survive 3 drives failing simultaneously.

You keep presenting an ideal case where one drive fails at a time and you have time to finish rebuilding the replacement before the next one fails. That can happen, but reality isn't always so kind.
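
For a back-of-envelope feel of why you can't count on the ideal case, here's a crude estimate; the failure rate and rebuild window are numbers I made up for illustration, and it assumes independent failures, which a dead A/C unit or a firmware bug will happily violate:

Code:
# Crude odds of a second drive dying while the first rebuild is still running.
# AFR and rebuild window are illustrative assumptions; failures are assumed
# independent, which correlated causes (heat, bad firmware batch) break badly.

AFR = 0.03            # assumed 3% annual failure rate per drive
REBUILD_HOURS = 12    # assumed rebuild window for one drive
HOURS_PER_YEAR = 24 * 365

def p_second_failure(surviving_drives):
    """Chance that at least one remaining drive fails during the rebuild."""
    p_one = AFR * (REBUILD_HOURS / HOURS_PER_YEAR)   # per-drive chance in the window
    return 1 - (1 - p_one) ** surviving_drives

# 5-drive RAID 5, one drive already dead, 4 left holding the data:
print("%.5f" % p_second_failure(4))
# Small per incident, but multiply by ~100 servers and years of rebuilds
# and it stops looking rare.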
 
I don't know, I have thousands of servers behind me right now, most running for years, and 99.99% of the time, that really is the case.
 
Not sure what you mean. The facts on RAID redundancy really aren't debatable.

As to the reality of failures, it happened to us 4 times over 5 years in a data center with only about 100 servers, but then we were very, very disk and RAID heavy. Failure #2 was interesting: IT was rebuilding one of their servers' RAID 5 after a single failed drive when another one went. End of volume, right then and there.
 
Originally Posted By: HangFire
Not sure what you mean. The facts on RAID redundancy really aren't debatable.

As to the reality of failures, it happened to us 4 times over 5 years in a data center with only about 100 servers, but then we were very, very disk and RAID heavy. Failure #2 was interesting: IT was rebuilding one of their servers' RAID 5 after a single failed drive when another one went. End of volume, right then and there.


I had that happen with a PACS server but was able to "revive" the 2nd failed drive by cooling it and keep it online long enough to get the rebuild done. It was pretty touch-and-go for a bit. This was also on an old DELL with a PERC controller. Drives were Seagate SAS 750GB models.
 
Originally Posted By: hotwheels
Originally Posted By: PhillipM
Running on SSD's, don't care any more



Too bad SSDs are still rather costly. How much is a 2TB SSD now? $4,000?

hotwheels


Hardly. I just put a 1TB SSD in my iMac for $475 (http://eshop.macsales.com/shop/SSD/OWC/Mercury_6G/). The Samsung 840 is about the same (http://www.amazon.com/Samsung-2-5-Inch-SATA-Internal-MZ-7TE1T0BW/dp/B00E3W16OU).

Buy 2, RAID0 them and be done with it. I think even Windows supports RAID0 in software.
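
If it helps picture what RAID 0 buys you (and what it doesn't), here's a toy striping map; the stripe size is an arbitrary assumption and real implementations vary:

Code:
# Toy illustration of RAID 0 striping across two SSDs. Stripe size is an
# arbitrary assumption. Note there is no redundancy: lose either drive
# and the whole volume is gone.

STRIPE_KB = 64
DISKS = 2

def locate(logical_kb):
    """Map a logical offset (KB) to (disk index, offset on that disk in KB)."""
    stripe = logical_kb // STRIPE_KB
    disk = stripe % DISKS
    offset = (stripe // DISKS) * STRIPE_KB + (logical_kb % STRIPE_KB)
    return disk, offset

# Consecutive 64 KB stripes alternate between the two drives, which is
# where the sequential throughput boost comes from:
for kb in (0, 64, 128, 192):
    print(kb, locate(kb))   # -> (0, 0), (1, 0), (0, 64), (1, 64)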

The performance increase on the iMac is night and day!
 
Originally Posted By: OVERKILL
Originally Posted By: HangFire
Not sure what you mean. The facts on RAID redundancy really aren't debatable.

As to the reality of failures, it happened to us 4 times over 5 years in a data center with only about 100 servers, but then we were very, very disk and RAID heavy. Failure #2 was interesting: IT was rebuilding one of their servers' RAID 5 after a single failed drive when another one went. End of volume, right then and there.


I had that happen with a PACS server but was able to "revive" the 2nd failed drive by cooling it and keep it online long enough to get the rebuild done. It was pretty touch-and-go for a bit. This was also on an old DELL with a PERC controller. Drives were Seagate SAS 750GB models.


Been there, done that, spent a weekend replacing failed RAID 5 drives until we got the system stabilized and then offloaded the data to another unit. In 15 years and probably 2000 servers it is the only time it happened.

Now we let the SAN folks deal with disks!
 
Originally Posted By: itguy08
Originally Posted By: OVERKILL
Originally Posted By: HangFire
Not sure what you mean. The facts on RAID redundancy really aren't debatable.

As to the reality of failures, it happened to us 4 times over 5 years in a data center with only about 100 servers, but then we were very, very disk and RAID heavy. Failure #2 was interesting: IT was rebuilding one of their servers' RAID 5 after a single failed drive when another one went. End of volume, right then and there.


I had that happen with a PACS server but was able to "revive" the 2nd failed drive by cooling it and keep it online long enough to get the rebuild done. It was pretty touch-and-go for a bit. This was also on an old DELL with a PERC controller. Drives were Seagate SAS 750GB models.


Been there, done that, spent a weekend replacing failed RAID 5 drives until we got the system stabilized and then offloaded the data to another unit. In 15 years and probably 2000 servers it is the only time it happened.

Now we let the SAN folks deal with disks!



Yep when it happens it seriously sucks. But it's very much the exception, and not the rule.

I've seen it 5 maybe 6 times in my 20 year career.
 
Originally Posted By: Subdued
Yep when it happens it seriously sucks. But it's very much the exception, and not the rule.

I've seen it 5 maybe 6 times in my 20 year career.


Things get more interesting when you've purchased boxes and boxes of enterprise drives with defective firmware. This was around 2004-2005.

For a limited time, WD shipped their WD4000YR and WD5000YR RE2 drives with TLER turned off, essentially turning them into desktop drives not suitable for RAID. The first time they spent a long time recovering a bad block instead of handing the problem back to the RAID controller, they would drop out of the RAID.
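
The mechanics, roughly: the RAID controller only waits a few seconds for a drive to answer before declaring it dead, and TLER caps how long the drive grinds on a bad sector so it answers in time. A simplified sketch, with timeout values that are ballpark guesses rather than WD's or Dell's actual numbers:

Code:
# Simplified model of why a drive without TLER gets kicked out of an array.
# Timeout values are ballpark assumptions, not vendor specifications.

CONTROLLER_TIMEOUT_S = 8      # how long the RAID controller waits for a reply
TLER_LIMIT_S = 7              # error-recovery cap with TLER enabled
DESKTOP_RECOVERY_S = 60       # a desktop drive may grind on a bad sector this long

def drive_dropped(tler_enabled, bad_sector_hit):
    """True if the controller marks the drive failed and drops it from the array."""
    if not bad_sector_hit:
        return False
    recovery = TLER_LIMIT_S if tler_enabled else DESKTOP_RECOVERY_S
    return recovery > CONTROLLER_TIMEOUT_S

print(drive_dropped(tler_enabled=True, bad_sector_hit=True))    # False: error handed back in time
print(drive_dropped(tler_enabled=False, bad_sector_hit=True))   # True: drive looks dead, array degrades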

A few weeks after getting a bunch of new RAID boxes up, volumes started dropping like flies after a DDT spray.

WD's firmware update fixed the problem, but required we pull the drives out of the RAID and put them in a desktop. That is how I spent a couple of weeks, pulling, updating, and replacing.

I just googled the issue and barely an echo remains behind. WD no longer lists the firmware update.
 
Originally Posted By: beanoil
Turn them on and leave them on. I've been repairing computers and print servers since Novell server ran on IBM DOS. The drives that get replaced most often are the ones that are not UPS protected and suffer abrupt power loss (the customer should supply a UPS, but the company I work for does not require one); second are those that are power cycled regularly. Least frequent are those that are left alone.


We have APC UPS units on all computers, the phone system, everything. We get an average of 420 power surges/sags annually here.
 