Google analysis of hard disk failures
Feb. 18th, 2007 08:49 am
Google has released the paper that I've been wanting to talk about for months. This is a really interesting read. They collected S.M.A.R.T. data from hard disks in use over several years and learned some very interesting things. Among them: running hard drives hot now and then doesn't seem to produce as many failures as one would think. If you really want to predict failure, well, you'll have to read the paper.
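One of the paper's headline findings is that certain S.M.A.R.T. counters (scan errors and sector reallocations, for instance) correlate strongly with failure, while temperature mostly doesn't. Here's a toy sketch of what acting on that might look like; the attribute names and the zero-tolerance threshold are my own illustration, not anything from the paper or from a real S.M.A.R.T. library:

```python
# Toy sketch: flag drives whose S.M.A.R.T. counters suggest elevated
# failure risk. The attribute names below are illustrative stand-ins
# for the counters the paper found predictive (scan errors,
# reallocations); the thresholds are hypothetical.

def at_risk(smart: dict) -> bool:
    """Return True if any predictive counter is nonzero."""
    predictive = (
        "scan_errors",
        "reallocated_sectors",
        "offline_reallocations",
        "probational_sectors",
    )
    return any(smart.get(attr, 0) > 0 for attr in predictive)

healthy = {"scan_errors": 0, "reallocated_sectors": 0, "temperature_c": 45}
suspect = {"scan_errors": 3, "reallocated_sectors": 12, "temperature_c": 38}

print(at_risk(healthy))  # False
print(at_risk(suspect))  # True
```

Note that the hot drive here is the healthy one: per the paper, a high temperature reading alone isn't a useful predictor, which is why this sketch ignores it.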
no subject
Date: 2007-02-18 02:57 pm (UTC)

Might explain why we've had so many hard drive failures over the past few years.
no subject
Date: 2007-02-18 04:19 pm (UTC)

Morgan Stanley ran their rooms just at the edge of what is usually considered to be a good temperature - 65F or so. So much equipment was failing that Sun paid for an environmental study that found several problems, one of which was that they were putting their Enterprise-class machines on open "dunkin donuts" style racks next to each other. As one went down the line, the side-vented heated exhaust of one system was being sucked into the next one. By the time you got to the fifth machine in the line, internal machine temperatures were routinely over 150F. I was shown one system board where the components had literally melted off the board. They ended up putting cardboard baffles between the systems to vent the air properly.
The Sunfire architecture has almost NEBS-compliant temperature resiliency built in - perhaps partially as a result of that problem, which ended up costing Sun a heinous amount of money in replaced hardware. So my thought is that there might be a brand-name association with drive heat resiliency as well.
no subject
Date: 2007-02-18 07:19 pm (UTC)