No Sudden Death of Disk

Language: en   2015-02-13

This post is about pending blocks on a hard disk. The disk's two-week agony was monitored with smartmontools and graphed with Munin.

Munin Graph from smart_ plugin

It started with two warning emails from smartd:


65528 Offline uncorrectable sectors

This email was generated by the smartd daemon running on:
  host name: XY
  DNS domain: example.org
  NIS domain: (none)
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], 65528 Offline uncorrectable sectors
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

65528 Currently unreadable (pending) sectors

This email was generated by the smartd daemon running on:
  host name: XY
  DNS domain: example.org
  NIS domain: (none)
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], 65528 Currently unreadable (pending) sectors
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

Although the reports said "No additional email messages about this problem will be sent.", I received a bunch of these mails over the next two weeks. The reason is that the state of these blocks toggled between "pending" and "ok" several times a day. This can be seen in the Munin graph, which is fed with smartmontools attribute values every five minutes. Have a look at the attributes Current Pending Sector and Offline Uncorrectable in the graph's legend. The plotted normalized values toggle between 100 (ok) and 1 (fail).
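To make the two attributes a bit more concrete, here is a minimal Python sketch that picks them out of text in the column layout printed by "smartctl -A". The sample text is hand-written for illustration and was not captured from my disk; only the raw value 65528 and the normalized value 1 come from the reports above, and the remaining columns are assumptions. The function name parse_attributes is made up for this example.

```python
# Hand-written sample in the column layout of `smartctl -A`.
# NOT real output from the disk in this post; flags, worst and
# threshold columns are assumed values for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       65528
198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       65528
"""

# SMART attribute IDs for pending and offline-uncorrectable sectors.
WATCHED = {197, 198}


def parse_attributes(text):
    """Return {name: {"value": normalized, "raw": raw}} for watched IDs."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED:
            result[fields[1]] = {
                "value": int(fields[3]),  # normalized value, e.g. 100 or 1
                "raw": int(fields[9]),    # raw sector count
            }
    return result


if __name__ == "__main__":
    for name, data in parse_attributes(SAMPLE).items():
        print(name, data["value"], data["raw"])
```

On a real system the text would come from running "smartctl -A /dev/sdb" (usually as root) instead of the SAMPLE string.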

After the first warning email I started an extended self-test via "smartctl -t long /dev/sdb". It completed successfully, and the pending blocks had vanished. Soon afterwards, however, I got the next warning email, and some hours later yet another. At first I assumed that the trigger for the flapping values could be the offline self-test that the disk itself runs every four hours. (smartd activates this kind of self-testing on the disk at startup.) But the values toggled more often than once every four hours.
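The comparison against the four-hour interval can be sketched in a few lines of Python. This is only an illustration under assumptions: the series below is synthetic (100 = ok, 1 = fail, one sample every five minutes) and is not the data behind the Munin graph, and the function names are invented for this example.

```python
SAMPLE_INTERVAL_MIN = 5        # polling interval of the graph, in minutes
OFFLINE_TEST_PERIOD_MIN = 240  # offline self-test every four hours


def toggle_times(values, interval_min=SAMPLE_INTERVAL_MIN):
    """Return the minute offsets at which the normalized value changed."""
    return [
        i * interval_min
        for i in range(1, len(values))
        if values[i] != values[i - 1]
    ]


def mean_gap_minutes(times):
    """Average distance between consecutive toggles, None if fewer than two."""
    if len(times) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


# Synthetic series for illustration only.
series = [100] * 6 + [1] * 12 + [100] * 6 + [1] * 12

if __name__ == "__main__":
    times = toggle_times(series)
    gap = mean_gap_minutes(times)
    print("toggles at minutes:", times)
    if gap is not None and gap < OFFLINE_TEST_PERIOD_MIN:
        print("values flap faster than the four-hour offline test")
```

If the average gap between toggles is clearly below 240 minutes, as it was in my case, the four-hour offline self-test alone cannot explain the flapping.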

This went on for two weeks, and in the end the disk died all of a sudden. Here is the last picture. <R.I.P.>

Munin Graph from smart_ plugin

Postscript: It turned out that another user of Seagate ST3000DM001 disks regularly got the exact same number that I reported above for Currently unreadable (pending) sectors. He said that the sectors disappear again just as they came. He was able to trace his problem to a firmware issue: only the disks with older firmware (in his case below CC4H) had this trouble. I looked up the information for my disk. It had very old firmware on board, version "CC4B":

scsi 1:0:0:0: Direct-Access     ATA      ST3000DM001-.. CC4B PQ: 0 ANSI: 5
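For anyone who wants to check several disks at once, here is a small Python sketch that pulls the firmware revision out of a kernel line like the one above and compares it with CC4H. It rests on two assumptions of mine: that the revision is the token directly before "PQ:", and that the revisions of this drive family sort like plain strings. Neither is confirmed by Seagate, and the function names are made up for this example.

```python
import re

# The kernel line quoted above, unchanged (including the elided model suffix).
KERNEL_LINE = (
    "scsi 1:0:0:0: Direct-Access     ATA      ST3000DM001-.. CC4B PQ: 0 ANSI: 5"
)


def firmware_revision(line):
    """Return the token right before 'PQ:' in a SCSI inquiry line, or None."""
    match = re.search(r"(\S+)\s+PQ:", line)
    return match.group(1) if match else None


def is_older_than(revision, fixed="CC4H"):
    """True if revision sorts before the fixed one.

    Assumption: revisions compare correctly as plain strings.
    """
    return revision < fixed


if __name__ == "__main__":
    rev = firmware_revision(KERNEL_LINE)
    if rev is not None and is_older_than(rev):
        print(rev, "is older than CC4H")
```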