[5799854.385960] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[5799854.385964] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[5799854.385968] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[5799854.385972] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 84 43 28 00 00 08 00
[5799854.385976] blk_update_request: I/O error, dev sdb, sector 8667947 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[5799855.795885] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[5799855.795889] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current]
[5799855.795893] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[5799855.795897] sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 00 84 43 28 00 00 08 00
[5799855.795901] blk_update_request: I/O error, dev sdb, sector 8667947 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[5799856.020331] md/raid:md0: read error corrected (8 sectors at 8667944 on dm-17)
After a couple of months the messages began to appear every day, whenever I performed the daily system updates. Although /proc/mdstat still showed the drive as "up", I figured it would be best to replace it sooner rather than later. Before doing that, though, I decided to play around with the S.M.A.R.T. capabilities of the drive to see whether they would diagnose it as being in a pre-failure state. To begin, I checked the health self-assessment with the -H option to smartctl:
hesse /home/frostsnow # smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.4.13] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Well, the health status said that it was passing, but I wasn't convinced. Perhaps it hadn't been tested? I next ran a short self-test:
hesse /home/frostsnow # smartctl -t short /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.4.13] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 2 minutes for test to complete.
Test will complete after Sun Mar 29 16:43:35 2020

Use smartctl -X to abort test.
The short test didn't show any errors, so, perplexed, I ran it a second time, which also showed no errors, and then decided to run the longer self-test:
hesse /home/frostsnow # smartctl -t long /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.4.13] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 36 minutes for test to complete.
Test will complete after Sun Mar 29 17:20:09 2020

Use smartctl -X to abort test.
Much to my satisfaction, the long test actually reported a read error:
hesse /home/frostsnow # smartctl -l selftest /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.4.13] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       40%     54064         865226
# 2  Short offline       Completed without error       00%     54064         -
# 3  Short offline       Completed without error       00%     54064         -
# 4  Short offline       Completed without error       00%         0         -
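Besides the self-test log, the attribute table is worth a glance: a drive that is throwing unrecovered read errors will usually show non-zero raw values for the reallocation and pending-sector counters. Something along these lines pulls out the relevant rows (the exact attribute names vary a little between vendors, and I'm not reproducing this particular drive's output here):

# the raw values in the last column are the ones to watch
smartctl -A /dev/sdb | grep -E 'Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable'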
Having now recorded an error during one of its self-tests, perhaps the health self-assessment would now report a failure?
hesse /home/frostsnow # smartctl -H /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-5.4.13] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Guess not. Perhaps, as the man page says, it will only report a failure if the drive has already failed or thinks it will fail within the next 24 hours. Unfortunately, 24 hours isn't sufficient notice for me: I don't have employees working daily to service my home computer, and I need about a week's warning so that I can plan for the next weekend. Nice as it would have been to get S.M.A.R.T. to declare that the drive needed replacing, I decided to go ahead and do the replacement anyways.
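In hindsight, smartd can at least be told to run these self-tests on a schedule and send mail when one fails, which surfaces read errors without waiting for the overall health flag to trip. A single line in /etc/smartd.conf along these lines would do it; the schedule and mail address here are illustrative rather than anything from my actual setup:

# monitor /dev/sdb: track all attributes, enable automatic offline testing,
# run a short self-test daily at 02:00 and a long one on Saturdays at 03:00,
# and mail root when a test or attribute check fails
/dev/sdb -a -o on -S on -s (S/../.././02|L/../../6/03) -m root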
The trick to removing and replacing the old drive was to do so as smoothly as possible, with the shortest downtime and the fewest gotchas. To this end I developed a step-by-step plan.
Simple enough, right? To identify the hard drive, I used the venerable hdparm program to grab the drive's serial number (redacted in the output below):
hesse /home/frostsnow # hdparm -i /dev/sdb

/dev/sdb:

 Model=Maxtor 6Y080M0, FwRev=YAR51HW0, SerialNo=REDACTED
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7936kB, MaxMultSect=16, MultSect=off
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156250000
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=yes: disabled (255)
 WriteCache=enabled
 Drive conforms to: ATA/ATAPI-7 T13 1532D revision 0:  ATA/ATAPI-1,2,3,4,5,6,7

 * signifies the current active mode
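As for the md side of the swap itself, it comes down to marking the old member as failed, removing it from the array, and adding the replacement so that md can rebuild onto it. The sketch below uses the md0 and sdb names from the logs above, but on my system the array members are actually device-mapper volumes (the dm-17 in the kernel messages), so the real commands targeted those; treat this as an outline rather than a transcript:

# double-check which physical disk carries which serial number
ls -l /dev/disk/by-id/
# mark the dying member as failed and pull it out of the array
mdadm --manage /dev/md0 --fail /dev/sdb1
mdadm --manage /dev/md0 --remove /dev/sdb1
# (power down, swap the hardware, partition the new disk to match)
# add the new member and watch the array rebuild
mdadm --manage /dev/md0 --add /dev/sdb1
watch cat /proc/mdstat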
From there it was a matter of methodically following the steps that I'd laid out. Generating a key, patching the initramfs, powering down the server, installing the new drive, booting the server—and, er, wait, what? See, my particular motherboard is interesting in that, when power is first applied, the fan starts running at maximum speed before turning off when the motherboard enters a standby state. Well, when I powered on the motherboard, it did the usual fan spin-up, but then, rather than idling, it seemed to cut out, then start again, then cut out, then start again, then cut out... Crap. Something was going wrong.
I then spent some time researching capacitors, since any new domain of knowledge, no matter how simplified, involves a number of considerations that one must take into account before successfully attempting a project in it. In this case, I learned the difference between through-hole technology and surface-mount technology; luckily, my board used through-hole technology. I also couldn't help but notice the unfortunate use of "your" in this old capacitor advertisement. Finding no other details relevant to my issue, I took a q-tip to clean my capacitors and tried to glean as much information as possible from them so that I could figure out what to replace them with. It turns out that all the capacitors were manufactured by Rubycon, model MBZ, and were spec'd for 3300uF at 6.3V with what is presumably a maximum temperature of 105 degrees Celsius. Well, I couldn't find the specified model listed on the Rubycon website, but I did find what is presumably a legitimate data sheet for the capacitor on a random website turned up by searching; the data sheet claimed that the capacitor had a 20% tolerance, which was the last piece of information I'd need to find a replacement.
Information in hand, I tried a couple of local stores, such as PCH Cables and Surplus Gizmos; none of them had what I needed. Surplus Gizmos could have sold me used motherboards with the same capacitors on them, but I was not keen to essentially double the complexity of my replacement project. URS Electronics also did not have any of the capacitors listed. The capacitor manufacturer itself seemed to deal only in bulk orders, which is understandable since individual capacitors are cheap and tend to be a bulk product. Local and direct options exhausted, I found someone on Amazon selling the exact same brand of capacitor in sets of 8 and decided to buy from them; the capacitors arrived in a few days, pleasantly beating my expectation of having to wait another week. For the soldering station I'd need, a frien-quaintance was kind enough to give me their old one, along with some solder, flux, and desoldering braid. By the time I'd acquired everything, it was about 9 p.m. on Sunday evening, a week after I had first powered off my server (so much for minimizing downtime). Heedless of the time, I wanted my board fixed that weekend and thus began working on it.
This was my second time soldering, ever, and I'd never desoldered before. Ever cautious, I slowly turned the heat on the station up to max after having no luck with the lower temperatures. Unsure how to use the braid to properly remove the solder with the capacitors still in place, I wiggled the capacitors out by heating one of the two holes at a time and then pulling on the capacitor; this eventually succeeded in getting the capacitors out, but left each hole full of solder and no good for inserting a new capacitor into. Now the real frustration began. I had a very difficult time getting the braid to remove the solder in the holes. After getting frustrated and doing some searching, I learned to apply flux to the braid; that seemed to help, but it wasn't enough to fully clean the holes and it left gunk all around them. After about 3-4 hours of this it was about 2 a.m. and I only had half of the six holes cleaned. I fell asleep exhausted and frustrated. The next morning I took a closer look at the soldering iron in the daylight and noticed that either there was some crud on part of the tip or it had been worn away; either way, a small part of the tip seemed to heat effectively while the other bit remained cool. Taking this into account, I attempted to desolder with the hot part of the tip at lunch. After a few attempts, I removed most of the solder from the fourth hole, and a bit more from the last two holes, which didn't have much in them anyways. Once the holes were cleared, I took care to align the capacitors and resolder them, which was much easier than desoldering, though most of them came out a little crooked.
Before powering the motherboard on, I attempted to remove the gunk on the board with some rubbing alcohol, though the shop rags I was using left traces of red fiber anyways. Now, at this point I was rather... hesitant about my chance for success. The board had made those terrible sounds thanks to that evil nut on the screw, I'd run a hot soldering iron against the board a number of times because I couldn't hold my hand steady (not to mention any remaining gunk I might have left on the board), and the capacitors being broken might have caused severe electrical damage to the board anyways. Nonetheless, the only thing to do was to try anyways. I found the motherboard product guide online so that I could figure out how to re-attach all of the basic connectors, plugged everything in, powered on the board, and watched as the fan ran, stopped... and stayed stopped. Was it... working? I powered off the board, re-attached the other connectors such as the VGA output and SATA cables, powered on the board and, amazingly, the motherboard was actually working again! Now I could get back to replacing that hard drive.
And that, dear readers, is how one actually fixes a RAID array: by replacing capacitors on the motherboard so the thing will actually boot again. Funny how none of the guides that I read ever mentioned that.