Sunday, February 10, 2013

Bad blocks, whatcha gonna do when they come for you

This is a horror story about a hard drive failing and keeping my data away from me.

This post got a lot bigger than I expected. TL;DR: broken HD; testdisk + fsck will save your data from bad blocks that corrupt your superblock.

Imagine you went to the cinema, and when you get back you want to do a nice pacman -Syu and see if the new KDE 4.10 has landed in the repos. You naturally go and power on your desktop, and what happens next is this, at boot time, just after udev starts trying to trigger events:

ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: BMDMA stat 0x24
ata2.00: failed command: READ DMA EXT
....
ata2.00: status: { DRDY ERR }
ata2.00: error: { UNC }

OK, something is wrong with the hard drive. It can't be that bad, since I was using the computer just a few hours ago and it shut down normally.

Start debugging

Let's try to boot the good old Archlinux 2010 install CD and see what happens. It turns out that booting from the live CD gives me just the same error. Mmm, that is odd. AFAIK the live CD should not touch the hard drive until I ask it to mount something.

Since I could not boot from the live CD, I wondered if I could see my data partition from my Windows system (I have a dual-boot system to be able to play Starcraft 2. Blizzard, please free me and give us a native Linux game!). I knew it was a long shot: if Linux can't boot, surely Windows will not boot either. Still, I was out of options, so I tried it. Surprisingly, it worked. Windows booted just normally and I was able to see my data partition with the ext3 driver. That is really odd. After browsing my files a little, the ext3 driver crashed and closed. No problem: now I know I can get to my files somehow and, more importantly, they are alive!

I started thinking that it must be a software error, something wrong in the kernel. So I went back to the live CD approach. Come on, I should be able to boot my live CD. I went into a loop: boot, see if something is wrong in the BIOS, change the boot parameters following some recommendation from the forums, try to boot the live CD, see the errors, try again. irqpoll, libata options, ACPI parameters, noapic, etc. Anything that would avoid reading the HD and let me boot the live CD.

Fortunately, in one of the iterations I was lazy and waited too long (like 5-10 minutes) before restarting the machine and, BOOM, the live CD booted. I still got the kernel errors in the buffer, but I could also see the good old rc script running and initializing the system. Nice, now I know I can boot my computer.

Now it is time for real debugging

So I went and downloaded a newer Archlinux ISO (2013-02-01) to have the latest version of every tool and a nicer resolution, since the old ISO predates KMS.

Let's try to figure out what all this output means.

My first approach was to search for the exception code. To my surprise, the good old "exception Emask 0x0 SAct 0x0 SErr 0x0 action" is a very generic error message. I found it with a lot of variations all over the internet, but none of them helped me fix my issue.

I found a great page from the libata guys, "Libata error messages", which explains exactly what every bit in the error message means. So I realized that DRDY was a good thing (the drive was ready), but ERR means (yeah, you can guess) that there is an error set in the registers. The error was UNC, which means "Uncorrectable error - often due to bad sectors on the disk". We are fried now: my HD is broken and I have lost all my information. At least now I know the problem.
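To make the bit-by-bit reading concrete, here is a minimal sketch of decoding the two registers. The masks are the standard ATA status and error bits (only a subset is listed); the raw values 0x41 and 0x40 are my assumption for illustration, since the boot log above only shows the decoded names.

```python
# Subset of the ATA status register bits.
STATUS_BITS = {
    0x80: "BSY",   # drive is busy
    0x40: "DRDY",  # drive is ready
    0x20: "DF",    # device fault
    0x08: "DRQ",   # data request
    0x01: "ERR",   # the error register holds the details
}

# Subset of the ATA error register bits.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable media error
    0x10: "IDNF",  # sector ID not found
    0x04: "ABRT",  # command aborted
    0x01: "AMNF",  # address mark not found
}


def decode(value, table):
    """Return the names of all flags from `table` that are set in `value`."""
    return [name for mask, name in table.items() if value & mask]


# Hypothetical raw register values matching the decoded log lines.
print(decode(0x41, STATUS_BITS))
print(decode(0x40, ERROR_BITS))
```

Read this way, the status says "ready, but look at the error register", and the error register points at the media itself rather than the cable or the controller.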

In a forum, I saw that there is a nice tool that tells you if there are broken blocks on your HD, so I tried it: badblocks /dev/sda. After about 4.5 hours I got the answer: 16 bad blocks.
"Bad blocks, bad blocks, whatcha gonna do, whatcha gonna do when they come for you"
Then I tried the smartctl tests to see if they could fix the issue or give me more information. The long test would take just... like 5-6 hours to complete, what a pain. Not having anything else in mind, I ran smartctl -t long and went to watch a TV series for a while. I set up a nice watch command to show me the output of smartctl -l selftest and follow the progress.

From the output of smartctl I figured out that it was showing me the last broken block it had found, and after comparing it with my fdisk -l output I noticed that it was the /dev/sda3 offset + 2. Holy! This must mean something.
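The comparison I did by eye can be sketched in a few lines. The partition boundaries below are made-up numbers, not my real table, and I am assuming 512-byte sectors and that both smartctl and fdisk -l report positions in sectors.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors

# Hypothetical partition table: (name, first sector, last sector).
PARTITIONS = [
    ("/dev/sda1", 2048, 104859647),
    ("/dev/sda2", 104859648, 105064447),
    ("/dev/sda3", 105064448, 146024447),
    ("/dev/sda4", 146024448, 976773167),
]


def locate(lba, partitions=PARTITIONS):
    """Return (partition name, sector offset inside it) or None."""
    for name, start, end in partitions:
        if start <= lba <= end:
            return name, lba - start
    return None


# A failing LBA two sectors past the (hypothetical) start of sda3.
name, relative_sector = locate(105064448 + 2)
byte_offset = relative_sector * SECTOR_SIZE
print(name, relative_sector, byte_offset)
```

With 512-byte sectors, sector 2 of the partition is byte 1024, and that is exactly where an ext2/ext3 filesystem keeps its primary superblock. So "offset + 2" is about the worst place a bad block could land.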

Let me recap a little here. I have my disk partitioned this way:
/dev/sda1  NTFS
/dev/sda2  /boot
/dev/sda3 /
/dev/sda4 /home
I was able to boot Windows from the NTFS partition, nice.
I was able to mount /dev/sda2 and see the content, nice.
I was not able to mount /dev/sda3 or /dev/sda4, and the broken block was right at the beginning of sda3.
A trend was starting to show.

I tried debugfs to read the block and let the HD firmware blacklist it, but it kept failing with a weird error:
debugfs: open /dev/sda3
/dev/sda3: Bad magic number in super-block while opening filesystem
What does that mean? Well, I had no idea. The forums say that I may want to try fsck. Let's do it, it can't be that bad. And the good old fsck fails with another weird error:
fsck.ext3: Attempt to read block from filesystem resulted in short read while trying to open /dev/sda3
Could this be a zero-length partition?

That must be wrong: I have just run fdisk -l on it, and I know it is a complete partition.
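For the curious, the "Bad magic number" complaint from debugfs is about a two-byte signature inside the superblock. An ext2/ext3 superblock starts 1024 bytes into the filesystem and stores the value 0xEF53 (little-endian) at offset 56. Here is a rough sketch of that check; the function name and the in-memory fake images are mine, for illustration only.

```python
import io
import struct

SUPERBLOCK_OFFSET = 1024  # the primary superblock starts 1024 bytes in
MAGIC_OFFSET = 56         # position of the magic field inside the superblock
EXT_MAGIC = 0xEF53        # signature shared by ext2, ext3 and ext4


def has_ext_magic(stream):
    """Check a binary stream (device or image) for the ext magic number."""
    stream.seek(SUPERBLOCK_OFFSET + MAGIC_OFFSET)
    raw = stream.read(2)
    if len(raw) < 2:
        return False  # short read: nothing usable at that position
    return struct.unpack("<H", raw)[0] == EXT_MAGIC


# Fake images: one with the signature in place, one zeroed out.
good = io.BytesIO(b"\x00" * 1080 + b"\x53\xef" + b"\x00" * 942)
bad = io.BytesIO(b"\x00" * 2048)
print(has_ext_magic(good), has_ext_magic(bad))
```

On a real failing disk you would pass open("/dev/sda3", "rb") instead, and the read itself may raise an I/O error when the sector is unreadable, which is roughly the situation the tools above were complaining about.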

A forum post mentioned a tool named "testdisk". Being a hopeless guy, I ran to man to see what that tool is. The description says "Scan and repair disk partitions", which sounds useful. After reading the short man page I decided to try it out: testdisk /dev/sda3. And magic! It was able to tell me what was going wrong with my partition, it found the copies of the broken superblock, and it even told me the exact command I needed to fix my problem:
fsck.ext3 -b <blockaddress> -B <blocksize>.

I ran that nice and beautiful command and suddenly I got my data back!
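testdisk found the backup copies by scanning, but their usual positions can also be predicted. The sketch below assumes a filesystem created with default mke2fs settings: the sparse_super feature (backups only in group 1 and in groups that are powers of 3, 5 and 7) and 8 * block_size blocks per group. A filesystem made with other options will not match, so treat the output as candidates to try, not as the truth.

```python
def backup_groups(group_count):
    """Block groups that carry a superblock backup under sparse_super."""
    groups = {1} if group_count > 1 else set()
    for base in (3, 5, 7):
        group = base
        while group < group_count:
            groups.add(group)
            group *= base
    return sorted(groups)


def backup_superblocks(block_size, total_blocks):
    """Candidate block numbers for the -b option, assuming mke2fs defaults."""
    blocks_per_group = 8 * block_size  # one bitmap block covers one group
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division: number of block groups in the filesystem.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)
    return [first_data_block + g * blocks_per_group
            for g in backup_groups(group_count)]


# Hypothetical filesystem: 4096-byte blocks, 10 full block groups.
print(backup_superblocks(4096, 10 * 32768))
```

For 4096-byte blocks the first candidate is block 32768, and for 1024-byte blocks it is 8193. Running mke2fs -n on the partition (the -n makes it a dry run that writes nothing) or dumpe2fs on a readable filesystem gives the actual list.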

Conclusion

What I have learnt here, first of all:

  • YOU MUST HAVE BACKUPS.
  • Testdisk is your friend.
  • fsck is your friend.
  • Archlinux ISO is your friend.
  • *nix is your friend.
  • Hard drives tend to break.

Also, it is important to notice the wide variety of tools we have to help us get our data back. A small physical breakage is not always the end of the world; you can recover from it if you have the patience to read a ton of forums and man pages and dedicate some time to the adventure.

I was happy to learn that ext3 has this redundant structure (copies of the superblock all over the fs) to help us recover from breakage. I love it. I don't know how other filesystems do the trick, but I am really happy I am using ext3.

Finally, I would like to thank the Archlinux team for giving me a really powerful and nice live CD to help me on this painful trip.

Now I have my system back; I can listen to my music and use my files. It is time to set up a redundancy plan to avoid panicking again over a bad hard drive. But that will be the project for next week.

2 comments:

  1. md raid has saved my ass on numerous occasions... like right now, finally fixing my ~ which was down a drive ;)

    1. Yeah, I'm planning to do a RAID1 md clone for my ~ partition, but I was not able to get a HD equal to the one I have. I got a bigger one, with a 4096-byte sector size instead of 512.

      Will try to figure it out this week.
