Disk Testing Procedure

New Disks

Upon receiving a new disk we do the following:

  1. If this is a new machine we are setting up, we boot d-i (the Debian installer), configure the network, get to a shell, and download our tools (smartctl, bonnie++, time). If this is an existing machine, we just apt-get install the tools.
  2. For SSDs, ensure that firmware is up to date
  3. Run smartctl tests
    1. Check the health of the drive
      # smartctl -H /dev/sdX
    2. Make sure there are no existing errors by reading the error and selftest logs
      # smartctl -l error /dev/sdX
      # smartctl -l selftest /dev/sdX
      
    3. Run the short test and check its results after a couple of minutes
      # smartctl -t short /dev/sdX
      # smartctl -l selftest /dev/sdX
      
    4. Run the long test and check its results after a couple of hours
      # smartctl -t long /dev/sdX
      # smartctl -l selftest /dev/sdX
      
    5. Read the SMART values and make sure nothing is wrong. Check “Reallocated_Sector_Ct” (the number of sectors the drive has had to reallocate; a high value means the drive has used up a lot of its spare sectors).
      # smartctl -A /dev/sdX
      
  4. Run a timed badblocks on the disk, and compare the time to existing results for this hardware.
    # time badblocks -s -v -w -b 4096 -c 10240 /dev/sdX
    
  5. If you need performance numbers for the drive: Create a partition and filesystem and run bonnie++ (v1.96 or newer) on the disk and compare the results to existing results for this hardware. I usually run bonnie++ 3 times.
    # fdisk /dev/sdX
    # mke2fs -j /dev/sdX1
    # mount /dev/sdX1 /mnt
    # cd /mnt
    # bonnie++ -u 0 -s 16G -n 512
    

    (These bonnie++ settings are what is currently needed to get useful output on the current generation of SSDs, and they work fine on modern HDDs too. For older HDDs you might want to use smaller values. Above all, use the settings for which you already have results on similar disks, so that the numbers are comparable.) UPDATE: micah is working on automating these tests.
  6. For SSDs do a SATA secure erase in order to reset performance.
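
The SMART checks in step 3 can be partly scripted. What follows is only a minimal sketch, not our actual tooling: the function names smart_health_passed and reallocated_count are made up for illustration, and the parsing assumes the usual ATA output layout of smartctl -H and smartctl -A (raw value in the tenth column). SCSI/SAS and NVMe devices print different text, so this would need adjusting for them.

```shell
# Hypothetical helpers; the names and parsing rules are assumptions,
# not something shipped with smartmontools.

# Succeeds if `smartctl -H` output (read from stdin) reports PASSED.
smart_health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Prints the raw Reallocated_Sector_Ct value from `smartctl -A` output on stdin.
reallocated_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# Example use against a real disk (as root, replace sdX):
#   smartctl -H /dev/sdX | smart_health_passed || echo "health check failed"
#   smartctl -A /dev/sdX | reallocated_count
```

Both helpers read from stdin so they can be tried on saved smartctl output without touching a disk.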

Old Disks

When reusing an old disk, use the same procedure as for new disks, but pay careful attention to the starting and finishing “Reallocated_Sector_Ct” and make sure it’s not too high (if it goes up a little that’s OK). Assuming it tests OK, put a sticker on the top of the drive that indicates when badblocks was last run.
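
Comparing the starting and finishing counts can be sketched as below. This is an illustration under assumptions: check_realloc is a hypothetical helper, and the two limits are placeholders, since this page does not define how high is “too high” or how much growth counts as “a little”.

```shell
# Hypothetical helper: compares Reallocated_Sector_Ct before and after testing.
# Usage: check_realloc BEFORE AFTER MAX_TOTAL MAX_GROWTH
# MAX_TOTAL and MAX_GROWTH are placeholders; choose your own tolerances.
check_realloc() {
    before=$1
    after=$2
    max_total=$3
    max_growth=$4
    growth=$(( after - before ))
    echo "reallocated sectors: $before -> $after (growth $growth)"
    if [ "$after" -gt "$max_total" ] || [ "$growth" -gt "$max_growth" ]; then
        echo "REJECT"
        return 1
    fi
    echo "OK"
}

# Example: record the raw value before and after the badblocks run, e.g. with
#   smartctl -A /dev/sdX | awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
# and then call:  check_realloc 3 5 100 10
```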

For Hitachi disks, run the Hitachi ‘Drive Fitness Tool’ to ensure Hitachi thinks the disk is OK. If it fails, RMA the disk.

For SSDs, ensure that firmware is up to date and do a SATA secure erase in order to reset performance.

Disposing of Disks

  • If a disk works but is no longer useful, run badblocks on it to securely wipe the data. If it tests OK then put a sticker on the drive indicating no bad blocks and the date and take it to the recycling place or give it to someone who can use it (although usually by the time we don’t want it, nobody else does either). If badblocks completed, but indicated problems with the disk, write BAD on the disk. We add these notes to help anyone who might try to use the disk after us, in the hope that it will get used again rather than just recycled.
  • If a disk isn’t working well enough to run badblocks, destroy the disk to ensure the data is inaccessible. Taking the disk apart is time-consuming, but you get some cool magnets out of it and maybe the shiny platters could be used as decorations or in an art project. RePC has a disk punch that puts a big hole in the disk, but they won’t let you watch them punch the hole, which defeats the whole purpose of ensuring the data is destroyed.

If the disk is ATA it may support the “secure erase” command. Here are instructions for how to use it.

NOTE: Since we aren’t timing our badblocks runs, it’s OK to do them in parallel with other disks.

WARNING: even after running badblocks, there is some risk of data still being on the drive. If sectors are reallocated by the drive, the old sectors that are marked bad may still be readable and contain data. The ATA secure erase procedure is supposed to attempt to delete these as well.
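
For reference, ATA secure erase is commonly issued with hdparm. The sketch below is an assumption about how one might do it, not a record of our procedure: the function names and the throwaway password are invented for illustration, and the “not frozen” check relies on the Security section that hdparm -I normally prints. It destroys all data on the target, so double-check the device name.

```shell
# Sketch only: hdparm-based ATA secure erase. DESTROYS ALL DATA on the target.
# Function names and the temporary password are illustrative assumptions.

# Succeeds if `hdparm -I` output (stdin) lists the security state "not frozen".
security_not_frozen() {
    grep -Eq '^[[:space:]]*not[[:space:]]+frozen'
}

# Usage: ata_secure_erase /dev/sdX
ata_secure_erase() {
    dev=$1
    pass=tmppass   # throwaway password; normally cleared by a successful erase
    if ! hdparm -I "$dev" | security_not_frozen; then
        echo "$dev is frozen or reports no ATA security state" >&2
        return 1
    fi
    # Setting a user password enables the ATA security feature set,
    # which the drive requires before it accepts the erase command.
    hdparm --user-master u --security-set-pass "$pass" "$dev" || return 1
    hdparm --user-master u --security-erase "$pass" "$dev"
}
```

Many BIOSes freeze the security state at boot; a suspend/resume cycle or hot-plugging the drive is the usual workaround when the check fails.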

 

One note about smartctl. You are including the type of disk in the commands above (-d ata). This typically does not need to be included as smartctl is able to detect the proper type from the controller information. Not a big deal to include it, except that we may have other types of disks that we will need to test, in which case the -d ata would be wrong.

Also, what do you do about disks that are not included in the smart database? There is a way to submit them, right?

 
 

I found somewhere that you can use -c 98304 for a machine with a gig of RAM. The larger size will speed up the process. I’m not sure I fully understand, but if we are working with 4096 byte blocks, and we do 98304 of them at the same time, that calculates out to: 4096bytes*98304blocks = 402653184 which is 384 megabytes. I bet we could do 3x that amount.
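
The arithmetic above can be checked with shell integer math. This is just a small sketch; buffer_mib is a hypothetical helper name, and it only computes block size times block count, which may understate what badblocks really allocates in some test modes.

```shell
# Size in MiB of block_size * block_count bytes (hypothetical helper).
# Usage: buffer_mib BLOCK_SIZE_BYTES BLOCK_COUNT
buffer_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

buffer_mib 4096 98304     # prints 384, the figure worked out above
buffer_mib 4096 10240     # prints 40, the -c value used in the procedure
buffer_mib 4096 294912    # prints 1152, i.e. three times the 98304 count
```

Note that the tripled count already comes to 1152 MiB, which is more than a machine with a gig of RAM has, so 3x is probably too optimistic there.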

 
 

The badblocks defaults are horribly low. I picked "-b 4096 -c 10240" as something slightly better, but we could turn it up if we know it helps.

 
 
Run a timed badblocks on the disk, and compare the time to existing results for this hardware.

Where does one find the “existing results for this hardware”?

 
 

I was talking to people at Koumbit, and what they do is have a DBAN image available over TFTP boot which they use to wipe disks. That might be a nice way to do it semi-automatically.

Would be nice if DBAN automatically did SATA Secure Erase in part of its process!

 
 

DBAN does scare me a little because my understanding is that once you boot it, it nukes things (but I haven’t tried it).

I’ve been using our debirf images to do these things and my plan is to add some scripts that automate it.

 
   

Any ideas where to find “existing results for this hardware”? Maybe we can start a wiki page that we can update?

edit: i started a page here.