smart checking

Checking S.M.A.R.T. values automatically for a bunch of disks

Alright, first of all we’ll get the overall health status:

smartctl -H /dev/sda | awk '/result: /{print $6}'
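The one-liner prints nothing at all when smartctl emits no "result:" line, which is easy to miss in a script. A minimal sketch of a more defensive variant, assuming a hypothetical helper named health_of that reads the smartctl -H output on stdin and falls back to UNKNOWN (the sample line stands in for a real smartctl call):

```shell
#!/bin/bash
# Hypothetical helper (not part of smartctl): reads `smartctl -H` output on
# stdin and prints the verdict, or UNKNOWN if no "result: " line is present.
health_of() {
  awk '/result: /{print $6; found=1} END{if (!found) print "UNKNOWN"}'
}

# Sample line in the format the one-liner above expects.
sample='SMART overall-health self-assessment test result: PASSED'
printf '%s\n' "$sample" | health_of   # prints PASSED

# On a real system this would be: smartctl -H /dev/sda | health_of
```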

Then, smartctl -a displays an attribute table with a few columns. Let’s take a look at VALUE and THRESH. In general, VALUE should not drop below THRESH.

# tail skips the header lines before the attribute rows, head drops the trailing blank line
while read -r line; do
  title=$(echo "$line" | awk '{print $2}')
  # bc strips leading zeros (e.g. 010 becomes 10)
  thresh=$(echo "$line" | awk '{print $6}' | bc)
  worst=$(echo "$line" | awk '{print $5}' | bc)
  value=$(echo "$line" | awk '{print $4}' | bc)

  if [ "$value" -lt "$thresh" ]; then
    echo "$title: value($value) is less than thresh($thresh)"
  fi
done < <(smartctl -A /dev/sda | tail -n +8 | head -n -1)
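The same comparison can also be done in a single awk pass, which avoids counting header lines with tail and head. This is only a sketch of an alternative, not the approach used in the rest of this post: below_thresh is a hypothetical helper name, and the failing row in the sample is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reads `smartctl -A` output on stdin and reports every
# attribute whose normalized VALUE is below THRESH.
below_thresh() {
  # Attribute rows start with a numeric ID and have at least 10 fields;
  # adding 0 forces a numeric comparison, so "010" is treated as 10.
  awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($4 + 0) < ($6 + 0) {
    printf "%s: value(%d) is less than thresh(%d)\n", $2, $4, $6
  }'
}

# Sample table: the first attribute row is invented to show a failing case.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   005   005   010    Pre-fail  Always   FAILING_NOW 2040
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13982'

printf '%s\n' "$sample" | below_thresh
# prints: Reallocated_Sector_Ct: value(5) is less than thresh(10)
```

On a real system the input would come from smartctl -A /dev/sda instead of the sample text.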

Now there are a few entries which I’d check apart from the above:

  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13982
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   111   106   000    Old_age   Always       -       36
194 Temperature_Celsius     0x0022   033   043   000    Old_age   Always       -       33 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0

I’d check that the disk has been powered on for less than 4 years. I’d check that reallocated sector count, runtime bad blocks, reported uncorrectable errors, current pending sectors and offline uncorrectable all have a raw value of 0. I’d check that the temperature is between 25 and 45°C. Enhancing the above to:

while read -r line; do
  # bc strips leading zeros (e.g. 010 becomes 10)
  id=$(echo "$line" | awk '{print $1}' | bc)
  title=$(echo "$line" | awk '{print $2}')
  thresh=$(echo "$line" | awk '{print $6}' | bc)
  worst=$(echo "$line" | awk '{print $5}' | bc)
  value=$(echo "$line" | awk '{print $4}' | bc)
  raw=$(echo "$line" | awk '{print $10}' | bc)

  if [ "$value" -lt "$thresh" ]; then
    echo "$title: value($value) is less than thresh($thresh)"
  fi

  # error counters whose raw value should stay at zero
  if [ "$id" -eq 5 ] || [ "$id" -eq 183 ] || [ "$id" -eq 187 ] || [ "$id" -eq 197 ] || [ "$id" -eq 198 ]; then
    if [ "$raw" -gt 0 ]; then
      echo "$title: raw value($raw) is greater than zero"
    fi
  fi

  # Power_On_Hours: convert hours to whole years
  if [ "$id" -eq 9 ]; then
    years=$(echo "scale=0; $raw / 24 / 365" | bc)
    if [ "$years" -ge 4 ]; then
      echo "$title: disk is older($years) than 4 years"
    fi
  fi

  # Temperature_Celsius: the raw value is the temperature in °C
  if [ "$id" -eq 194 ]; then
    if [ "$raw" -ge 45 ]; then
      echo "$title: disk is hotter($raw°C) than 45°C"
    elif [ "$raw" -le 25 ]; then
      echo "$title: disk is colder($raw°C) than 25°C"
    fi
  fi
done < <(smartctl -A /dev/sda | tail -n +8 | head -n -1)
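To try the raw-value rules without a disk at hand, they can be pulled into a function and fed numbers directly. A minimal sketch, assuming a hypothetical helper check_attr that takes the ID, the attribute name and a plain decimal raw value (no leading zeros), and uses shell arithmetic instead of bc:

```shell
#!/bin/bash
# Hypothetical helper: applies the raw-value rules to a single attribute.
# Arguments: ID, attribute name, raw value (plain decimal).
# Prints a warning and returns 1 if a rule is violated, otherwise returns 0.
check_attr() {
  id=$1; title=$2; raw=$3
  case "$id" in
    5|183|187|197|198)            # error counters, should be zero
      if [ "$raw" -gt 0 ]; then
        echo "$title: raw value($raw) is greater than zero"
        return 1
      fi ;;
    9)                            # power-on hours, integer division to years
      years=$(( raw / 24 / 365 ))
      if [ "$years" -ge 4 ]; then
        echo "$title: disk is older($years) than 4 years"
        return 1
      fi ;;
    194)                          # temperature in °C
      if [ "$raw" -ge 45 ]; then
        echo "$title: disk is hotter($raw°C) than 45°C"
        return 1
      elif [ "$raw" -le 25 ]; then
        echo "$title: disk is colder($raw°C) than 25°C"
        return 1
      fi ;;
  esac
  return 0
}

check_attr 9 Power_On_Hours 13982            # silent: 13982 / 24 / 365 = 1
check_attr 9 Power_On_Hours 35040 || true    # warns: 35040 hours = 4 years
check_attr 197 Current_Pending_Sector 8 || true
```

With integer division, 35040 hours (4 * 365 * 24) is the first value that triggers the age warning.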

Now, let’s put that into a script:

#!/bin/bash

while read -r disk; do
  ret=0
  echo "~~ checking ${disk} ~~"
  # quoting $health keeps the test valid even if smartctl printed no result line
  health=$(smartctl -H "${disk}" | awk '/result: /{print $6}')
  if [ "$health" != "PASSED" ]; then
    echo "Check the disk, it failed the overall smart health check..."
    ret=1
  fi

  while read -r line; do
    # bc strips leading zeros (e.g. 010 becomes 10)
    id=$(echo "$line" | awk '{print $1}' | bc)
    title=$(echo "$line" | awk '{print $2}')
    thresh=$(echo "$line" | awk '{print $6}' | bc)
    worst=$(echo "$line" | awk '{print $5}' | bc)
    value=$(echo "$line" | awk '{print $4}' | bc)
    raw=$(echo "$line" | awk '{print $10}' | bc)

    if [ "$value" -lt "$thresh" ]; then
      echo "$title: value($value) is less than thresh($thresh)"
      ret=1
    fi

    # error counters whose raw value should stay at zero
    if [ "$id" -eq 5 ] || [ "$id" -eq 183 ] || [ "$id" -eq 187 ] || [ "$id" -eq 197 ] || [ "$id" -eq 198 ]; then
      if [ "$raw" -gt 0 ]; then
        echo "$title: raw value($raw) is greater than zero"
        ret=1
      fi
    fi

    # Power_On_Hours: convert hours to whole years
    if [ "$id" -eq 9 ]; then
      years=$(echo "scale=0; $raw / 24 / 365" | bc)
      if [ "$years" -ge 4 ]; then
        echo "$title: disk is older($years) than 4 years"
        ret=1
      fi
    fi

    # Temperature_Celsius: the raw value is the temperature in °C
    if [ "$id" -eq 194 ]; then
      if [ "$raw" -ge 45 ]; then
        echo "$title: disk is hotter($raw°C) than 45°C"
        ret=1
      elif [ "$raw" -le 25 ]; then
        echo "$title: disk is colder($raw°C) than 25°C"
        ret=1
      fi
    fi
  done < <(smartctl -A "${disk}" | tail -n +8 | head -n -1)

  # red summary line for a flagged disk, green otherwise
  if [ "$ret" -eq 1 ]; then
    echo -e "  - \e[91mcheck ${disk} manually and monitor it closely.\e[39m"
  else
    echo -e "  + \e[92meverything is fine with ${disk}\e[39m"
  fi
done < <(ls /dev/sd[a-z])

First run:

~~ checking /dev/sda ~~
  + everything is fine with /dev/sda
~~ checking /dev/sdb ~~
  + everything is fine with /dev/sdb
~~ checking /dev/sdc ~~
  + everything is fine with /dev/sdc
~~ checking /dev/sdd ~~
  + everything is fine with /dev/sdd
~~ checking /dev/sde ~~
  + everything is fine with /dev/sde
~~ checking /dev/sdf ~~
  + everything is fine with /dev/sdf
~~ checking /dev/sdg ~~
  + everything is fine with /dev/sdg
~~ checking /dev/sdh ~~
  + everything is fine with /dev/sdh
~~ checking /dev/sdi ~~
  + everything is fine with /dev/sdi
~~ checking /dev/sdj ~~
  + everything is fine with /dev/sdj
~~ checking /dev/sdk ~~
  + everything is fine with /dev/sdk
~~ checking /dev/sdl ~~
  + everything is fine with /dev/sdl
~~ checking /dev/sdm ~~
  + everything is fine with /dev/sdm
~~ checking /dev/sdn ~~
  + everything is fine with /dev/sdn
~~ checking /dev/sdo ~~
  + everything is fine with /dev/sdo
~~ checking /dev/sdp ~~
  + everything is fine with /dev/sdp
~~ checking /dev/sdq ~~
  + everything is fine with /dev/sdq
~~ checking /dev/sdr ~~
  + everything is fine with /dev/sdr

Looks fine, hm? I’m not saying that you should stop using a disk as soon as this reports something bad. I’m just showing how you could check a bunch of disks quite fast. Extend it.
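One possible extension is an exit status, so that cron or a monitoring job can react when a disk gets flagged. A rough sketch, assuming a hypothetical filter report_status that reads the report on stdin, passes it through unchanged and returns 1 if any line starts with the "  - " marker used for flagged disks (the sample report below is shortened and made up):

```shell
#!/bin/bash
# Hypothetical filter: echoes the report from stdin and returns 1 if at
# least one disk was flagged (a line starting with "  - "), else 0.
report_status() {
  rc=0
  while IFS= read -r line; do       # IFS= keeps the leading spaces intact
    printf '%s\n' "$line"
    case "$line" in
      "  - "*) rc=1 ;;
    esac
  done
  return "$rc"
}

# Shortened sample report standing in for the script's real output.
sample='~~ checking /dev/sda ~~
  + everything is fine with /dev/sda
~~ checking /dev/sdb ~~
  - check /dev/sdb manually and monitor it closely.'

if printf '%s\n' "$sample" | report_status; then
  echo "all disks fine"
else
  echo "at least one disk needs attention"
fi
```

In practice the script's output would be piped into the filter in place of the sample text.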
