Checking S.M.A.R.T. values automatically on a bunch of disks
Alright, first of all we’ll get the overall health status:
```bash
smartctl -H /dev/sda | awk '/result: /{print $6}'
```
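The result line parsed above only exists in the ATA output; SCSI/SAS disks word their health line differently. As an alternative, smartctl also encodes its findings in its exit status as a bit mask, documented in the smartctl(8) man page. A minimal sketch that decodes it; the function name describe_smartctl_status is hypothetical, the messages paraphrase the man page, and which bits can be set depends on which checks you asked smartctl to run (hence -a in the usage comment):

```bash
#!/usr/bin/env bash
# Sketch: decode smartctl's exit status, a bit mask described in smartctl(8).
# describe_smartctl_status is a hypothetical helper; messages paraphrase the man page.

describe_smartctl_status() {
    local status=$1 bit found=0
    local -a meanings=(
        "command line did not parse"
        "device open failed"
        "a SMART or other ATA command failed, or a checksum error occurred"
        "SMART status check returned DISK FAILING"
        "prefail attributes at or below threshold"
        "SMART status OK, but usage/prefail attributes were at or below threshold in the past"
        "device error log contains errors"
        "self-test log contains errors"
    )
    # test each of the 8 bits and report the ones that are set
    for bit in 0 1 2 3 4 5 6 7; do
        if (( status & (1 << bit) )); then
            echo "bit $bit: ${meanings[bit]}"
            found=1
        fi
    done
    if (( found == 0 )); then
        echo "no problems reported"
    fi
}

# Real use (as root): smartctl -a /dev/sda > /dev/null; describe_smartctl_status $?
describe_smartctl_status 8    # prints: bit 3: SMART status check returned DISK FAILING
```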
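Next, smartctl -A prints the vendor-specific attribute table (smartctl -a includes it too). The columns to look at are VALUE, the current normalized value, and THRESH, the vendor's failure threshold. In general, VALUE should not drop below THRESH.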
```bash
# report every attribute whose normalized VALUE is below its THRESH
while read -r line; do
    title=$(echo "$line" | awk '{print $2}')
    thresh=$(echo "$line" | awk '{print $6}' | bc)   # bc also strips leading zeros: 010 -> 10
    value=$(echo "$line" | awk '{print $4}' | bc)
    if [ "$value" -lt "$thresh" ]; then
        echo "$title: value($value) is less than thresh($thresh)"
    fi
done < <(smartctl -A /dev/sda | tail -n +8 | head -n -1)   # drop the header lines and the trailing blank line
```
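Calling awk (and bc) once per field spawns several processes for every attribute row. A minimal sketch of the same VALUE/THRESH check that splits each row once with read. It assumes the ten-column ATA layout printed by smartctl -A, skips every line whose first field is not a numeric ID (so the tail/head offsets are no longer needed), and uses bash's 10# prefix instead of bc so values like 010 are not read as octal. The function name check_thresholds and the Raw_Read_Error_Rate sample row are made up for illustration:

```bash
#!/usr/bin/env bash
# Sketch: VALUE-vs-THRESH check with one `read` per attribute row.
# Assumes the ATA column order: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH
# TYPE UPDATED WHEN_FAILED RAW_VALUE (the last variable soaks up the rest).

check_thresholds() {
    local id title flag value worst thresh attr_type updated when_failed raw
    while read -r id title flag value worst thresh attr_type updated when_failed raw; do
        # banner, header and blank lines do not start with a numeric ID
        [[ $id =~ ^[0-9]+$ ]] || continue
        # 10# forces base 10, so "085" and "010" are not treated as octal
        if (( 10#$value < 10#$thresh )); then
            echo "$title: value($((10#$value))) is less than thresh($((10#$thresh)))"
        fi
    done
}

# Real use (as root): smartctl -A /dev/sda | check_thresholds
# Illustrative rows; the Raw_Read_Error_Rate line is invented to trigger the check.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   005   005   006    Pre-fail  Always   FAILING_NOW 1234
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13982'

printf '%s\n' "$sample" | check_thresholds
# prints: Raw_Read_Error_Rate: value(5) is less than thresh(6)
```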
Apart from the VALUE/THRESH comparison, there are a few attributes I'd check explicitly (sample rows from smartctl -A):
```
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13982
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   111   106   000    Old_age   Always       -       36
194 Temperature_Celsius     0x0022   033   043   000    Old_age   Always       -       33 (0 21 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
```
I'd check that the power-on hours don't add up to 4 years or more, that the raw values of Reallocated_Sector_Ct, Runtime_Bad_Block, Reported_Uncorrect, Current_Pending_Sector and Offline_Uncorrectable are all 0, and that the temperature is between 25 and 45°C. Extending the loop above accordingly:
```bash
while read -r line; do
    id=$(echo "$line" | awk '{print $1}' | bc)
    title=$(echo "$line" | awk '{print $2}')
    thresh=$(echo "$line" | awk '{print $6}' | bc)
    value=$(echo "$line" | awk '{print $4}' | bc)
    raw=$(echo "$line" | awk '{print $10}' | bc)   # first token of RAW_VALUE, e.g. 33 of "33 (0 21 0 0 0)"
    # normalized value below the vendor threshold
    if [ "$value" -lt "$thresh" ]; then
        echo "$title: value($value) is less than thresh($thresh)"
    fi
    # reallocated, runtime bad block, reported uncorrect, pending, offline uncorrectable: raw must be 0
    if [ "$id" -eq 5 ] || [ "$id" -eq 183 ] || [ "$id" -eq 187 ] || [ "$id" -eq 197 ] || [ "$id" -eq 198 ]; then
        if [ "$raw" -gt 0 ]; then
            echo "$title: raw value($raw) is greater than zero"
        fi
    fi
    # power-on hours: flag 4 years or more
    if [ "$id" -eq 9 ]; then
        years=$(echo "scale=0; $raw / 24 / 365" | bc)
        if [ "$years" -ge 4 ]; then
            echo "$title: disk has been powered on for $years years (4 or more)"
        fi
    fi
    # temperature: flag anything outside 25-45°C
    if [ "$id" -eq 194 ]; then
        if [ "$raw" -gt 45 ]; then
            echo "$title: disk is hotter (${raw}°C) than 45°C"
        elif [ "$raw" -lt 25 ]; then
            echo "$title: disk is colder (${raw}°C) than 25°C"
        fi
    fi
done < <(smartctl -A /dev/sda | tail -n +8 | head -n -1)
```
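The raw-value rules above chain a lot of [ ... ] tests and shell out to bc for the year calculation. A minimal sketch of the same rules as a single case statement, using the limits chosen in this post: four years is 4 × 365 × 24 = 35,040 hours, so the 13,982 hours in the sample table are roughly 1.6 years. It also trims RAW_VALUE to its leading digits, which copes with the "33 (0 21 0 0 0)" temperature format and would strip a trailing unit suffix on the hours value, should a drive report one. The names check_raw, MAX_POWER_ON_HOURS, MIN_TEMP and MAX_TEMP are hypothetical:

```bash
#!/usr/bin/env bash
# Sketch: the raw-value rules from this post as one testable function.
# Limits are the post's own choices, not vendor specifications.

MAX_POWER_ON_HOURS=$((4 * 365 * 24))   # 4 years = 35040 hours
MIN_TEMP=25
MAX_TEMP=45

# check_raw ID ATTRIBUTE_NAME RAW_VALUE  (prints a warning, or nothing)
check_raw() {
    # keep only the leading digits: "33 (0 21 0 0 0)" -> 33
    local id="$1" title="$2" raw="${3%%[!0-9]*}"
    [ -n "$raw" ] || return 0
    case $id in
        5|183|187|197|198)   # sector/error counters: anything above 0 is suspicious
            if (( 10#$raw > 0 )); then
                echo "$title: raw value($raw) is greater than zero"
            fi ;;
        9)                   # power-on time in hours
            if (( 10#$raw >= MAX_POWER_ON_HOURS )); then
                echo "$title: powered on for $(( 10#$raw / 8760 )) years (4 or more)"
            fi ;;
        194)                 # temperature in °C
            if (( 10#$raw > MAX_TEMP )); then
                echo "$title: ${raw}°C is above ${MAX_TEMP}°C"
            elif (( 10#$raw < MIN_TEMP )); then
                echo "$title: ${raw}°C is below ${MIN_TEMP}°C"
            fi ;;
    esac
}

check_raw 9 Power_On_Hours 13982                      # about 1.6 years: prints nothing
check_raw 194 Temperature_Celsius "33 (0 21 0 0 0)"   # within 25-45°C: prints nothing
check_raw 5 Reallocated_Sector_Ct 8                   # prints: Reallocated_Sector_Ct: raw value(8) is greater than zero
```

Inside the attribute loop, a call such as check_raw "$id" "$title" "$raw" would then replace the three if blocks.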
Now, let's put all of that into a script that loops over every /dev/sd[a-z] disk:
```bash
#!/bin/bash
# Check the SMART values of every /dev/sdX disk; run as root.
for disk in /dev/sd[a-z]; do
    ret=0
    echo "~~ checking ${disk} ~~"
    # overall health verdict (ATA output: "... test result: PASSED")
    health=$(smartctl -H "$disk" | awk '/result: /{print $6}')
    if [ "$health" != "PASSED" ]; then
        echo "Check the disk, it failed the overall SMART health check..."
        ret=1
    fi
    while read -r line; do
        id=$(echo "$line" | awk '{print $1}' | bc)
        title=$(echo "$line" | awk '{print $2}')
        thresh=$(echo "$line" | awk '{print $6}' | bc)
        value=$(echo "$line" | awk '{print $4}' | bc)
        raw=$(echo "$line" | awk '{print $10}' | bc)
        if [ "$value" -lt "$thresh" ]; then
            echo "$title: value($value) is less than thresh($thresh)"
            ret=1
        fi
        if [ "$id" -eq 5 ] || [ "$id" -eq 183 ] || [ "$id" -eq 187 ] || [ "$id" -eq 197 ] || [ "$id" -eq 198 ]; then
            if [ "$raw" -gt 0 ]; then
                echo "$title: raw value($raw) is greater than zero"
                ret=1
            fi
        fi
        if [ "$id" -eq 9 ]; then
            years=$(echo "scale=0; $raw / 24 / 365" | bc)
            if [ "$years" -ge 4 ]; then
                echo "$title: disk has been powered on for $years years (4 or more)"
                ret=1
            fi
        fi
        if [ "$id" -eq 194 ]; then
            if [ "$raw" -gt 45 ]; then
                echo "$title: disk is hotter (${raw}°C) than 45°C"
                ret=1
            elif [ "$raw" -lt 25 ]; then
                echo "$title: disk is colder (${raw}°C) than 25°C"
                ret=1
            fi
        fi
    done < <(smartctl -A "$disk" | tail -n +8 | head -n -1)
    # colored summary per disk (91 = bright red, 92 = bright green, 39 = default)
    if [ "$ret" -eq 1 ]; then
        echo -e " - \e[91mcheck ${disk} manually and monitor it closely.\e[39m"
    else
        echo -e " + \e[92meverything is fine with ${disk}\e[39m"
    fi
done
```
First run:
```
~~ checking /dev/sda ~~
 + everything is fine with /dev/sda
~~ checking /dev/sdb ~~
 + everything is fine with /dev/sdb
~~ checking /dev/sdc ~~
 + everything is fine with /dev/sdc
~~ checking /dev/sdd ~~
 + everything is fine with /dev/sdd
~~ checking /dev/sde ~~
 + everything is fine with /dev/sde
~~ checking /dev/sdf ~~
 + everything is fine with /dev/sdf
~~ checking /dev/sdg ~~
 + everything is fine with /dev/sdg
~~ checking /dev/sdh ~~
 + everything is fine with /dev/sdh
~~ checking /dev/sdi ~~
 + everything is fine with /dev/sdi
~~ checking /dev/sdj ~~
 + everything is fine with /dev/sdj
~~ checking /dev/sdk ~~
 + everything is fine with /dev/sdk
~~ checking /dev/sdl ~~
 + everything is fine with /dev/sdl
~~ checking /dev/sdm ~~
 + everything is fine with /dev/sdm
~~ checking /dev/sdn ~~
 + everything is fine with /dev/sdn
~~ checking /dev/sdo ~~
 + everything is fine with /dev/sdo
~~ checking /dev/sdp ~~
 + everything is fine with /dev/sdp
~~ checking /dev/sdq ~~
 + everything is fine with /dev/sdq
~~ checking /dev/sdr ~~
 + everything is fine with /dev/sdr
```
Looks fine, huh? I'm not saying you should stop using a disk as soon as this reports something bad; it's just a quick way to check a bunch of disks at once. Extend it to suit your needs.
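One possible extension: the glob /dev/sd[a-z] stops at /dev/sdz and never sees NVMe drives. smartctl --scan lists the devices it finds together with the -d device type it would use. A minimal sketch that parses that listing; the function name parse_scan and the sample lines are illustrative, and the exact scan output may differ between smartmontools versions:

```bash
#!/usr/bin/env bash
# Sketch: turn `smartctl --scan` lines such as
#   /dev/sda -d sat # /dev/sda [SAT], ATA device
# into "DEVICE TYPE" pairs. parse_scan is a hypothetical helper.

parse_scan() {
    local dev opt type rest
    while read -r dev opt type rest; do
        # skip anything that is not "<device> -d <type> ..."
        [[ $dev == /dev/* && $opt == -d ]] || continue
        printf '%s %s\n' "$dev" "$type"
    done
}

# Real use (as root):
#   smartctl --scan | parse_scan | while read -r dev type; do
#       smartctl -H -d "$type" "$dev"
#   done

# Illustrative scan output, not captured from a real host.
sample='/dev/sda -d sat # /dev/sda [SAT], ATA device
/dev/sdaa -d scsi # /dev/sdaa, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device'

printf '%s\n' "$sample" | parse_scan
```

NVMe drives report a different health log rather than the ATA attribute table, so the attribute checks above would need their own branch for them.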