[tempmon] Monitor temperature and take action if too high.

af123

Administrator
Staff member
It's been bothering me for a while that there is no easy way of telling if the internal fan has failed, so I've written a new tempmon package. It checks the HDD temperature every three minutes and can take action if it exceeds configurable thresholds.



The alert looks like this and also creates a notification which will be shown when logging into the web interface.
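For anyone curious how a threshold check like this could be structured, here is a minimal Python sketch. It is not the package's actual code: the level names (`warn`, `standby`, `critical`), the `Thresholds` class and the numbers are all placeholders made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Hypothetical level names and placeholder values, not the package defaults.
    warn: int = 45
    standby: int = 50
    critical: int = 60


def action_for(temp_c: int, t: Thresholds) -> str:
    """Return the most severe level whose threshold the temperature exceeds."""
    # Check the most severe level first so only one action is chosen.
    if temp_c > t.critical:
        return "critical"
    if temp_c > t.standby:
        return "standby"
    if temp_c > t.warn:
        return "warn"
    return "ok"
```

A periodic job (every three minutes, say) would read the temperature, call `action_for`, and act on the result. "Exceeds" is taken here to mean strictly greater than the threshold.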

 

Wallace

Traveler 34122
Just when you think it can't get any better...

Thanks af123.

Update:

I have installed this on one of my HDRs to test. Set warnings to 45/50/60. The actual HDD temp is 26 as the unit hasn't been on long.

The HDR gave a warning in the WebIF 'Temperature is 118 and exceeds standby threshold of 50'. The HDR shut off (standby).

When I powered it on, the WebIF reported '...box crashed, check Crash.log for info...'

Update, update:

I am stuck in a loop! Now when I turn the unit back on (front button), WebIF reports high temperature and puts the unit into standby again.
The unit does not even stay on long enough for me to disable the package.

HELP!

Update, update, update!

Wow, you have to be quick on the button, but I just managed to remove the tempmon package before it shut the unit down. All OK now.

I think I will wait for a package update! Just glad I didn't install it on the other HDR used by SWMBO!
 

af123

Administrator
Staff member
Ouch, sorry about that - worked fine on all three of my boxes!
I'll send you an update file to disable it. You'll need to put it on a USB stick and plug it in during boot.

Package removed from repository for the moment.
 

Wallace

Traveler 34122
Cross posted I think. I managed to remove it.

FWIW, the HDD is a WD 10EVDS as per 'HDR Green' in my sig.
 

Wallace

Traveler 34122
No problem, here is the result:

Code:
>>> Beginning diagnostic tm

Running: tm
smartctl 5.41 2011-06-09 r3365 [7405b0-smp-linux-2.6.18-7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   200   200   051    -    0
  3 Spin_Up_Time            POS--K   192   189   021    -    6375
  4 Start_Stop_Count        -O--CK   095   095   000    -    5997
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   100   253   000    -    0
  9 Power_On_Hours          -O--CK   092   092   000    -    6369
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   095   095   000    -    5995
192 Power-Off_Retract_Count -O--CK   193   193   000    -    5993
193 Load_Cycle_Count        -O--CK   185   185   000    -    46309
194 Temperature_Celsius     -O---K   114   085   000    -    33
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning



>>> Ending diagnostic tm
 

Trev

The Dumb One
It worked OK on mine, but I had to reboot as the WI no longer responded. There were a couple of other updates as well, including the WI. Thinking about it, it probably wasn't tempmon as I had to install it after the reboot.
Anyway, set it to warn at 35 and sure enough it did when the temp got there, but on resetting the warning temp to 46 (well above its actual temp), the warning remained until a reboot. Is this by design? If so, how about making it auto reset when the temp drops below the warning threshold?
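The auto-reset suggested here could work along these lines. This is only a sketch of the idea in Python, not how tempmon behaves: the `WarningState` class and the `clear_margin` parameter are hypothetical. The warning is re-evaluated against the current threshold on every check, and it clears once the reading falls far enough below it.

```python
class WarningState:
    """Tracks whether a temperature warning is currently showing."""

    def __init__(self, warn_threshold: int, clear_margin: int = 0):
        self.warn_threshold = warn_threshold
        # Hypothetical margin: degrees below the threshold required before
        # the warning clears, to stop it flickering around the limit.
        self.clear_margin = clear_margin
        self.active = False

    def update(self, temp_c: int) -> bool:
        """Re-evaluate against the current threshold and return the state."""
        if temp_c > self.warn_threshold:
            self.active = True
        elif temp_c <= self.warn_threshold - self.clear_margin:
            self.active = False
        # Otherwise the reading is inside the margin: keep the previous state.
        return self.active
```

With this approach, raising the threshold from 35 to 46 while the disk sits at 36 would clear the warning at the next check rather than waiting for a reboot.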
 

Wallace

Traveler 34122
Here you go...

Code:
>>> Beginning diagnostic tm

Running: tm
smartctl 5.41 2011-06-09 r3365 [7405b0-smp-linux-2.6.18-7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   192   189   021    Pre-fail  Always       -       6375
  4 Start_Stop_Count        0x0032   095   095   000    Old_age   Always       -       5997
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6370
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   095   095   000    Old_age   Always       -       5995
192 Power-Off_Retract_Count 0x0032   193   193   000    Old_age   Always       -       5993
193 Load_Cycle_Count        0x0032   185   185   000    Old_age   Always       -       46309
194 Temperature_Celsius     0x0022   107   085   000    Old_age   Always       -       40
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0



>>> Ending diagnostic tm

Obviously, the unit has warmed up a bit now...
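Worth noting from the two dumps: on attribute 194 the normalised VALUE column (114, then 107) is nowhere near the RAW_VALUE column (33, then 40), so anything scripted against this output has to pick the right field. A rough Python sketch of pulling the raw figure out of either table layout follows. The function name is made up, and it assumes the raw value is the last column, optionally followed by a bracketed note as some drives print.

```python
import re
from typing import Optional


def raw_temperature(smartctl_output: str) -> Optional[int]:
    """Return RAW_VALUE of SMART attribute 194, or None if it is not found."""
    for line in smartctl_output.splitlines():
        # Drop a trailing bracketed note such as "(Min/Max 20/45)".
        line = re.sub(r"\s*\(.*\)\s*$", "", line)
        fields = line.split()
        # Attribute rows start with the numeric ID; RAW_VALUE is the last column.
        if len(fields) >= 3 and fields[0] == "194" and fields[-1].isdigit():
            return int(fields[-1])
    return None
```

Fed the first dump this gives 33 and the second gives 40, which matches what the drive is really doing.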
 

af123

Administrator
Staff member
The current temperature is displayed on the settings panel now too so you can check that it seems reasonable!
 