RS Disk problem emails

af123 · Oct 19, 2015

Following on from discussion in http://hummy.tv/forum/threads/pending-sector-wont-reallocate.6617/ , I am looking into sending alert emails from RS if any disk problems are detected. I'm starting this thread to discuss the new feature and collectively determine how it should work.

Initially, I plan to have a global on/off switch for each RS user which will default to on.
It will use the same acknowledgement system as the web interface currently does, which lets you acknowledge the current level of an attribute and receive another email only if it increases again (useful for reallocated sectors; in my case I have a few offline sectors that I can't find). As it stands, you'll have to acknowledge on the web interface rather than through RS.
One email a day whilst the problem exists.
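
The acknowledgement rule described above could be sketched roughly as follows. This is only an illustration, not the actual RS or web interface code: the `SmartCounts` type, its field names and the `needs_alert` function are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmartCounts:
    """Hypothetical container for the three counters discussed in this thread."""
    realloc: int = 0
    pending: int = 0
    offline: int = 0


def needs_alert(current: SmartCounts, acknowledged: SmartCounts) -> bool:
    """Return True when any counter has risen above the acknowledged level."""
    return (
        current.realloc > acknowledged.realloc
        or current.pending > acknowledged.pending
        or current.offline > acknowledged.offline
    )


# The user has acknowledged 8 reallocated and 2 offline sectors.
acked = SmartCounts(realloc=8, offline=2)

print(needs_alert(SmartCounts(realloc=8, offline=2), acked))   # False: nothing has changed
print(needs_alert(SmartCounts(realloc=16, offline=2), acked))  # True: realloc count has grown
```

With a rule like this, a box with a stable handful of acknowledged sectors stays quiet until one of the counters goes up again.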

An example email:

I think that the key here is to get that wiki page ( http://wiki.hummy.tv/wiki/Disk_Problem ) right.

Black Hole · Oct 19, 2015

Needs some words in the email to explain how to acknowledge the report to not receive it again, and how to turn them off altogether.

Trev · Oct 19, 2015

But if you have to acknowledge on the WI, does this not rather defeat the object of being able to switch it off to stop the 'spam' emails when you are away?

MymsMan · Oct 19, 2015

af123 said:
One email a day whilst the problem exists.

I think that the key here is to get that wiki page ( http://wiki.hummy.tv/wiki/Disk_Problem ) right.

I would prefer a single email when the problem first occurs, or when sector counts change, rather than a daily one; a weekly reminder at the most.

Agree that a clear wiki is key

neilski · Oct 20, 2015

Well, I don't feel strongly about the email frequency - once a day until ack wouldn't actually bother me, but neither would once-only (per count change) I guess.

I had a look in the wiki, and I have a question about the bit that says "...recommended that a full maintenance mode disk check is performed.": does "maintenance mode disk check" mean running fix-disk?

(FWIW: I suppose I mightn't instantly bin a disk with just a few reallocs, but I regard a disk with any unreadable sectors at all as a time bomb and thus I'd replace it promptly. I don't think of myself as overly sensitive to disk errors, having lost data more than once in the past from failing disks, not to mention having spent many hours recovering from such problems, including the horribly long delays to perform a backup on a partially knackered disk as it repeatedly retries reads.)

MymsMan · Oct 20, 2015

Referring to maintenance mode with no link to the instructions is not helpful so I have made the two references into links

af123 · Oct 20, 2015

rs 1.4.0, just published, will upload selected disk S.M.A.R.T. data to the RS servers. It can be viewed through the RS Settings page once the server realises that you have version 1.4.0 installed (run the rs/sync diagnostic to force that if you don't want to wait overnight). There's also a new option that will appear in that screen for enabling warning email messages to be sent. Emails are triggered once every 7 days for boxes which have current, un-acknowledged disk problems.
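
For anyone wondering how the seven-day trigger might behave, here is a minimal sketch. It assumes the server remembers when it last emailed each box; `email_due`, `EMAIL_INTERVAL` and the parameter names are invented for the example and are not taken from the RS code.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed interval, taken from the "once every 7 days" description above.
EMAIL_INTERVAL = timedelta(days=7)


def email_due(
    now: datetime,
    last_sent: Optional[datetime],
    has_unacked_problem: bool,
    emails_enabled: bool = True,
) -> bool:
    """Decide whether a warning email should go out for one box."""
    if not emails_enabled or not has_unacked_problem:
        return False
    # Never emailed before, or the last email is at least a week old.
    return last_sent is None or now - last_sent >= EMAIL_INTERVAL


now = datetime(2015, 10, 27, 9, 0)
print(email_due(now, None, True))                        # True: first warning
print(email_due(now, datetime(2015, 10, 25), True))      # False: emailed two days ago
print(email_due(now, datetime(2015, 10, 20), True))      # True: a week has passed
print(email_due(now, datetime(2015, 10, 1), False))      # False: nothing un-acknowledged
```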

neilski said:
I had a look in the wiki, and I have a question about the bit that says "...recommended that a full maintenance mode disk check is performed.": does "maintenance mode disk check" mean running fix-disk?

Yes. I don't usually call it that as the disk scan is a selectable menu option in maintenance mode.

(FWIW: I suppose I mightn't instantly bin a disk with just a few reallocs, but I regard a disk with any unreadable sectors at all as a time bomb and thus I'd replace it promptly.

I'm sure you know, but it's worth re-stating that an offline or pending sector is more often than not found to be fine once it is written to again. Some drive firmware is worse than others at flagging good sectors as suspect when they are weakly written. Two years after the HDR-Fox T2 was released, we had a large flurry of people with pending/offline sectors posting on here and the majority of them managed to repair them without any reallocations.
As for reallocated sectors, I still stand by the advice in the Wiki. I wouldn't worry about a handful of reallocations, particularly near the start of a drive's life - the thing to watch is the rate at which they are accumulating. There's at least one user on here running with a couple of thousand reallocations - I would have replaced the disk by now but it's only tele!
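
To make "the rate at which they are accumulating" concrete, here is a rough sketch that works out reallocations per day from dated readings. It assumes you have kept a history of readings yourself (nothing in this thread says RS stores one); the function name and the sample figures are made up.

```python
from datetime import date
from typing import List, Tuple


def realloc_rate_per_day(samples: List[Tuple[date, int]]) -> float:
    """Average growth in reallocated sectors per day between the oldest
    and newest reading. `samples` must be sorted oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two readings")
    (first_day, first_count), (last_day, last_count) = samples[0], samples[-1]
    days = (last_day - first_day).days
    if days <= 0:
        raise ValueError("readings must span at least one day")
    return (last_count - first_count) / days


# Illustrative readings only, not real data from any box.
history = [(date(2015, 10, 1), 8), (date(2015, 10, 21), 48)]
print(realloc_rate_per_day(history))  # 2.0 sectors per day
```

A disk that has sat at the same count for months gives a rate of zero, which is the reassuring case; a steadily climbing figure is the one to act on.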

af123 · Oct 20, 2015

MymsMan said:
Referring to maintenance mode with no link to the instructions is not helpful so I have made the two references into links

Thanks. I think a simpler Maintenance Mode page would be useful too. The current one provides a lot of information and options rather than guiding somebody through their first disk scan and repair. Maybe that should be a new page dedicated to running the disk check.

Black Hole · Oct 21, 2015

What's wrong with Quick Guide to Disk Recovery (click)? I admit it could do with updating.

af123 · Oct 21, 2015

I haven't enabled the emails yet but over 500 devices updated and uploaded smart data overnight.
Oh dear:

Code:

mysql> select smart_status, count(*) from device where smart_status is not null and smart_status != '' group by 1;
+--------------+----------+
| smart_status | count(*) |
+--------------+----------+
| FAILED!      |        3 |
| PASSED       |      540 |
+--------------+----------+
2 rows in set (0.00 sec)

mysql> select smart_realloc, smart_pending, smart_offline from device where smart_status = 'FAILED!';
+---------------+---------------+---------------+
| smart_realloc | smart_pending | smart_offline |
+---------------+---------------+---------------+
|          3360 |             0 |             0 |
|          3593 |             1 |             1 |
|          8341 |             8 |             8 |
+---------------+---------------+---------------+
3 rows in set (0.01 sec)

mysql> select count(*) from device where smart_realloc > 0;
+----------+
| count(*) |
+----------+
|       45 |
+----------+
1 row in set (0.00 sec)

af123 · Oct 21, 2015

Black Hole said:
What's wrong with Quick Guide to Disk Recovery (click)? I admit it could do with updating.

It's too wordy overall for what I want on a landing page and not detailed enough under step 3.
I'll probably write something shorter (with pictures!) and reference your guide for people who want more information or a more step-by-step process.

neilski · Oct 21, 2015

af123 said:
I haven't enabled the emails yet but over 500 devices updated and uploaded smart data overnight.
Oh dear:

Aha, some fun statistics! (Not clear to me why you're saying "oh dear" though. 3 failing out of 543 isn't too scary.)
I am slightly puzzled that the disk with 8 pending sectors, which I'd therefore tend to assume is AF, has a realloc count which clearly isn't a multiple of 8. Oh well...

It'll be interesting to see if those failing disks are still sending data to RS in another 6 months ;-) (Are the serial numbers logged too?) Great opportunities for studying the bad end of the bathtub curve...

And there are plenty of disks with reallocs and not "FAILED" - do they have any pending sectors?

Black Hole · Oct 21, 2015

af123 said:
It's too wordy overall for what I want on a landing page and not detailed enough under step 3.
I'll probably write something shorter (with pictures!) and reference your guide for people who want more information or a more step-by-step process.

Fair dos

af123 · Oct 21, 2015

neilski said:
Aha, some fun statistics! (Not clear to me why you're saying "oh dear" though. 3 failing out of 543 isn't too scary.)

Might be for those three : ) When I turn it on it will send emails to 45 people though.

It'll be interesting to see if those failing disks are still sending data to RS in another 6 months ;-) (Are the serial numbers logged too?) Great opportunities for studying the bad end of the bathtub curve...

Not serial numbers but I have MAC address.

And there are plenty of disks with reallocs and not "FAILED" - do they have any pending sectors?

That isn't too surprising; the vendor has probably set the disks to report FAILED once there have been more than 2000 or so reallocations.

Code:

mysql> select smart_status, smart_realloc, smart_pending from device where smart_realloc > 0 order by 2 desc;
+--------------+---------------+---------------+
| smart_status | smart_realloc | smart_pending |
+--------------+---------------+---------------+
| FAILED!      |          8341 |             8 |
| FAILED!      |          3593 |             1 |
| FAILED!      |          3360 |             0 |
| PASSED       |          3016 |           120 |
| PASSED       |          2172 |             1 |
| PASSED       |          1149 |             0 |
| PASSED       |          1109 |             0 |
| PASSED       |           799 |             0 |
| PASSED       |           686 |           503 |
| PASSED       |           504 |             0 |
| PASSED       |           471 |             0 |
| PASSED       |           367 |             0 |
| PASSED       |           320 |             2 |
| PASSED       |           297 |             0 |
| PASSED       |           179 |             4 |
| PASSED       |           145 |             0 |
| PASSED       |           144 |             0 |
| PASSED       |           128 |             0 |
| PASSED       |           122 |             2 |
| PASSED       |           118 |             0 |
| PASSED       |           109 |            16 |
| PASSED       |           103 |             0 |
| PASSED       |            74 |             0 |
| PASSED       |            66 |             0 |
| PASSED       |            65 |             0 |
| PASSED       |            41 |             0 |
| PASSED       |            38 |             0 |
| PASSED       |            27 |             0 |
| PASSED       |            17 |             0 |
| PASSED       |            16 |             0 |
| PASSED       |            16 |             0 |
| PASSED       |            12 |             0 |
| PASSED       |            11 |             0 |
| PASSED       |            11 |             0 |
| PASSED       |            10 |             2 |
| PASSED       |             8 |             0 |
| PASSED       |             8 |             0 |
| PASSED       |             8 |             0 |
| PASSED       |             8 |             0 |
| PASSED       |             8 |             0 |
| PASSED       |             8 |             0 |
| PASSED       |             8 |             0 |
| PASSED       |             8 |             0 |
| PASSED       |             5 |             0 |
| PASSED       |             4 |             0 |
| PASSED       |             3 |             0 |
+--------------+---------------+---------------+
46 rows in set (0.00 sec)

cdmackay · Oct 22, 2015

> | PASSED | 2172 | 1 |

ooh! that's me

I am not happy to be 4th overall, and 2nd in PASSED

af123 · Oct 22, 2015

I wouldn't worry - I doubt you'll stay PASSED for long ; )

mihaid · Oct 22, 2015

I definitely like the way this thread is proceeding. For the failed disks how many start/stops have been counted?

MontysEvilTwin · Oct 22, 2015

Are eight reallocated sectors on an AF disk equivalent to one sector on a non-AF one, with respect to the statistics? Aren't AF disk sectors always going to be reallocated in blocks of eight?

af123 · Oct 22, 2015

MontysEvilTwin said:
Are eight reallocated sectors on an AF disk equivalent to one sector on a non-AF one, with respect to the statistics? Aren't AF disk sectors always going to be reallocated in blocks of eight?

I don't know and I expect it depends on the disk vendor and maybe even firmware version. I would expect the reallocated field to represent physical sectors but I have no proof of that.
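
The arithmetic behind the "blocks of eight" question can be sketched like this. An Advanced Format drive has 4096-byte physical sectors presented as 512-byte logical sectors, so one physical sector covers eight logical ones. Whether a given drive counts reallocations in logical or physical units is exactly the unknown here, so the check below only shows which of the FAILED disks' counts from earlier in the thread happen to be multiples of eight; the names are made up for the example.

```python
LOGICAL_SECTOR_BYTES = 512     # sector size presented to the host
PHYSICAL_SECTOR_BYTES = 4096   # Advanced Format physical sector size
LOGICAL_PER_PHYSICAL = PHYSICAL_SECTOR_BYTES // LOGICAL_SECTOR_BYTES  # 8


def is_whole_physical_sectors(count: int) -> bool:
    """True if `count` logical sectors would map onto whole 4096-byte sectors."""
    return count % LOGICAL_PER_PHYSICAL == 0


# Reallocated counts of the three FAILED disks quoted in this thread.
for count in (3360, 3593, 8341):
    print(count, count % LOGICAL_PER_PHYSICAL, is_whole_physical_sectors(count))
# 3360 0 True
# 3593 1 False
# 8341 5 False
```

A remainder other than zero does not prove anything on its own; it is consistent either with a non-AF disk or with a drive that counts in physical sectors.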

af123 · Oct 22, 2015

mihaid said:
I definitely like the way this thread is proceeding. For the failed disks how many start/stops have been counted?

That information isn't uploaded, just the number of reallocated/pending/offline sectors and the overall status. It would be interesting to acquire more data such as disk model, start/stop counts and power-on hours but it does not strictly need those in order to send the warning emails. I think that would have to be opt-in which would make it difficult to get a representative sample. The terms of RS allow for the collection and storage of "Device information including hostname, MAC address, model, firmware version and disk utilisation" - but I'm wary of pushing that 'including' too far.
