Pending Sector won't Reallocate

MontysEvilTwin · Oct 6, 2015

One of my units flagged disk errors (attributes 197 and 198 of the SMART statistics): there are eight pending and offline uncorrectable sectors. I have run fix-disk; it asks if I want to repair a sector, I select 'y' and it says that this was successful, but then it asks if I want to repair the same sector again. It keeps doing this until I select 'n', after which it runs more of fix-disk. Fix-disk has not run to completion as it appears to get stuck. The output is below:

Code:

Running /bin/fix-disk
Custom firmware version 3.03

Checking disk sda

Unmounted /dev/sda1
Unmounted /dev/sda2
Unmounted /dev/sda3

Running short disk self test
Waiting... 62
Waiting... 61
Waiting... 60
             
Error at LBA 123128
Do you wish to attempt repair of the bad block? [Y/N]: y

/dev/sda:
re-writing sector 123128: succeeded

Running short disk self test
Waiting... 62
Waiting... 61
Waiting... 60
             
Error at LBA 123128
Do you wish to attempt repair of the bad block? [Y/N]: y

/dev/sda:
re-writing sector 123128: succeeded

Running short disk self test
Waiting... 62
Waiting... 61
Waiting... 60
             
Error at LBA 123128
Do you wish to attempt repair of the bad block? [Y/N]: y

/dev/sda:
re-writing sector 123128: succeeded

Running short disk self test
Waiting... 62
Waiting... 61
Waiting... 60
             
Error at LBA 123128
Do you wish to attempt repair of the bad block? [Y/N]: n
Skipped repair of LBA 123128
Using superblock 0 on sda1
Using superblock 0 on sda2
Using superblock 0 on sda3
Dev: /dev/sda LBA: 123128
LBA: 123128 is on partition /dev/sda1, start: 2048, bad sector offset: 121080
dumpe2fs 1.42.10 (18-May-2014)
Using superblock 0
Block size: 4096
LBA 123128 maps to file system block 15135 on /dev/sda1

Checking to see if this block is in use...
debugfs 1.42.10 (18-May-2014)
Block 15135 is not in use
Dev: /dev/sda LBA: 123128
LBA: 123128 is on partition /dev/sda1, start: 2048, bad sector offset: 121080
dumpe2fs 1.42.10 (18-May-2014)
Using superblock 0
Block size: 4096
LBA 123128 maps to file system block 15135 on /dev/sda1

Checking to see if this block is in use...
debugfs 1.42.10 (18-May-2014)
Block 15135 is not in use
Dev: /dev/sda LBA: 123128
LBA: 123128 is on partition /dev/sda1, start: 2048, bad sector offset: 121080
dumpe2fs 1.42.10 (18-May-2014)
Using superblock 0
Block size: 4096
LBA 123128 maps to file system block 15135 on /dev/sda1

Checking to see if this block is in use...
debugfs 1.42.10 (18-May-2014)
Block 15135 is not in use


Checking partition /dev/sda3...
e2fsck 1.42.10 (18-May-2014)
Pass 1: Checking inodes, blocks, and sizes
hmx_int_stor: |                                                |  0.9%
hmx_int_stor: |=                                               /  2.6%
hmx_int_stor: |==                                              -  3.5%
hmx_int_stor: |==                                              \  5.2%
hmx_int_stor: |===                                             |  6.9%
hmx_int_stor: |====                                            /  8.6%
hmx_int_stor: |=====                                           - 10.4%
hmx_int_stor: |======                                          \ 12.1%
hmx_int_stor: |=======                                         | 13.8%
hmx_int_stor: |=======                                         / 15.6%
hmx_int_stor: |========                                        - 17.3%
hmx_int_stor: |=========                                       \ 19.0%
hmx_int_stor: |==========                                      | 20.7%
hmx_int_stor: |==========                                      / 21.6%
hmx_int_stor: |===========                                     - 23.3%
hmx_int_stor: |============                                    \ 25.1%
hmx_int_stor: |=============                                   | 27.7%
hmx_int_stor: |==============                                  / 29.4%
hmx_int_stor: |===============                                 - 31.1%
hmx_int_stor: |================                                \ 32.8%
hmx_int_stor: |=================                               | 34.6%
hmx_int_stor: |==================                              / 37.2%
hmx_int_stor: |===================                             - 38.9%
hmx_int_stor: |===================                             \ 40.6%
hmx_int_stor: |====================                            | 42.3%
hmx_int_stor: |=====================                           / 44.1%
hmx_int_stor: |======================                          - 45.8%
hmx_int_stor: |======================                          \ 46.7%
hmx_int_stor: |========================                        | 49.3%
hmx_int_stor: |========================                        / 51.0%
hmx_int_stor: |=========================                       - 52.7%
hmx_int_stor: |==========================                      \ 53.6%
hmx_int_stor: |===========================                     | 56.2%
hmx_int_stor: |===========================                     / 57.0%
hmx_int_stor: |============================                    - 58.8%
hmx_int_stor: |=============================                   \ 61.4%
hmx_int_stor: |==============================                  | 63.1%
hmx_int_stor: |===============================                 / 64.8%
hmx_int_stor: |================================                - 66.5%
hmx_int_stor: |=================================               \ 69.1%
                                                                          
Pass 2: Checking directory structure
hmx_int_stor: |======================================          | 80.0%
                                                                          
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
hmx_int_stor: |============================================    / 92.3%
hmx_int_stor: |============================================    - 92.5%
hmx_int_stor: |=============================================   \ 92.8%
hmx_int_stor: |=============================================   | 93.1%
hmx_int_stor: |=============================================   / 93.4%
hmx_int_stor: |=============================================   - 93.7%
hmx_int_stor: |=============================================   \ 93.9%
hmx_int_stor: |=============================================   | 94.2%
hmx_int_stor: |=============================================   / 94.5%
hmx_int_stor: |=============================================   - 94.7%
hmx_int_stor: |==============================================  \ 95.0%
                                                                          
Pass 5: Checking group summary information
hmx_int_stor: |==============================================  | 95.1%
hmx_int_stor: |=============================================== / 97.6%
hmx_int_stor: |=============================================== - 97.7%
hmx_int_stor: |=============================================== \ 97.8%
hmx_int_stor: |=============================================== | 98.0%
hmx_int_stor: |=============================================== / 98.1%
hmx_int_stor: |=============================================== - 98.2%
hmx_int_stor: |=============================================== \ 98.3%
hmx_int_stor: |=============================================== | 98.5%
hmx_int_stor: |=============================================== / 98.6%
hmx_int_stor: |=============================================== - 98.7%
hmx_int_stor: |=============================================== \ 98.8%
hmx_int_stor: |=============================================== | 99.0%
hmx_int_stor: |================================================/ 99.1%
hmx_int_stor: |================================================- 99.2%
hmx_int_stor: |================================================\ 99.3%
hmx_int_stor: |================================================| 99.5%
hmx_int_stor: |================================================/ 99.6%
hmx_int_stor: |================================================- 99.7%
hmx_int_stor: |================================================\ 99.8%
hmx_int_stor: |================================================| 99.9%
hmx_int_stor: |================================================| 100.0%
                                                                          
hmx_int_stor: 15/655776 files (0.0% non-contiguous), 369174/2622464 blocks

Checking partition /dev/sda1...
e2fsck 1.42.10 (18-May-2014)
Pass 1: Checking inodes, blocks, and sizes
/dev/sda1: |====                                                    |  7.8%
/dev/sda1: |=================                                       / 31.1%
/dev/sda1: |==========================                              - 46.7%
/dev/sda1: |===================================                     \ 62.2%
                                                                          
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
/dev/sda1: |=====================================================   | 94.0%
                                                                          
Pass 5: Checking group summary information
/dev/sda1: |======================================================= / 97.8%
/dev/sda1: |========================================================- 99.4%
/dev/sda1: |========================================================| 100.0%
                                                                          
/dev/sda1: 14/65808 files (0.0% non-contiguous), 15318/263168 blocks

Checking partition /dev/sda2...
e2fsck 1.42.10 (18-May-2014)
Pass 1: Checking inodes, blocks, and sizes

I have also tried issuing sector rewrite and repair commands using hdparm but again it claims success without reallocating. The fault seems to be in the first partition, where the EPG is stored. I am copying off my recordings in case I have to reformat or security erase to fix the problem.
Is there a way I can just reformat the first partition as this would be much less hassle than the alternatives?

FYI - the disk is a 1TB Seagate Pipeline (model ST1000VM002) AF disk.

Edit to correct the number of sectors pending to eight.

prpr · Oct 6, 2015

How long did you leave it? It can take a long time to check sda2.
Reformatting will not fix the problem with the disk. The bad sectors on sda1 are not used anyway, so what will you gain?
Reformatting just recreates the filesystem, and there is nothing wrong with it on sda1 or sda3.
From what I've read, a security erase is the only way to fix the bad sectors (never had to do it myself).

MontysEvilTwin · Oct 6, 2015

Fix-disk got to pass 1 of the analysis of sda2 and then did nothing for about an hour. Normally I would expect it to have completed by then, or at least be visibly fixing errors, but there was no indication of any progress: you usually get a % counter which runs up in fits and starts but this was absent.
I also had a missed recording today (hmt only) and the unit is taking several minutes to shut down fully. These could be coincidences but it does seem a bit sickly. I will set fix-disk running and leave it unmolested for an extended period and see what happens.

neilski · Oct 7, 2015

I'm not familiar with fix-disk. However, any pending sector can be "fixed" by just over-writing it. Perhaps fix-disk is getting mixed up about which sectors it's meant to overwrite. If it worked, you should in general have seen the reallocated count going up and the pending count going down.

I've previously fixed bad sectors manually (easy when there are only a few) by using the data from the SMART error log (smartctl -a) and judicious (by which I mean bloody careful) use of "dd if=/dev/zero seek=N bs=4k count=1" (or bs=512 count=8) - just calculate N with care. I am guessing that this may be how fix-disk works though? (When it works!)
(PS: last time I did this, on a fairly new Toshiba disk, it didn't increment the reallocated count, even though pending sectors did go to zero. At the time I took this to mean it was lying, but perhaps it decided that the sector was safely readable again, once it had been overwritten, and thus didn't reallocate it.)
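A rough sketch of the "calculate N with care" step, assuming the LBA from the SMART log is in 512-byte units and that 4K physical sectors start on multiples of eight logical sectors. The function names are made up for illustration and nothing here touches a disk:

```shell
#!/bin/sh
# Illustrative helpers only: pure arithmetic, no disk access.
# Assumes 512-byte logical sectors and 4K physical sectors aligned
# to multiples of 8 LBAs counted from the start of the whole disk.

# seek= value to pair with bs=4k count=1
seek_4k() {
    echo $(( $1 / 8 ))
}

# seek= value to pair with bs=512 count=8
# (the first logical sector of the enclosing 4K group)
seek_512() {
    echo $(( $1 / 8 * 8 ))
}

lba=123128
echo "bs=4k count=1 seek=$(seek_4k "$lba")"
echo "bs=512 count=8 seek=$(seek_512 "$lba")"
```

Both values address the whole disk (for example /dev/sda), not a partition, because SMART reports LBAs relative to the start of the drive.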

Having said all that, any disk which is showing reallocated or pending sectors is pretty much a replacement job anyway. Fixing the bad sectors may prevent some annoying delays while you copy the data off of course, but where there are a few, there will generally be a lot more soon.

EDIT: just noticed that you have an AF disk - is it possible that fix-disk was written for 512-byte sector drives?

af123 · Oct 7, 2015

neilski - in my experience and from helping lots of posters on here at around the two-year-since-release time, overwriting a pending sector is as likely to just flag it as okay as it is to force a reallocation. The firmware has flagged the sector as suspicious because of a CRC problem during a read operation but that does not necessarily mean that there is a physical problem - it could for example mean that a bit has flipped on the disk (which is a much more common occurrence than most people realise). On the next write operation to that sector the drive should verify the sector to confirm if it can be re-used or not. If it looks fine it will set the status back to normal, otherwise it will probably mark the sector as bad and reallocate it.

fix-disk uses 'hdparm --repair-sector' rather than dd with calculated offsets - I'd expect it to be fine on 4K sector drives but you may be on to something there.

neilski said:
Having said all that, any disk which is showing reallocated or pending sectors is pretty much a replacement job anyway

I don't agree with that. In my work I monitor a lot of enterprise drives and a handful of reallocations do not have any correlation with disk failure. Some drives log a dozen within the first month and then nothing else for the next five years. Once the number of reallocations gets too large, performance /can/ start to be impacted and it may indicate an impending drive failure.

MontysEvilTwin said:
One of my units flagged disk errors (lines 197 and 198 of the smart statistics): there are seven pending and offline uncorrectable sectors. I have run fix-disk; it asks if I want to repair a sector, I select 'y' and it says that this was successful, but then asks if I want to repair the same sector again. It keeps doing this until I select 'n' then it runs more of fix-disk. Fix-disk has not run to completion as it appears to get stuck. The output is below:

Can you post the selftest log from your drive? You have 7 pending sectors but it will just keep stopping at the first one it finds. We can use a selective self test to find a different one and see if that one can be fixed.

Code:

humax# smartctl -l selftest /dev/sda

Also can you post the exact sector rewrite command you have tried?
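For reference, a selective self-test is started with smartctl's '-t select,FIRST-LAST' option. The helper below only prints such a command for a span that begins just after the 4K group containing a known bad LBA. The function name and the example end LBA are illustrative assumptions, not something fix-disk does:

```shell
#!/bin/sh
# Dry run: prints a smartctl command, does not execute it.
# Assumes 512-byte logical sectors grouped eight to a 4K physical sector.

# $1 = known bad LBA, $2 = last LBA to test, $3 = device
selective_test_cmd() {
    next=$(( $1 / 8 * 8 + 8 ))   # first LBA after the bad 4K group
    echo "smartctl -t select,${next}-$2 $3"
}

# Example values only: 2107391 is an assumed end of the first partition.
selective_test_cmd 123128 2107391 /dev/sda
```

Once such a test has finished, the result can be read back with 'smartctl -l selftest /dev/sda'.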

MontysEvilTwin · Oct 7, 2015

I tried the following commands and ran fix-disk again after each one:

Code:

hdparm --repair-sector 123128 --yes-i-know-what-i-am-doing /dev/sda

hdparm --write-sector 123128 --yes-i-know-what-i-am-doing /dev/sda

Each command appeared successful, based on the output returned, but fix-disk still asked me if I wanted to repair the same sector and got stuck in a loop when 'y' was selected: see my first post for an example of this loop. Fix-disk claimed that it had fixed the sector each time.

I now think there may be a bug in fix-disk. I managed to reallocate the sectors by manually issuing commands. I first unmounted the partition:

Code:

umount /dev/sda1

I knew that the fault was in sda1 from the fix-disk log. It probably would have been safer to do this from maintenance mode, but I got away with it in this instance. I then tried to read the sector:

Code:

hdparm --read-sector 123128 /dev/sda

This returned all zeros, which made me think that the sector had been successfully reallocated after all. I then read the previous sector (123127), which returned a block of data (numbers in hex), and then the next one (123129). I could not read sector 123129: it returned an input/output error, so I issued the write command for this sector:

Code:

hdparm --write-sector 123129 --yes-i-know-what-i-am-doing /dev/sda

This was successful: I confirmed this by re-reading the sector and getting zeros. I carried on: read the next sector, got an I/O error and wrote to the sector, as above. I rewrote seven sectors and stopped when they started returning data. I rebooted and found that the sectors had been reallocated, with attributes 197 and 198 now clear: no pending or offline uncorrectable sectors reported. I have since run fix-disk, which ran to completion without errors.

af123 · Oct 7, 2015

It's possible that this is a 4K sector disk problem as suggested by neilski above.
If sector 123128 is a logical sector then it is the first one of 8 contained within the 4K physical sector. That's why I am interested in the output of 'smartctl -l selftest /dev/sda'.
Your approach of rewriting all of the logical sectors within the 4K one is what I would have suggested next. Glad it's sorted!
How many reallocated sectors does the disk now show?

MontysEvilTwin · Oct 7, 2015

af123 said:
It's possible that this is a 4K sector disk problem as suggested by neilski above.
If sector 123128 is a logical sector then it is the first one of 8 contained within the 4K physical sector. That's why I am interested in the output of 'smartctl -l selftest /dev/sda'.
Your approach of rewriting all of the logical sectors within the 4K one is what I would have suggested next. Glad it's sorted!
How many reallocated sectors does the disk now show?

It shows eight sectors reallocated now: I made a mistake in my original post, which I have now corrected (I thought there were seven sectors pending, but it was eight: I was looking in the wrong column at the time). This fits with my other observations; fix-disk had rewritten the first sector and I had to do seven manual reallocations, which equals eight. Thinking about it, after running fix-disk initially, it showed at least one reallocated sector (unsure of number now) but still had more pending which fix-disk couldn't deal with. It does seem probable that the issue is AF disk related with fix-disk rewriting the first sector only. I can't claim to have twigged this yesterday, I merely found that the next sector was unreadable and checked and rewrote the following sectors as necessary. I will run the test you suggested and post the output later.

MontysEvilTwin · Oct 7, 2015

af123 - the output of 'smartctl -l selftest /dev/sdb' is attached, but it does not have much info in it. Is this the test you intended?

Edit: here is the output of '4kalign' as this gives the sector boundaries:

Code:

--> This is an Advanced Format (AF) drive.

        Model Number:       ST1000VM002-1CT162                     
        Logical  Sector size:                   512 bytes
        Physical Sector size:                  4096 bytes
        Logical Sector-0 offset:                  0 bytes
        Nominal Media Rotation Rate: 5900

Disk /dev/sdb: 1000 GB, 1000202273280 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953520065 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *        2048     2107391     1060258   83  Linux
/dev/sdb2         2107392  1932543999   965217330   83  Linux
/dev/sdb3      1932544000  1953523711    10490445   83  Linux


*  OK   * - partiton /dev/sdb1 is properly aligned.
*  OK   * - partiton /dev/sdb2 is properly aligned.
*  OK   * - partiton /dev/sdb3 is properly aligned.

MontysEvilTwin · Oct 7, 2015

To add to the above:
LBA 123128 - 2048 (start sector) = 121080
121080 / 8 = 15135 (file system block number - see log in post #1)
So LBA 123128 is the first logical sector of a 4K sector (LBA 123128 to 123135).
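The same arithmetic as a small sketch, using the partition start (2048) and block size (4096) from the fix-disk log. The function names are made up for illustration. The file system block only lines up with a 4K physical sector because the partition start is itself a multiple of eight:

```shell
#!/bin/sh
# Pure arithmetic, no disk access. Function names are illustrative.
# $1 = LBA (512-byte units), $2 = partition start LBA, $3 = fs block size

# File system block that contains the LBA
lba_to_fs_block() {
    echo $(( ($1 - $2) / ($3 / 512) ))
}

# Position of the LBA inside its file system block (0 = first sector)
lba_index_in_block() {
    echo $(( ($1 - $2) % ($3 / 512) ))
}

echo "block $(lba_to_fs_block 123128 2048 4096)," \
     "index $(lba_index_in_block 123128 2048 4096)"
```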

af123 · Oct 7, 2015

So it seems that the entire 4K sector was unreadable (which makes sense, as the ECC will be for the entire physical sector).
Rewriting just the first 512 bytes of the sector was not enough to fix it - the same first logical sector just kept coming up as unreadable.
It looks like fix-disk could do with an update to read-test all the logical sectors that make up a physical sector on an AF drive and fix any which report as unreadable.
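One way such a read-test could look, as a sketch only and not fix-disk's actual code. It assumes 512-byte logical sectors grouped eight to a physical sector, and that 'hdparm --read-sector' exits non-zero when the read fails. READ_CMD and the function name are hypothetical, there so the loop can be tried without a real disk:

```shell
#!/bin/sh
# Sketch only: walks the eight 512-byte logical sectors that share one 4K
# physical sector and lists the ones that cannot be read.
# READ_CMD is a hypothetical hook; the default assumes that
# 'hdparm --read-sector' returns a non-zero status on a failed read.

READ_CMD=${READ_CMD:-"hdparm --read-sector"}

# $1 = any LBA inside the suspect 4K sector, $2 = device
# Prints one unreadable LBA per line.
unreadable_in_group() {
    first=$(( $1 / 8 * 8 ))
    i=0
    while [ "$i" -lt 8 ]; do
        lba=$(( first + i ))
        if ! $READ_CMD "$lba" "$2" >/dev/null 2>&1; then
            echo "$lba"
        fi
        i=$(( i + 1 ))
    done
}

# Example (not run here): print, rather than execute, the rewrite commands.
# for lba in $(unreadable_in_group 123128 /dev/sda); do
#     echo "hdparm --write-sector $lba --yes-i-know-what-i-am-doing /dev/sda"
# done
```

Printing the write commands instead of running them leaves a chance to check each LBA by hand before anything is overwritten.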

neilski · Oct 7, 2015

Glad it's fixed! I haven't figured out the whole AF thing myself yet. It doesn't seem to make much sense for disks to continue to report sector numbers etc. as if the sectors were 512 bytes long but I guess there are backward-compatibility issues all over the place.

As for disks with read failures being on the way out, I must confess I haven't really given any disks much of a chance in the last decade, once they started to show any issues. Before that though, I did tend to stretch things out a bit (disks were more expensive then and it was worth prolonging things), and it always seemed like a fairly clear progression from one or two sectors to lots of sectors to ... toast...

Certainly in work, we toss 'em on the first bad sector (at least on desktops, but I think also in RAID arrays).
(That recent Toshiba was only a few months old when it gave hard read errors - I wasn't a happy bunny.)

My mental picture btw, rightly or wrongly (haven't done any fresh googling), was that each physical sector has a whole bunch of ECC info (I seem to recall 10%+ of the sector size) and thus multi-bit "errors" of considerable size are routinely fixed, much as happens on optical disks. A promise that lots of disk datasheets quote is 1e14 bits read without uncorrectable errors, which sounds like a lot but actually really isn't. (I laughed when I noticed on a recent datasheet - maybe it was the one for the Seagate Pipeline in fact - the garbled number "10e14", which looks a lot like a typo for 1e14!)

Mr Parsnip · Oct 18, 2015

Just to add to this - I got the same eight sectors problem (LBA 27274504 on sda2 in my case) and using the same read/write tactic as MontysEvilTwin fixed everything (I needed to write seven sectors - the first already had zeros).

Block size: 4096
LBA 27274504 maps to file system block 3146248 on /dev/sda2

Incidentally, I initially got:

197 Current_Pending_Sector -O--C- 8 100 100 000 -
198 Offline_Uncorrectable ----C- 8 100 100 000 -

and fix-disk took this down to 7 (read showed first sector was already filled with zeros)

Before the fix Hummy was responding sluggishly (e.g. pause on playback) and occasionally momentarily reporting recordings as encrypted - now all is fine, so a big (indirect) thanks to all.
I guess I need to keep an eye on SMART (would be nice if someone could get the e-mail warning working)

Suggest fix-disk needs some more work for large drives, and documentation needs updating

Thanks again!

af123 · Oct 18, 2015

Mr Parsnip said:
I guess I need to keep an eye on SMART (would be nice if someone could get the e-mail warning working)

Adding this feature to the RS (Remote Scheduling) service would be the easiest since that can already send email messages that should be accepted by most ISPs and it already knows an email address for each box (there might be a small T&C update required unless they already cover the data required).
As has already been noted, it would also be possible to get the box to send emails directly using either the ssmtp or busybox/sendmail packages but that would also need a configuration screen where people could enter details of the email address, ISP smart-host server and other details such as credentials for authentication and encryption parameters. This screen did exist in the past for the epg-keywords package that was deprecated by RS.

I will probably add this email setup screen regardless and provide a central library method for packages to send email from the box but what do people think about adding the disk problem alerts to RS?

Suggest fix-disk needs some more work for large drives, and documentation needs updating

I agree - xyz321 is the author and maintainer of the fix-disk utility, hopefully he can find some time to look at this.

xyz321 · Oct 18, 2015

It might be a week or two away. I have an external 2TB USB disk which seems to be terminal. The pending sector count keeps rising but it is in a critical area - one of the top level directories has become unreadable. I suppose it gives me something to test fix-disk on.

neilski · Oct 18, 2015

Well, I've had only the briefest of looks at the pieces needed to do the email warning thing. The main stumbling block for me tbh would be my almost total ineptitude with Jim. Pardon my ignorance, but is it fine to add functionality to the CF using the likes of shell scripts? (I'm presuming python wouldn't be ideal, or even bash, since not all CF boxes would have them.)

af123 said:
I will probably add this email setup screen regardless and provide a central library method for packages to send email from the box but what do people think about adding the disk problem alerts to RS?

Sounds like a pretty good idea to me, although I haven't got around to trying RS yet (this would be the clincher).

xyz321 said:
It might be a week or two away. I have an external 2TB USB disk which seems to be terminal. The pending sector count keeps rising but it is in a critical area - one of the top level directories has become unreadable. I suppose it gives me something to test fix-disk on.

Yup, just so long as it's AF - I believe 2TB is the biggest size which can be non-AF.

xyz321 · Oct 18, 2015

It is AF but I will have to work around 'hdparm', which barfs at the USB interface. The coreutils version of 'dd' can be used to zero out sectors in direct mode.
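A sketch of what such a 'dd' call might look like, not the code that ended up in fix-disk. It assumes GNU coreutils dd (for oflag=direct), 512-byte logical sectors and 4K physical sectors aligned to multiples of eight LBAs. The function name and /dev/sdX are placeholders, and the helper only prints the command:

```shell
#!/bin/sh
# Dry run: prints a dd command, does not execute it.
# Assumes GNU coreutils dd (oflag=direct bypasses the page cache),
# 512-byte logical sectors and 4K physical sectors aligned to 8 LBAs.

# $1 = device, $2 = any LBA inside the bad 4K physical sector
zero_4k_cmd() {
    echo "dd if=/dev/zero of=$1 bs=4096 seek=$(( $2 / 8 )) count=1 oflag=direct"
}

# /dev/sdX is a placeholder, not a real device name.
zero_4k_cmd /dev/sdX 123128
```

Running the printed command overwrites the 4096 bytes at that location with zeros, so the device name and LBA need checking first.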

antipodean · Oct 18, 2015

af123 said:
but what do people think about adding the disk problem alerts to RS?

Sounds the most sensible way to me. Minimum setting up required.

Black Hole · Oct 18, 2015

I agree.

MartinLiddle · Oct 19, 2015

af123 said:
but what do people think about adding the disk problem alerts to RS?

My concern is that some users are overly sensitive to disk error reports and straight away replace the hard drive. I would suggest that increases in attributes 197 and 198 are always reported but increases in attribute 5 are only reported when the daily rate is more than some threshold; perhaps three sectors a day.
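The suggested policy could be sketched as follows. This is only an illustration of the rule described above, not anything RS implements; the function name, argument order and the way samples are stored are all assumptions:

```shell
#!/bin/sh
# Illustrative alert rule, no disk access.
# Arguments (raw attribute values from two SMART samples):
#   $1/$2 = previous/current attribute 197 (pending)
#   $3/$4 = previous/current attribute 198 (offline uncorrectable)
#   $5/$6 = previous/current attribute 5 (reallocated)
#   $7    = days between the two samples
#   $8    = reallocation threshold in sectors per day
# Returns 0 if an alert should be sent, 1 otherwise.
should_alert() {
    # Any increase in 197 or 198 is always reported.
    [ "$2" -gt "$1" ] && return 0
    [ "$4" -gt "$3" ] && return 0
    # Attribute 5 is reported only when the daily rate exceeds the threshold.
    days=$7
    [ "$days" -lt 1 ] && days=1
    [ $(( $6 - $5 )) -gt $(( $8 * days )) ] && return 0
    return 1
}

if should_alert 8 8 8 8 0 4 1 3; then
    echo "alert"
else
    echo "no alert"
fi
```

Comparing the increase against threshold times days avoids integer division, so a slow trickle over a long interval is not rounded into an alert.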
