Disk issue - advice please

SimpleSim

Hi
Just back from a week's holiday, so the box has only been recording, with nobody watching at the same time.

I have noticed that my recordings jump and stutter. Sometimes if I rewind, the same part plays OK; other times it replays with the same skips. Old recordings that I know were recorded OK also skip around. When I run the standard HDD test I get the following:

HDD test fail
You may recover the HDD through Format Storage. (Error code: 8)

I have seen another thread on a disk problem, so I have tried to use the same tools on my box.

>>> Beginning diagnostic diskattr
Running: diskattr
smartctl 5.41 2011-06-09 r3365 [7405b0-smp-linux-2.6.18-7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Seagate Pipeline HD 5900.2
Device Model: ST31000424CS
Serial Number: 5VX2H6ZE
LU WWN Device Id: 5 000c50 044930c5c
Firmware Version: SC13
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Mon Nov 5 17:21:37 2012 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   085   084   006    Pre-fail  Always       -       189654418
  3 Spin_Up_Time            0x0003   095   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   097   097   020    Old_age   Always       -       3480
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       78841377
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2261
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1740
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       2400
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   051   044   045    Old_age   Always   In_the_past 49 (1 44 49 49)
194 Temperature_Celsius     0x0022   049   056   000    Old_age   Always       -       49 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   045   037   000    Old_age   Always       -       189654418
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0


>>> Ending diagnostic diskattr
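
For anyone reading a table like the one above for the first time, the rows that relate most directly to bad sectors are Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable. What follows is only a rough Python sketch for picking their raw values out of pasted smartctl output. It is not part of the Humax tools or the custom firmware, and the function name and the choice of rows to watch are assumptions made for illustration.

```python
# Attribute names assumed (for illustration) to be the ones worth watching
# when hunting bad sectors; this list is not taken from any Humax tool.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def sector_health(attr_table):
    """Return {attribute_name: raw_value} for the watched rows of a
    smartctl attribute table pasted in as text."""
    found = {}
    for line in attr_table.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with the numeric ID.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            found[fields[1]] = int(fields[9])  # first token of RAW_VALUE
    return found


# Three rows copied from the diskattr output in this post.
sample = """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1
"""

print(sector_health(sample))
```

A non-zero Current_Pending_Sector or Offline_Uncorrectable, as here, suggests at least one sector the drive could not read.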


I will post some extra stuff below...
 
humax# smartctl --test=short /dev/sda
smartctl 5.41 2011-06-09 r3365 [7405b0-smp-linux-2.6.18-7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 1 minutes for test to complete.
Test will complete after Mon Nov 5 15:56:16 2012
Use smartctl -X to abort test.
humax# smartctl -l selftest /dev/sda
smartctl 5.41 2011-06-09 r3365 [7405b0-smp-linux-2.6.18-7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      2260         909913325
# 2  Short offline       Completed: read failure       90%      2260         909913325
# 3  Short offline       Completed: read failure       90%      2259         909913325
# 4  Short offline       Completed: read failure       90%      2259         909913325
# 5  Short offline       Completed: read failure       90%      2259         909913325
# 6  Short offline       Completed: read failure       90%      2259         909913325
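
The number that matters in the log above is the LBA_of_first_error column, which is the same sector in every run. As a hedged illustration only, a small Python sketch for collecting the distinct failing LBAs from pasted log text could look like the following. The function name is made up, and it assumes each log entry sits on one unwrapped line with the LBA as the last column.

```python
from typing import List


def first_error_lbas(selftest_log: str) -> List[int]:
    """Collect the distinct LBA_of_first_error values from pasted
    'smartctl -l selftest' output. Entries whose last column is not a
    number (smartctl prints '-' when there was no error) are skipped."""
    lbas = set()
    for line in selftest_log.splitlines():
        fields = line.split()
        # Log entries start with '#'; the LBA is assumed to be the last column.
        if fields and fields[0].startswith("#") and fields[-1].isdigit():
            lbas.add(int(fields[-1]))
    return sorted(lbas)


# Header plus two entries from the log in this post, joined onto single lines.
sample = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 3  Short offline       Completed: read failure       90%      2260         909913325
# 4  Short offline       Completed: read failure       90%      2259         909913325
"""

print(first_error_lbas(sample))
```

One distinct LBA across all the failed runs points at a single bad spot rather than damage spread over the disk.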
 
humax# fdisk -lu /dev/sda

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               2     2104514     1052256+  83  Linux
/dev/sda2         2104515  1932539174   965217330   83  Linux
/dev/sda3      1932539175  1953520064    10490445   83  Linux
humax#
humax# /sbin/tune2fs -l /dev/sda2
tune2fs 1.41.14 (22-Dec-2010)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: f774ede7-525b-41dd-8b16-f270652e1f9f
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 60334080
Block count: 241304332
Reserved block count: 12065216
Free blocks: 84665574
Free inodes: 60331591
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 966
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Filesystem created: Sat Jan 1 00:00:17 2000
Last mount time: Mon Nov 5 12:00:10 2012
Last write time: Mon Nov 5 12:00:10 2012
Mount count: 1733
Maximum mount count: 37
Last checked: Sat Jan 1 00:00:17 2000
Check interval: 15552000 (6 months)
Next check after: Thu Jun 29 01:00:17 2000
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: fe7b3f3b-b765-4792-a80d-7962c3a54863
Journal backup: inode blocks
 
So I think I have to use this formula to get the block: fsblock = (int)((<problem LBA> - <partition start LBA>) * <sector size> / <fs block size>)

problem LBA 909,913,325
part start 2,104,515
sector size 512
fs block size 4096
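
The formula and the four values above can be checked with a few lines of Python. This is just a sketch of the arithmetic, not something that runs on the box; the function name is invented for illustration, and the defaults of 512 and 4096 are the sector and block sizes reported by smartctl and tune2fs in this post.

```python
def lba_to_fs_block(problem_lba, part_start_lba, sector_size=512, fs_block_size=4096):
    """Filesystem block (counted from the start of the partition) that
    contains the given absolute disk LBA."""
    if problem_lba < part_start_lba:
        raise ValueError("LBA lies before the start of the partition")
    # Floor division plays the role of the (int) cast in the formula.
    return (problem_lba - part_start_lba) * sector_size // fs_block_size


# Problem LBA from the self-test log, partition start from fdisk (/dev/sda2).
print(lba_to_fs_block(909913325, 2104515))
```

With these inputs the result is 113476101, which is well inside the filesystem's block count of 241304332 reported by tune2fs.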
humax# debugfs
debugfs 1.41.14 (22-Dec-2010)
debugfs: open /dev/sda2
debugfs: testb 113476101
Block 113476101 not in use
debugfs:

I did wait a while for it to tell me that.

So what do I do now?
 
OK, so I have taken the next step:
humax# dd if=/dev/zero of=/dev/sda2 bs=4096 count=1 seek=113476101
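
With bs=4096, seek=113476101 makes dd skip that many 4096-byte blocks of the partition before writing, so exactly one filesystem block is overwritten with zeros. As a sanity check, the reverse calculation shows which disk sectors that block covers. The Python below is only an illustrative sketch; the function name is hypothetical and the sizes are the ones quoted earlier in this post.

```python
def fs_block_to_lba_range(fs_block, part_start_lba, sector_size=512, fs_block_size=4096):
    """Return (first_lba, last_lba) of the absolute disk sectors that make
    up one filesystem block of the partition."""
    sectors_per_block = fs_block_size // sector_size
    first = part_start_lba + fs_block * sectors_per_block
    return first, first + sectors_per_block - 1


first, last = fs_block_to_lba_range(113476101, 2104515)
# The failing LBA from the self-test log should fall inside this range.
print(first, last, first <= 909913325 <= last)
```

The block spans eight 512-byte sectors, and the failing LBA 909913325 sits inside that span, so the write lands on the sector the drive could not read.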

When I run diskattr again I now have Current_Pending_Sector = 0 and Offline_Uncorrectable = 0.

I will see how the playback goes.

Is it possible that it will have affected some recent recordings?

Thanks
 
Hopefully that will fix your playback problems, nicely done. Every short test failed at the same LBA (the 90% in the log is the portion of the test still remaining when it stopped), so it is likely just the one problem sector, which you've now fixed.
The block wasn't in use so no files will have been affected.

It's worth running fix-disk again to do a full filesystem check and a long disk self test, although that will take a long time.
 
'nicely done' - it was all down to reading and blindly following advice from other postings - so nicely done back at you!

Not sure what fix-disk is. I have re-run the normal Humax HDD test and got a pass. Also, from telnet I ran smartctl --test=short /dev/sda and then smartctl -l selftest /dev/sda, and that showed as completed without errors. I have started the long test (smartctl --test=long /dev/sda) and will have to wait till tomorrow for that to complete. Do I get the results in the same way (smartctl -l selftest /dev/sda)?

Thanks
 
"fix-disk" is a command to type at the humax# prompt. It would have run your fixes more effectively because it stops the normal Humax operations while it works, and it creates some swap space. Not sure where you got the idea to run "fdisk" raw, but the fact that you managed it is "nicely done".

Next time, just get the Telnet humax# prompt and type "fix-disk" (you need CF 2.12+).
 
fix-disk wouldn't have fixed his underlying sector fault; the steps taken were right for that. fix-disk would be useful now to confirm that the filesystem is completely intact.
 
Thanks, I will run fix-disk when the long test has finished and I can get the box to myself, now that it doesn't stutter any more :)

Sorry I didn't look for fix-disk in the wiki.

All the things I did came from this other forum post, http://hummy.tv/forum/threads/hard-drive-failure.2482/, with a bit of help from Google, as rob4x4 had a greater base understanding than me and so didn't need step-by-step instructions.
 
What we need now is an automated process to track down bad sectors!

I guess it depends on the ease of development, frequency of occurrence and the goodwill of someone with the skills to do it!

For someone who hasn't used telnet before, I was able to follow the steps others described in the forums and fix my bad sector. I have yet to look at the results of the long test - the box had switched itself off sometime before its daily early morning wake-up - so I will see if I can get my hands on it tonight. From first appearances it is working properly again.

Without the help of this forum and associated developments I would have had to use the standard menu format option and lost all my recordings so the help is (as always) much appreciated :)
 
"What we need now is an automated process to track down bad sectors!"
It will probably come as an extension to the new disk diagnostics page in the web interface... the number of HDRs approaching the two-year-old mark seems to have triggered a large number of disk problems!
 
Phew, extended offline check completed without errors.

I have run fix-disk (from Windows 7, using the unset crlf option) and I think it all ran OK. One question it asked that I was not expecting was 'lost+found not found. Create?' As everyone needs a lost and found, I said yes!! (Is that okay?)

Is this a sign of an upcoming problem with the disk, which I should look at replacing in a controlled manner, or is it likely to be a one-off?

Thanks again to the usual suspects for their invaluable help :)

Full log from fix-disk pasted below


humax# fix-disk
Checking disk sda

Unmounted /dev/sda1
Unmounted /dev/sda2
Unmounted /dev/sda3

Checking partition /dev/sda3...
e2fsck 1.41.14 (22-Dec-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found. Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda3: |========================================================| 100.0%

/dev/sda3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda3: 13/655776 files (0.0% non-contiguous), 90356/2622611 blocks

Checking partition /dev/sda1...
e2fsck 1.41.14 (22-Dec-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found. Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda1: |========================================================| 100.0%

/dev/sda1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda1: 15/65808 files (13.3% non-contiguous), 15653/263064 blocks

Creating swap file...
Setting up swapspace version 1, size = 1073737728 bytes

Checking partition /dev/sda2...
e2fsck 1.41.14 (22-Dec-2010)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found. Create? yes

Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda2: |========================================================| 100.0%

/dev/sda2: ***** FILE SYSTEM WAS MODIFIED *****
/dev/sda2: 2506/60334080 files (13.5% non-contiguous), 157386878/241304332 blocks
Removing extra swap space.
Are you having problems with a delete loop [Y/N]? n

Finished - type 'reboot' to return to normal operation
humax#
 
You might find some of your recordings have disappeared, when broken references were gathered up into lost+found.
 
Oh :( With the extended offline check being successful I thought I had got away without any damage! Does the log indicate some other stuff was bad, then?
 
That log looks fine. The checker just creates the lost+found directories as part of its run if they aren't already present.
 
I said "might"! However, the log does say the file system was changed, implying to me (at least) that something got repaired. When I ran fix-disk recently, when I was having some difficulties with the custom software installation, some of the CF files disappeared up the lost+found jaxie.

BUT af123 is the guru, don't listen to me over him.
 