sda1 recovery failed - zero length partition

Brian Burch · Dec 29, 2012

xyz321 said:
~~After a reboot are the partitions still present?~~
Edit: I see you haven't rebooted in which case it would be best to use fdisk to confirm that the partition table is correct. If that produces an I/O error then use hexdump to dump the partition table.

Oh dear... I saw your original email, but the edit had not yet arrived!

So I did a reboot. /sys/block/sda still had the correct size, but sda1-3 all disappeared. Not surprisingly, tune2fs didn't find them any more either. I guess the dd of zeros really did something!!!

fix-disk back to maintenance mode... create the partitions, verify, write: error on read, error on write. Instead of cancelling, I said ignore and got this:

Code:

Error: Partition(s) 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64 on /dev/sda have been
written, but we have been unable to inform the kernel of the change, probably
because it/they are in use.  As a result, the old partition(s) will remain in
use.  You should reboot now before making further changes.

I don't understand what this message means, but reboot is the only meaningful option to discover what the partition table really looks like now.

What a relief! /sys/block/sda/sda[1-3] have reappeared and are unchanged. gfdisk still says I don't have a partition table, but I'm very relieved that tune2fs likes sda2 again.

Code:

humax# hexdump -Cv -n 512 /dev/sda
hexdump: /dev/sda: Input/output error

Well, that didn't help much...

xyz321 said:
I would recommend running a smartctl test on the disk - this should find the location of the problem sector(s). You may not be able to install it using the package management system if /dev/sda2 is not accessible or read-only. In this case it can be copied onto the flash drive and run from there.

I didn't know how to do that, but I found Manually_loading_Features_from_USB on the wiki.

Code:

humax# find . -name "smart*"
./media/drive1/smartmontools_5.41_mipsel.opk
./sys/module/psmouse/parameters/smartscroll
humax#

The LED on the usb stick flashed and the humax (in normal mode, after power-cycle) showed a popup about detecting the usb media. However, it doesn't look to me as if the package installed to the flash drive. I will try a different usb stick, but I wonder whether the bad hard disk is causing the installation to fail?

xyz321 · Dec 29, 2012

I think we need to repair the disk before attempting to sort out the partition table. Is /dev/sda2 mounted? If not can it be mounted on /mnt/hd2? Has a log file been created on your USB stick?

smartctl from the smartmontools package is attached - note it is not a zip file, just rename it to smartctl and check that it has execute permissions.
Upload it to a directory in the flash partition /var/lib/humaxtv/mod and run it from there:

Code:

cd /var/lib/humaxtv/mod
./smartctl -a /dev/sda

Edit: smartctl.zip deleted - no longer required

Brian Burch · Dec 29, 2012

xyz321 said:
I think we need to repair the disk before attempting to sort out the partition table. Is /dev/sda2 mounted? If not can it be mounted on /mnt/hd2? Has a log file been created on your USB stick?

smartctl from the smartmontools package is attached - note it is not a zip file, just rename it to smartctl and check that it has execute permissions.
Upload it to a directory in the flash partition /var/lib/humaxtv/mod and run it from there:

Code:

cd /var/lib/humaxtv/mod
./smartctl -a /dev/sda

Yes, I was able to mount /dev/sda2 on /mnt/hd2, and was somewhat surprised to discover that /mnt/hd2/mod/bin/smartctl was there already - I must have installed it when the web interface was working. I didn't need your version, because mine ran OK. I thought it was best to post the report verbatim and then try to analyse it myself. Don't wait for me if you discover anything interesting!

Code:

humax# ./smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [7405b0-smp-linux-2.6.18-7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Pipeline HD 5900.2
Device Model:     ST31000424CS
Serial Number:    5VX2VZJF
LU WWN Device Id: 5 000c50 048d17ad5
Firmware Version: SC13
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec 29 14:34:25 2012 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  633) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 220) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       204765846
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1134
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       19946632
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       882
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       567
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       324
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   050   041   045    Old_age   Always   In_the_past 50 (0 18 53 50)
194 Temperature_Celsius     0x0022   050   059   000    Old_age   Always       -       50 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   047   032   000    Old_age   Always       -       204765846
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 324 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 324 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:44.492  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:44.467  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:43.149  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:43.124  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:41.817  READ DMA

Error 323 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:43.149  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:43.124  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:41.817  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:41.791  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:40.474  READ DMA

Error 322 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:41.817  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:41.791  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:40.474  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:40.448  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:39.151  READ DMA

Error 321 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:40.474  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:40.448  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:39.151  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:39.126  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:37.806  READ DMA

Error 320 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:39.151  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:39.126  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:37.806  READ DMA
  b0 d5 01 e0 4f c2 00 00      00:01:58.966  SMART READ LOG
  ec 00 00 00 00 00 00 00      00:00:07.878  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%       862         16
# 2  Short offline       Completed: read failure       90%       831         16

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

humax#

xyz321 · Dec 29, 2012

There is a fault at sector 16 so try clearing the block with:

Code:

dd if=/dev/zero of=/dev/sda bs=4096 seek=2 count=1

Then run a short smart selftest using:

Code:

smartctl -t short /dev/sda

After this has completed (a few minutes) look at the results using:

Code:

smartctl -a /dev/sda

Edit: changed incorrect seek value above
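
For anyone following along, the seek value in the dd command is plain arithmetic: the smartctl report gives a 512-byte sector size and a failing LBA of 16, so the bad sector starts at byte 8192, which is block 2 when dd uses 4096-byte blocks. A minimal sketch of that calculation follows; the function name lba_to_seek is made up for illustration and is not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Sketch: convert a failing LBA (as reported by smartctl) into a dd seek value.
# Assumes 512-byte logical sectors, as stated in the smartctl report above.

lba_to_seek() {
    lba=$1        # failing sector number, e.g. 16
    bs=$2         # dd block size in bytes, e.g. 4096
    sector=512    # logical sector size in bytes
    # Integer division rounds down, so the resulting block contains the LBA.
    echo $(( lba * sector / bs ))
}

lba_to_seek 16 4096
```

Bear in mind that one 4096-byte block covers eight 512-byte sectors, so with seek=2 the dd command zeroes sectors 16 to 23, not just sector 16. Whatever was stored there is lost.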

Brian Burch · Dec 29, 2012

xyz321 said:
There is a fault at sector 16 so try clearing the block with:

Code:

dd if=/dev/zero of=/dev/sda bs=4096 seek=2 count=1

Then run a short smart selftest using:

Code:

smartctl -t short /dev/sda

After this has completed (a few minutes) look at the results using:

Code:

smartctl -a /dev/sda

Edit: changed incorrect seek value above

This time I noticed your edit and so used the correct seek value.

Code:

humax# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [7405b0-smp-linux-2.6.18-7.1] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Pipeline HD 5900.2
Device Model:     ST31000424CS
Serial Number:    5VX2VZJF
LU WWN Device Id: 5 000c50 048d17ad5
Firmware Version: SC13
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec 29 16:47:30 2012 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  633) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 220) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103b) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   098   092   006    Pre-fail  Always       -       204863538
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1134
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       19947320
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       884
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       567
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       324
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   046   041   045    Old_age   Always   In_the_past 54 (0 18 54 47)
194 Temperature_Celsius     0x0022   054   059   000    Old_age   Always       -       54 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   047   032   000    Old_age   Always       -       204863538
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 324 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 324 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:44.492  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:44.467  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:43.149  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:43.124  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:41.817  READ DMA

Error 323 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:43.149  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:43.124  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:41.817  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:41.791  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:40.474  READ DMA

Error 322 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:41.817  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:41.791  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:40.474  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:40.448  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:39.151  READ DMA

Error 321 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:40.474  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:40.448  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:39.151  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:39.126  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:37.806  READ DMA

Error 320 occurred at disk power-on lifetime: 881 hours (36 days + 17 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 10 00 00 00  Error: UNC at LBA = 0x00000010 = 16

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 20 00 00 00 e0 00      00:03:39.151  READ DMA
  ec 00 00 10 00 00 a0 00      00:03:39.126  IDENTIFY DEVICE
  c8 00 20 00 00 00 e0 00      00:03:37.806  READ DMA
  b0 d5 01 e0 4f c2 00 00      00:01:58.966  SMART READ LOG
  ec 00 00 00 00 00 00 00      00:00:07.878  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       884         -
# 2  Short offline       Completed: read failure       90%       862         16
# 3  Short offline       Completed: read failure       90%       831         16

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
humax#

"SMART Self-test log" entry number 1 looks as if the error on LBA 16 has been corrected.

I will go back to maintenance mode now and try to define the partition table again.
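
As a rough sketch of how the same check could be scripted rather than read by eye, the function below looks only at entry "# 1" of the self-test log and succeeds if it says "Completed without error". The function name latest_selftest_ok is hypothetical, and the line format is assumed to match the smartctl 5.41 output pasted above.

```shell
#!/bin/sh
# Sketch: succeed only if the newest self-test log entry ("# 1") completed
# without error. Reads smartctl self-test log text on stdin.

latest_selftest_ok() {
    # First grep keeps only entry number 1; second grep checks its status text.
    grep '^# *1 ' | grep -q 'Completed without error'
}

# Example usage on the box (assumes smartctl is on the PATH):
#   smartctl -l selftest /dev/sda | latest_selftest_ok && echo "latest self-test passed"
```

This only reports on the most recent test, so older "read failure" entries further down the log do not affect the result.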

Brian Burch · Dec 29, 2012

Brian Burch said:
I will go back to maintenance mode now and try to define the partition table again.

There was no need! When I rebooted in maintenance mode and used gfdisk to print the current partition table, it showed:

Code:

Disk /dev/sda: 1000 GB, 1000202273280 bytes
255 heads, 63 sectors/track, 121601 cylinders, total 1953520065 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start         End      Blocks   Id  System 
/dev/sda1               2     2104514     1052226   83  Linux
/dev/sda2         2104515  1932539174   965209297   83  Linux
/dev/sda3      1932539175  1953520064    10482412   83  Linux
Command (m for help): v                                                   
Information: 5104 unallocated sectors

This is what I have been trying to achieve for the last several days! It seems the unreadable sector at LBA 16 was preventing the partition table from being read properly.

Now I need to go back to the very beginning of this post and format /dev/sda1.

Brian Burch · Dec 29, 2012

Brian Burch said:
Now I need to go back to the very beginning of this post and format /dev/sda1.

Code:

mkfs.ext3 -m 0 -O sparse_super /dev/sda1

That worked fine.

After rebooting, a simple "df" showed me all three /dev/sdx's were mounted successfully.

Code:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/root                17280     17280         0 100% /
tmpfs                    62492        36     62456   0% /tmp
tmpfs                    62492         0     62492   0% /media
/dev/mtdblock1            2048       476      1572  23% /var/lib/humaxtv
/dev/mtdblock2            2048      1260       788  62% /var/lib/humaxtv_backup
/dev/sda1              1035692     34108   1001584   3% /mnt/hd1
/dev/sda2            950070404 365985224 535824316  41% /mnt/hd2
/dev/sda3             10325780    154264   9646996   2% /mnt/hd3

Also, the nasty "you need to format the HDD" panel has disappeared from my TV screen. I can use the humax remote control to scroll through my (no longer hidden) recorded programs and the first few I've tried play OK.

That was quite a roller-coaster ride. I like to think I know a lot about linux, but my understanding fades as I get nearer to "the metal". Thank you very much xyz321 for your advice, encouragement and knowledge. Is there any way I could buy you a couple of beers in gratitude?

xyz321 · Dec 29, 2012

It's good to know that it is all working. The epg database is the only thing stored on /dev/sda1 so that should be recreated. There is also a copy of the epg used by the web interface on that partition which I think will also be recreated.

It is interesting that the reallocated sector count is still zero even after fixing the bad sector.

Brian Burch · Dec 29, 2012

I have a few "debriefing" questions:

In hindsight, should we have recognised the root of this problem earlier? I think not because a) I didn't remember that I had already installed smartctl, and b) the symptoms were associated with the partition table at block zero.
Shouldn't the drive have self-corrected the problem automatically, or is that too much to expect?
Shouldn't there be a way for ordinary users, especially those who don't have customised firmware, to realise the dreaded "reformat your disk" action is not actually necessary? (I know they don't have any alternative actions, but perhaps they should).
In your experience, are there a lot of humax users trashing all of their recordings because of this class of potentially recoverable error? I will own up and say that this is my second humax hdr fox t2 1TB machine in 2012. The first one failed with exactly the same symptoms when it was about 2 months old. My wife said it was not acceptable (don't dare argue!), so we took it back to that lovely store that isn't knowingly undersold. They apologised profusely and handed us a new machine and a small discount voucher. Perhaps the second machine was part of the same manufacturing batch?
Why are the sda2 tune2fs maxmountcount (34) and checkinterval (6 months) defaults so slack? I use much tighter values on all my conventional linux systems, especially the servers that don't get rebooted from month to month.
When I first hit this problem, tune2fs said that sda2 had been mounted about 200 times (I've mislaid the piece of paper!) since it was last checked. Was this over-run due to the disk error that we have just fixed, or is the check not being triggered at all? I am unsure about which hardware state changes would cause the humax linux system to be booted.

xyz321 · Dec 29, 2012

Brian Burch said:
In hindsight, should we have recognised the root of this problem earlier? I think not because a) I didn't remember that I had already installed smartctl, and b) the symptoms were associated with the partition table at block zero.

Yes, we should have used smartctl as soon as the I/O errors started appearing. I think smartctl should be included as part of the custom firmware in case someone has a faulty machine which doesn't have it installed or /dev/sda2 is not mountable.

Brian Burch said:
Shouldn't the drive have self-corrected the problem automatically, or is that too much to expect?

Probably, but others have had similar problems before which didn't seem to self-correct.

Brian Burch said:
Shouldn't there be a way for ordinary users, especially those who don't have customised firmware, to realise the dreaded "reformat your disk" action is not actually necessary? (I know they don't have any alternative actions, but perhaps they should).

One for Humax I think.

Brian Burch said:
In your experience, are there a lot of humax users trashing all of their recordings because of this class of potentially recoverable error? I will own up and say that this is my second humax hdr fox t2 1TB machine in 2012. The first one failed with exactly the same symptoms when it was about 2 months old. My wife said it was not acceptable (don't dare argue!), so we took it back to that lovely store that isn't knowingly undersold. They apologised profusely and handed us a new machine and a small discount voucher. Perhaps the second machine was part of the same manufacturing batch?

I notice there is a post over on MyHumax from someone who has just reformatted their disk and therefore lost their recordings after seeing the "reformat your disk" message on screen.

Brian Burch said:
Why are the sda2 tune2fs maxmountcount (34) and checkinterval (6 months) defaults so slack? I use much tighter values on all my conventional linux systems, especially the servers that don't get rebooted from month to month.

The startup scripts on the Humax do not use these settings so they will not be effective anyway. Presumably the assumption is that the ext3 journal will take care of all potential problems. If the checks were to be added to the startup it would delay startup which would mean failed recordings since a check on sda2 takes at least 30 minutes.

Brian Burch · Dec 29, 2012

Thanks for your comprehensive replies to my questions.

xyz321 said:
The startup scripts on the Humax do not use these settings so they will not be effective anyway. Presumably the assumption is that the ext3 journal will take care of all potential problems. If the checks were to be added to the startup it would delay startup which would mean failed recordings since a check on sda2 takes at least 30 minutes.

Well, based on this recent experience, I will regularly:

run smartctl and analyse the report for any signs of drive failure.
periodically run e2fsck on my /dev/sdx partitions - when they are not mounted, of course!
backup decrypted versions of my more precious recordings on another system.
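
The "when they are not mounted" condition in the list above can be checked before e2fsck is started. This is only a sketch: is_mounted and safe_fsck are hypothetical helper names, and it assumes /proc/mounts is available and lists the partition under the same device name that is passed in.

```shell
#!/bin/sh
# Return 0 if the device appears as a mount source in the mounts table.
# The second argument defaults to /proc/mounts; it is a parameter only so
# that the function can be tried against a sample file.
is_mounted() {
    dev=$1
    table=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# Hypothetical wrapper: refuse to check a partition that is still mounted.
safe_fsck() {
    if is_mounted "$1"; then
        echo "$1 is mounted; not running e2fsck" >&2
        return 1
    fi
    e2fsck -f "$1"
}

# Example: safe_fsck /dev/sda2
```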

I'm very grateful for all your help... you didn't answer the beer question!

xyz321 · Dec 29, 2012

Brian Burch said:
run smartctl and analyse the report for any signs of drive failure.

A smartctl report is also available from the web interface diagnostic page "Hard Disk". The main parameters to worry about are those with id 197 & 198. They will appear red if non-zero and this is when I/O errors will appear.
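
From the command line, the same two attributes can be picked out of the attribute table. The sketch below assumes the usual ten-column layout printed by smartctl -A, with the raw value in the last column; suspect_sectors is a made-up name for illustration.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print attributes 197 and 198
# when their raw value (column 10) is non-zero.
# Exit status is 0 if at least one was non-zero, 1 otherwise.
suspect_sectors() {
    awk '($1 == 197 || $1 == 198) && $10 + 0 > 0 { print $2 "=" $10; bad = 1 }
         END { exit !bad }'
}

# Example:
#   smartctl -A /dev/sda | suspect_sectors && echo "pending or uncorrectable sectors reported"
```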

af123 · Dec 29, 2012

xyz321 said:
There is also a copy of the epg used by the web interface on that partition which I think will also be recreated.

Yes, it will be. There's nothing important on that partition. It might take a full reboot or two for everything to drop into place.

It is interesting that the reallocated sector count is still zero even after fixing the bad sector.

Writing to the sector makes the firmware re-evaluate it. In my experience it's as likely to decide that it's OK now as to reallocate it.

These AV drives seem to exhibit strange behaviour in the presence of suspect sectors.

In the OP's case, the sector will have been reported as unreadable by the firmware, hence the I/O errors.

Congratulations to both of you for the persistence and getting it back up and running!

Btw, smartctl is automatically installed with recent versions of the webif package to support the disk diagnostics screens.

Brian Burch · Dec 29, 2012

af123 said:
Btw, smartctl is automatically installed with recent versions of the webif package to support the disk diagnostics screens.

Yes, now you mention it I can see it and run it from the web interface - of course, with my "you must reformat" problem, the web i/f was not available. However, I had not previously realised the extra packages were installed on /mnt/hd2/mod. If something nasty happens in future, I will check there first.

I will continue to monitor this hard disk drive. I have another 5 months before the 12 month warranty runs out, so I have plenty of time to decide whether to get the disk replaced. I checked the reference on this forum to http://www.humaxdigital.com/uk/registration/Registrations.aspx (wow! I qualify for posting URLs at last!), and it seems the 24 month warranty evaporates if you don't register within 30 days of purchase.

You keep ignoring my "beer question"... I really appreciate all your help with my problem - I wouldn't have got there on my own. It will, I am sure, make you feel more satisfied to know that I contribute to several other open source forums, so my repayment to you is in the form of helping others. That's the way it ought to work, isn't it?

af123 · Dec 29, 2012

xyz321 said:
I think smartctl should be included as part of the custom firmware in case someone has a faulty machine which doesn't have it installed or /dev/sda2 is not mountable.

I agree. One for the next version if it isn't too big.

(Your beer question was to xyz321 which is why I am not responding to it : )

Black Hole · Dec 29, 2012

Brian Burch said:
Shouldn't there be a way for ordinary users, especially those who don't have customised firmware, to realise the dreaded "reformat your disk" action is not actually necessary? (I know they don't have any alternative actions, but perhaps they should).

Not really. Very few of the PVR-buying public would be up to even being helped to sort out problems like this; it requires far too much time, patience, and keeping your head together. Anyone with enough interest to do it is likely to hit Google and find us.

The general public has the nuclear options of restoring factory defaults and disk format (unfortunately the format option doesn't always cope with even factory installed 1TB drives), and if they don't fix it a 2 year warranty as a back stop. "Precious" recordings shouldn't be stored on the PVR, there are adequate options to move them to an external drive, but as I frequently point out it's only telly.

Regarding financial contributions, you can donate towards the upkeep of the Hummy.tv forum in the front page, or to the upkeep of the Remote Scheduling web service on the front page there. And yes, in my opinion helping out where you are able to (even if elsewhere) does repay the "debt" to the wider community.

prpr · Dec 30, 2012

af123 said:
I agree. One for the next version if it isn't too big.

What about dd?
Smartctl detects the problem, but dd fixes it.

Brian Burch · Jan 3, 2013

af123 said:
(Your beer question was to xyz321 which is why I am not responding to it : )

I wasn't paying enough attention and my relief at making progress caused me to see "af123" and "xyz321" as the same when in a small font in the left margin.

Sorry! My thanks to you too! I really appreciate the help everyone gave me, especially over Christmas. I would probably have given up and reformatted the disk if it had not been for you guys.

Brian Burch · Jan 3, 2013

Black Hole said:
Not really. Very few of the PVR-buying public would be up to even being helped to sort out problems like this; it requires far too much time, patience, and keeping your head together. Anyone with enough interest to do it is likely to hit Google and find us.

The general public has the nuclear options of restoring factory defaults and disk format (unfortunately the format option doesn't always cope with even factory installed 1TB drives), and if they don't fix it a 2 year warranty as a back stop. "Precious" recordings shouldn't be stored on the PVR, there are adequate options to move them to an external drive, but as I frequently point out it's only telly.

Regarding financial contributions, you can donate towards the upkeep of the Hummy.tv forum in the front page, or to the upkeep of the Remote Scheduling web service on the front page there. And yes, in my opinion helping out where you are able to (even if elsewhere) does repay the "debt" to the wider community.

Thanks for explaining so patiently. I expect you've explained this elsewhere, but it is a useful postscript to my story.

With some regrets, I am inclined to agree with all your observations. It is an interesting general point now that "smart" devices are becoming so pervasive in our society (many of which are running embedded linux, incidentally). The user interfaces are designed to give the impression that everything is simple, yet the underlying logic is very sophisticated. The average user has very little awareness of what is happening when doing something "intuitively obvious" such as posting a photo on a smartphone to facebook. All that complex software is certain to be buggy and if it is so hard to debug, where is the commercial incentive to get it fixed? I think problems will increasingly be treated as "brand image marketing issues", with two main severity classes "serious" or "ignore".

Forums such as this need to exist for all the problems below the "iceberg water line", otherwise even knowledgeable users will be left with only the "nuclear options". I'm really pleased to have discovered you guys! Perhaps I'll meet you, with reversed roles, on some other forum.

Brian Burch · Jan 3, 2013

prpr said:
What about dd?
Smartctl detects the problem, but dd fixes it.

That is true, but without smartctl runnable on the machine, how would one know what was wrong and exactly which block to zap with dd?
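
To illustrate how the two tools fit together: a smartctl self-test log reports the LBA of the first error, and that number becomes the seek offset for dd. The sketch below only prints the dd command instead of running it, because writing zeros destroys whatever was in that sector. It assumes 512-byte logical sectors, and zap_command is a hypothetical helper name.

```shell
#!/bin/sh
# Print (do not run) a dd command that would overwrite one 512-byte sector
# at the given LBA with zeros. Assumes 512-byte logical sectors.
zap_command() {
    dev=$1
    lba=$2
    case $lba in
        ''|*[!0-9]*)
            echo "LBA must be a non-negative integer" >&2
            return 1
            ;;
    esac
    echo "dd if=/dev/zero of=$dev bs=512 seek=$lba count=1"
}

# Rough workflow:
#   smartctl -t long /dev/sda       # start an extended self-test
#   smartctl -l selftest /dev/sda   # later, read LBA_of_first_error from the log
#   zap_command /dev/sda <LBA>      # review the printed command before running it
```

Double-check the device name and LBA before running the printed command, since a wrong offset overwrites good data.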
