fix-disk can't find pending sector LBA - SMART vagary

/df · Jul 18, 2018

Along with various other people, I've seen this problem with fix-disk:

!nd!go said:
... running fix-disk in maintenance mode was my best bet. However when I run it I get the following message "Error - pending sectors but LBA not found" and the process can't complete. This was the same when running the short or long check.
...

aldaweb said:
... run fixdisk from maintenance mode, but it terminates with the following:

Error - Pending sectors but LBA not found
fix-disk: session terminated with exit status 1

Press return to continue
...

free30 said:
I get the following message when trying to fix a pending sector in maintenance mode.
"Error - pending sectors but LBA not found"
It then goes no further. ...

moosey said:
...
I ran fixdisk without any options, and it first ran a short disk test and said that the LBA has not yet been found and asked to run a long test. This I did but at the end it said "Error - pending sectors but LBA not found", "fix-disk: session terminated with exit status 1".
...

In other cases posted on the forum it is clearly working as intended.

The problem that I've seen seems to come down to the behaviour of the SMART firmware. The logic in fix-disk's run_smarttest() seems to be this:

1. start the test
2. if 'no polling (-n)', wait for the recommended SMART polling interval for the test
3. otherwise, after a short interval, start polling the relevant SMART log to see if the test is still in progress; if not, assume it has completed and stop polling
4. examine the expected "Completed" SMART log entry to find the bad LBA.

This logic depends on the SMART firmware writing a "Completed" entry at the end of the test, which seems to happen reliably for me. The #3 logic also depends on the SMART firmware writing an "in progress" entry within a certain interval (2s). This doesn't always happen, for some combinations of disk models, SMART firmware and disk status. I'm sure I did see "in progress" log entries before but currently I'm not seeing any.

The #2 logic may fail if the failed LBA is at the end of the disk (measured by ascending LBA) and the test takes somewhat longer than the SMART firmware predicted (maybe due to retries), but that's probably inherent in the -n option. In that case the check at #4 would see an "in progress" entry, or a previous "Completed" or "aborted" entry, and so fail, unless a previous "Completed" entry (maybe from an earlier run of fix-disk -n) happened to mention the bad LBA sought.

The #3 logic may fail because, for some combinations of disk model, SMART firmware and disk status, the SMART log never shows an "in progress" entry, or shows it only after a longer delay than allowed for. Polling then stops too early and #4 examines the top of the log before the new test has written its "Completed" entry. The result is that fix-disk appears to give up on finding the failed LBA if the last SMART test run before fix-disk was aborted or found no error, or if the "in progress" entry has appeared by then: in each case the entry at the top of the log is not the "Completed" entry that #4 expects.
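To illustrate why the check at #4 is so sensitive to what sits at the top of the log, here is a minimal sketch of such a check. It is not fix-disk's actual code: the function name top_entry_lba is made up, and it assumes the usual smartctl -l selftest layout where the newest entry starts with "# 1" and the LBA of the first error is the last column.

```shell
# Hypothetical helper, not taken from fix-disk: print the LBA from the
# newest ("# 1") self-test log entry, but only if that entry is a
# completed test that hit a read failure. Returns non-zero otherwise.
top_entry_lba() {
  # $1: text in the style of `smartctl -l selftest` output (assumed layout)
  printf '%s\n' "$1" | awk '
    /^# *1 / {
      # Only trust the newest entry if it reports a read failure with a numeric LBA
      if ($0 ~ /Completed: read failure/ && $NF ~ /^[0-9]+$/) { print $NF; found = 1 }
      exit
    }
    END { exit !found }'
}
```

If the newest entry is "in progress", "Aborted by host" or "Completed without error", the function prints nothing, even when an older entry further down does carry the LBA. That is the situation described above.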

Maybe this also differs according to 'disk model, SMART firmware and disk status', but I found that the "Self-test execution status" output from smartctl -c was a reliable indicator of the test status. Thus the modified #3 logic as below:

1. start the test
2. if 'no polling (-n)', wait for the SMART polling interval for the test
3. otherwise:
   - poll the SMART "Self-test execution status" until it shows that it's in progress
   - wait for the SMART "Self-test execution status" to show that it's completed
4. examine the expected "Completed" SMART log entry to find the bad LBA.

My modified run_smarttest() routine replaces lines 463-479 of fix-disk 2017-01-07 as follows:

Code:

  i=$testtime    # recommended wait time read from smartctl -c; could add a margin for if/elif cases below?
  started=0      # in progress flag
  while true; do
    echo -ne "Waiting... $((i--))   \r"
    if [ $no_poll -eq 1 ]; then
      [ $i -le 0 ] && break
    elif [ $((cap_odc & 4)) -ne 0 ]; then
      # "Abort Offline collection upon new command.", ie polling would stop test: why not treat this like -n above?
      break
    else
      case "$(smartctl -c /dev/${dev} | grep "^Self-test execution status: ")" in

      *\ in\ progress*)
        [ $started -eq 0 ] && { started=1; i=$testtime; }
        # could be cleaner to sleep longer here for large i before polling SMART again?
        ;;

      *\ completed*)
        [ $started -ne 0 ] && break
        i=$((i+1))      # previous test status, go round again until test in progress
        ;;

      *)  # weird status
        if [ $started -ne 0 ]; then
          echo -e "\nDisk self test reports unexpected status." | tee -a $TLOGFILE
          smartctl -c /dev/${dev} | grep -A 1 "^Self-test execution status: " | tee -a $TLOGFILE
          exit 1
        fi
        ;;
      esac
      [ $i -lt 0 ] && i=0
    fi
    sleep 1  # sleep after polling SMART in case "in progress" is very short
  done
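The case patterns in the loop can be tried out without a disk. The sketch below pulls them into a small function and feeds it sample status lines; the name selftest_state is invented for illustration, and the sample strings are my assumption of what smartctl -c typically prints, so check them against your own output.

```shell
# Hypothetical helper: classify one "Self-test execution status" line
# using the same patterns as the case statement in the loop.
selftest_state() {
  case "$1" in
    *" in progress"*) echo in_progress ;;
    *" completed"*)   echo completed ;;
    *)                echo other ;;      # e.g. aborted or interrupted
  esac
}

# Sample lines (assumed wording, not captured from a real run)
selftest_state "Self-test execution status:      ( 249) Self-test routine in progress..."
selftest_state "Self-test execution status:      (   0) The previous self-test routine completed"
selftest_state "Self-test execution status:      (  25) The self-test routine was aborted by"
```

Note that "completed" covers both a previous test and the one just run, which is why the loop needs the started flag to tell them apart.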

I hope this may be a useful input to the next fix-disk. Without needing to go into maintenance mode, the output of the following commands would indicate how reliable the modified logic is (I've included what I hope is an automatic way of finding the disk, but you could set $disk manually).

Code:

disk=$(df /mod | tail -n +2)
disk=${disk%%[0-9]*}
echo $disk # should be /dev/sda, /dev/sdb, etc
smartctl -t short $disk
smartctl -l selftest $disk | grep "^# 1 "
# does the output show "in progress"? If not, repeat this command after the next one
smartctl -c $disk | grep "^Self-test execution status: "
# does the output show "in progress"?
smartctl -X $disk # cancel the test
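For anyone unsure what the ${disk%%[0-9]*} step does, here is a small sketch that can be tested on a pasted df line. The function name whole_disk is made up, and it assumes sdX-style device names, so it would give the wrong answer for a device whose name contains a digit before the partition number.

```shell
# Hypothetical helper (not part of fix-disk): reduce one df output line
# to the whole-disk device, e.g. "/dev/sda2 ..." -> "/dev/sda".
whole_disk() {
  dev=${1%% *}            # keep only the first field (the device node)
  echo "${dev%%[0-9]*}"   # drop everything from the first digit onwards
}
```

On the box itself it could be called as whole_disk "$(df /mod | tail -n 1)".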

/df · Jul 21, 2018

And another thing ...

Should the -y option also automatically accept this prompt:

Code:

LBA has not yet been found
A long test is required - this could take 6 hour(s) 37 minutes

In the fix-disk shipped with CF 3.13 it doesn't.

Black Hole · Jul 22, 2018

I think not. Clearly this is a case for user decision.

/df · Aug 22, 2018

/df said:
...
My modified run_smarttest() routine replaces lines 463-479 of fix-disk 2017-01-07 as follows:
...

I should have said:

...
My modified run_smarttest() routine replaces lines 486-507 of fix-disk 2017-02-09 as follows:
...

It seems that the fix-disk package in the repository is an older version (0.5 2016-03-11) than that in HDR CF 3.13 (2017-02-09) and also different from the version (2017-01-07) that I originally had (maybe left-over from 3.12, or fix-flash-packages?). For instance the package version doesn't support the Cancel menu item in Maintenance mode. However the modified lines are the same in all 3 versions.

Is there a mechanism for setting dependencies between CF and add-on package versions?

What would happen after running diag fix-flash-packages? Presumably outdated-flash-package(s).

At any rate there is an opportunity to upgrade fix-disk-0.5 by applying something like the mod I proposed above to the CF version.

Edit:
Another outdated package in the repository:

prpr said:
There is a version of hdparm built into the CF ... There is also an installable package, which resides in /mod/sbin/hdparm when installed. This is a 'useless' package as the file is already built in to the CF. ...

CF hdparm 9.48 vs repo 9.43.
