Along with various other people, I've seen this problem with fix-disk:
... running fix-disk in maintenance mode was my best bet. However when I run it I get the following message "Error - pending sectors but LBA not found" and the process can't complete. This was the same when running the short or long check.
...
... run fixdisk from maintenance mode, but it terminates with the following:
Error - Pending sectors but LBA not found...
fix-disk: session terminated with exit status 1
Press return to continue
I get the following message when trying to fix a pending sector in maintenance mode.
"Error - pending sectors but LBA not found"
It then goes no further. ...
...
I ran fixdisk without any options, and it first ran a short disk test and said that the LBA has not yet been found and asked to run a long test. This I did but at the end it said "Error - pending sectors but LBA not found", "fix-disk: session terminated with exit status 1".
...
In other cases posted on the forum it is clearly working as intended.
The problem that I've seen seems to come down to the behaviour of the SMART firmware. The logic in fix-disk's run_smarttest() seems to be this:
1. start the test
2. if 'no polling (-n)', wait for the recommended SMART polling interval for the test
3. otherwise, after a short interval, start polling the relevant SMART log to see if the test is still in progress; if not, assume completed and stop polling
4. examine the expected "Completed" SMART log entry to find the bad LBA.
This logic depends on the SMART firmware writing a "Completed" entry at the end of the test, which seems to happen reliably for me. The #3 logic also depends on the SMART firmware writing an "in progress" entry within a certain interval (2s). This doesn't always happen, for some combinations of disk models, SMART firmware and disk status. I'm sure I did see "in progress" log entries before but currently I'm not seeing any.
The #2 logic may fail if the failed LBA is at the end of the disk (measured by ascending LBA) and the test has taken somewhat longer than the SMART firmware predicted (maybe due to retries). But that's probably a feature of the -n option. The check at #4 could fail, seeing an "in progress" entry or a previous "Completed" or "aborted" entry, unless the "Completed" entry happened to mention the bad LBA sought, maybe from a previous run of fix-disk -n.
The #3 logic may fail because for some combinations of disk model, SMART firmware and disk status, the SMART log may never show any "in progress" entry, or the entry may appear after a longer delay than allowed for. This causes the #3 logic to appear to give up on finding the failed LBA if the last SMART test run before fix-disk was either aborted or found no error, or if the "in progress" entry has now appeared, because the entry at the top of the log expected by #4 doesn't exist, or is from an earlier successful test.
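To illustrate why a stale or unfinished top entry ends in "LBA not found", here is a rough sketch of the kind of check that #4 implies. The function name top_entry_lba is made up for this post and is not taken from fix-disk; it also assumes that the failing LBA is the last column of the "# 1" line in the smartctl -l selftest output, which is how my disks report it.

```shell
# Hypothetical helper (not fix-disk's own code): print the LBA recorded in
# the newest self-test log entry, or fail if that entry has no usable LBA.
# $1: text of the self-test log, as printed by smartctl -l selftest
top_entry_lba() {
    # Newest entry is the line numbered "# 1"; fail if there is none.
    entry=$(printf '%s\n' "$1" | grep '^# 1 ') || return 1
    # Only a "Completed" entry can name the failing LBA; "in progress"
    # or "Aborted" entries are rejected here.
    case "$entry" in
        *Completed*) ;;
        *) return 1 ;;
    esac
    # Assumption: the LBA is the last column, "-" when no error was found.
    lba=$(printf '%s\n' "$entry" | awk '{print $NF}')
    case "$lba" in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo "$lba"
}
```

Called as top_entry_lba "$(smartctl -l selftest /dev/sda)", it fails whenever the top entry is "in progress", aborted, or an earlier clean result, which are the cases described above.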
Maybe this also differs according to 'disk model, SMART firmware and disk status', but I found that the "Self-test execution status" output from smartctl -c was a reliable indicator of the test status. Thus the modified #3 logic as below:
1. start the test
2. if 'no polling (-n)', wait for the SMART polling interval for the test
3. otherwise:
   - poll the SMART "Self-test execution status" until it shows that it's in progress
   - wait for the SMART "Self-test execution status" to show that it's completed
4. examine the expected "Completed" SMART log entry to find the bad LBA.
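The status test at the heart of the modified #3 can be pulled out into a small helper, which makes it easy to try against canned text without touching a disk. This is only a sketch: classify_status is a name invented here, and the wording it matches ("in progress", "completed") is what my smartctl -c prints; other firmware may phrase it differently.

```shell
# Hypothetical helper (not part of fix-disk): classify the
# "Self-test execution status" line taken from smartctl -c output.
# $1: the status line; prints in_progress, completed or unknown.
classify_status() {
    case "$1" in
        *" in progress"*) echo in_progress ;;
        *" completed"*)   echo completed ;;
        *)                echo unknown ;;
    esac
}
```

Typical use would be classify_status "$(smartctl -c $disk | grep '^Self-test execution status: ')", treating "completed" as meaningful only after "in_progress" has been seen at least once.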
My modified run_smarttest() routine replaces lines 463-479 of fix-disk 2017-01-07 as follows:
Code:
i=$testtime   # recommended wait time read from smartctl -c; could add a margin for if/elif cases below?
started=0     # in progress flag
while true; do
    echo -ne "Waiting... $((i--)) \r"
    if [ $no_poll -eq 1 ]; then
        # -n: never poll, just count down the recommended wait time
        [ $i -le 0 ] && break
    elif [ $((cap_odc & 4)) -ne 0 ]; then
        # "Abort Offline collection upon new command.", ie polling would stop test: why not treat this like -n above?
        break
    else
        case "$(smartctl -c /dev/${dev} | grep "^Self-test execution status: ")" in
        *\ in\ progress*)
            # first sighting of the running test: restart the countdown
            [ $started -eq 0 ] && { started=1; i=$testtime; }
            # could be cleaner to sleep longer here for large i before polling SMART again?
            ;;
        *\ completed*)
            [ $started -ne 0 ] && break
            i=$((i+1)) # previous test status, go round again until test in progress
            ;;
        *) # weird status
            if [ $started -ne 0 ]; then
                echo -e "\nDisk self test reports unexpected status." | tee -a $TLOGFILE
                smartctl -c /dev/${dev} | grep -A 1 "^Self-test execution status: " | tee -a $TLOGFILE
                exit 1
            fi
            ;;
        esac
        [ $i -lt 0 ] && i=0
    fi
    sleep 1 # sleep after polling SMART in case "in progress" is very short
done
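On the question in the comment above about sleeping longer for large i, one possible approach is to choose the delay from the remaining time. This is a sketch only: poll_delay is a made-up name and the thresholds are arbitrary guesses, not values from fix-disk.

```shell
# Hypothetical helper: how many seconds to sleep before polling SMART again.
# $1: estimated seconds remaining in the test. Thresholds are arbitrary.
poll_delay() {
    remaining=$1
    if [ "$remaining" -gt 600 ]; then
        echo 60     # long test, far from the end: poll once a minute
    elif [ "$remaining" -gt 60 ]; then
        echo 10     # getting closer: poll every 10 seconds
    else
        echo 1      # near the end, or not started yet: poll every second
    fi
}
```

If something like this were used in place of the fixed sleep 1 once started is set, the countdown i would need to be reduced by the same delay rather than by 1 per pass, otherwise the "Waiting..." figure would run slow.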
I hope this may be a useful input to the next fix-disk. Without needing to go into maintenance mode, the output of the following commands would indicate how reliable the modified logic is (I've included what I hope is an automatic way of finding the disk, but you could set $disk manually).
Code:
disk=$(df /mod | tail -n +2)   # df line for the filesystem holding /mod
disk=${disk%%[0-9]*}           # strip everything from the partition number onwards
echo $disk # should be /dev/sda, /dev/sdb, etc
smartctl -t short $disk
smartctl -l selftest $disk | grep "^# 1 "
# does the output show "in progress"? If not, try it again after the next one
smartctl -c $disk | grep "^Self-test execution status: "
# does the output show "in progress"?
smartctl -X $disk # cancel the test