I think my box is dying

prpr · Jul 11, 2021

jerrytaff said:
What I really meant is to ask

At school, when doing exams, they told us to read the question and answer the same one, not a different one.
There is often an art in asking the right question.

jerrytaff said:
whether the data is accessible in an easily readable form, and if so where?

It's in the 'temp' table in an Sqlite3 database in /mod/monitor/monitor.db
Whether that makes it easily readable to you I couldn't say, but it's in a relatively straightforward format. There is certainly plenty of it, so the next step depends on exactly what you are trying to retrieve.
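
For a first look without loading the whole table anywhere, something along these lines may help. It is only a sketch: it assumes the sqlite3 command-line tool is available on the box, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes the sqlite3 command-line tool is installed.

# Print the CREATE statement for one table, to learn its real column names.
table_schema() {
    sqlite3 "$1" ".schema $2"
}

# Print the number of rows in one table.
table_rows() {
    sqlite3 "$1" "SELECT count(*) FROM \"$2\";"
}
```

For example, `table_schema /mod/monitor/monitor.db temp` followed by `table_rows /mod/monitor/monitor.db temp`.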

Black Hole · Jul 11, 2021

prpr said:
It's in the 'temp' table in an Sqlite3 database in /mod/monitor/monitor.db

Is there a specific reason this does not appear in the database browser list?

MymsMan · Jul 11, 2021

jerrytaff said:
When going back far enough to see the problem, the graphic only takes one sample point per day.

To prevent the database growing too big, I believe the sample data is pruned after a while, so you wouldn't see the full details even by looking in the database.

prpr · Jul 11, 2021

Black Hole said:
Is there a specific reason this does not appear in the database browser list?

You wouldn't want to load the thing onto a web page. It would kill the CPU on the Humax and probably your browser too.
There are nearly 2 million rows just in the 'temp' table on mine.

Black Hole · Jul 11, 2021

xyz321 · Jul 12, 2021

Mine has about 3700 entries dating back to 2017.

prpr · Jul 12, 2021

Interesting...
I've just checked the 4 HDRs I manage and find one has 4904, the other three have 1.5M, 1.9M and 2.1M items in 'temp'.
And similar in 'vmstat', except that one of them has a corrupt table and is thus unreadable.
Do you know how the expiry is supposed to work? 'Cos obviously it isn't working properly.

xyz321 · Jul 12, 2021

I haven't yet found any expiry code.

xyz321 · Jul 12, 2021

Black Hole said:
Great, I wasn't sure of that and have never been prepared to try it. So you end up with the WebIF download screen?

There seems to be a bug somewhere. Initially it comes up with the 'no internal disk found' web page, but after waiting a minute or so and going back to the 'Still initialising' page, it then correctly shows the WebIF installation screen.
It was taking quite a while to delete the contents of /mod so I think there may be a race condition.

prpr · Jul 12, 2021

xyz321 said:
I haven't yet found any expiry code.

Maintenance seems to be achieved via "/mod/monitor/run -p" and "... -r" with "-d" to see what's going on. But I don't know if this gets called anywhere.
On the initial box, the 'net' table was corrupt, which stopped "-p" working.
Running "... -r -d" has just generated a run ending like this:

Code:

...
vacuuming
/mod/monitor/lib/db.jim:123: Error: database or disk is full
in procedure 'record' called at file "/mod/monitor/bin/smart", line 35
in procedure 'purge' called at file "/mod/monitor/lib/db.jim", line 78
in procedure '_vacuum' called at file "/mod/monitor/lib/db.jim", line 137
at file "/mod/monitor/lib/db.jim", line 123

The file is still 172MB in size and there is plenty of disk space. More investigation needed.
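
One thing that may be worth ruling out: SQLite's VACUUM rebuilds the database into a temporary copy, so it can need free space of roughly the database's size wherever SQLite puts its temporary files, which is not necessarily the disk holding monitor.db. A rough check might look like this. The helper names are hypothetical, and the temporary directory is left as a parameter because which one is used on the Humax is not established here.

```shell
#!/bin/sh
# Print the free space, in KB, on the filesystem holding a directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print the size of a file in KB, rounded up.
file_kb() {
    bytes=$(wc -c < "$1")
    echo $(( (bytes + 1023) / 1024 ))
}

# Succeed if the directory has at least as much free space as the file is big.
has_room_for_copy() {
    [ "$(free_kb "$2")" -ge "$(file_kb "$1")" ]
}
```

For example, `has_room_for_copy /mod/monitor/monitor.db /tmp || echo "not enough room in /tmp"`.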

jerrytaff · Jul 12, 2021

Mine has 3098 rows of data for each of net, smart, temp and vmstat. Each entry has a time in seconds since 1st Jan 1970, going back to Aug 10 2017, which is when I first installed the drive. Each entry is labelled "weekly", "daily", "hourly", "15min", "5min" or "raw". It seems to record every minute, and then duplicates every 5th as a "5min", every 15th as a "15min" etc. After 1 day it deletes the "raw"s. Later it deletes the "5min"s etc, but right now I can't tell when, as the cut-off was in the period when I didn't have a working box. The oldest entries I have are all daily and weekly. It doesn't ever seem to delete the "daily"s. Approximately 1/3 of my data is from the last day, so, assuming it never deletes the oldest entries, that's only 365 extra entries a year.

To get to millions, it must be keeping all the "raw" entries. I wonder if it schedules the cleanup at a particular time when your boxes are off. I keep mine on all the time.
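
A quick way to test that guess would be to count the rows under each label. This is a sketch only: it assumes sqlite3 is installed, and the column holding the label is passed in as an argument because its real name is not shown in this thread (check the schema first).

```shell
#!/bin/sh
# Sketch only: assumes sqlite3 is installed. The label column name is an
# argument because the real schema is not shown in this thread.

# Print "label|count" for each distinct value of a column in a table.
rows_by_label() {
    sqlite3 "$1" "SELECT $3, count(*) FROM \"$2\" GROUP BY $3 ORDER BY $3;"
}
```

A table that is being purged properly should show only about a day's worth of "raw" rows. Millions of them would suggest the purge is not running.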

jerrytaff · Jul 12, 2021

If you are interested, I used WinSCP to copy the file onto a PC (this requires the custom package greenend-sftp to be installed), used DB Browser for SQLite to view the data, and then copied and pasted it into Excel to analyse it. I wouldn't want to do that with millions of rows of data. I also used https://www.unixtimestamp.com/ to convert some of the critical timestamps to dates and times.
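
The timestamp conversion can also be done from a shell instead of a website. A minimal sketch, assuming a date command that accepts `-d @SECONDS` (GNU date does). The function name is made up for illustration.

```shell
#!/bin/sh
# Convert seconds since 1970-01-01 UTC into a readable UTC date and time.
# Assumes a date command that accepts -d @SECONDS (GNU date does).
epoch_to_utc() {
    date -u -d "@$1" +'%F %T'
}
```

For example, `epoch_to_utc 1502323200` prints `2017-08-10 00:00:00`.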

xyz321 · Jul 12, 2021

prpr said:
Maintenance seems to be achieved via "/mod/monitor/run -p" and "... -r" with "-d" to see what's going on. But I don't know if this gets called anywhere.
On the initial box, the 'net' table was corrupt, which stopped "-p" working.
Running "... -r -d" has just generated a run ending like this:

Ah, it's in /mod/monitor/lib/db.jim.

Code:

proc purge {} {
      mondebug "Purging."

      $::mondb query {begin transaction}
      _purge raw 1
      _purge 5min 5
      _purge 15min 10
      _purge hourly 15
      # Never purge daily or weekly rollup data
      $::mondb query {commit}

      _vacuum
}

xyz321 · Jul 12, 2021

Even without the -p option, purge is called from record whenever $minute is 30, which (if that is the minute of the hour) means once an hour at half past.

Code:

if {$::fpurge || $minute == 30} { purge }

prpr · Jul 12, 2021

I've been through 5 databases now (4 HDR and 1 HD) and all bar one HDR have corrupt tables, not always the same one.
Because of the way 'run' calls the things in 'bin', if an earlier table is corrupt it stops the later ones from processing.
This is really bad - there needs to be an exception handler round each invocation of the 'bin' scripts which would at least make sure the non-corrupt tables continued to function properly.
In at least one of mine, the 'net' table was corrupt which stopped processing on all the others, as 'net' is first alphabetically.

There is also a distinct lack of indexes on this database, which makes all the rollup/prune operations take much longer than they should, especially so under error conditions.
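
As an illustration of what adding an index might look like, here is a sketch only. It assumes sqlite3 is installed, the column name is an argument because the real schema is not shown in this thread, and whether the monitor package copes with the extra index has not been tested.

```shell
#!/bin/sh
# Sketch only: assumes sqlite3 is installed. Table and column are arguments
# because the real schema is not shown in this thread.

# Create an index on one column of one table, if it does not already exist.
add_index() {
    sqlite3 "$1" "CREATE INDEX IF NOT EXISTS \"idx_${2}_${3}\" ON \"$2\" ($3);"
}
```

Try it on a copy of the database first. An index speeds up lookups but adds a little work to every insert.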

prpr · Jul 12, 2021

jerrytaff said:
The oldest entries I have are all daily and weekly. It doesn't ever seem to delete the "daily"s.

The code specifically says it doesn't ever delete 'daily' or 'weekly' data.
I can't imagine why, since the GUI limits you to a year, which is the limit for purging the other rows.
Keeping stuff from about 5 years ago seems utterly pointless to me, especially as there's no effective way to view it.

/df · Jul 12, 2021

sqlite3 in WebShell?

The Jim scripts that are called don't make any obvious attempt to trap errors or return status but apparently jimsh terminates with return code 1 if a script raises an unhandled exception.

I think the run script would be more useful like this:

Code:

#!/bin/sh

LOGFILE=/mod/tmp/monitor.log

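# Prefix each line read from stdin with a UTC timestamp.
datelog() {
    while IFS= read -r line; do
        printf "%s: %s\n" "$(date -u +'%F %T')" "$line"
    done
}

# Run every executable in bin; a failure in one does not stop the rest.
main() {
    ret=0
    for f in /mod/monitor/bin/*; do
        [ -x "$f" ] || continue
        # Read $? in the else branch: after `if ! cmd`, $? holds the
        # negated status and would always be 0.
        if "$f" "$@"; then
            :
        else
            x=$?
            [ "$ret" = 0 ] && ret=$x
            echo "$f failed ($x)"
        fi
    done
    return "$ret"
}

main "$@" 2>&1 | datelog >> "$LOGFILE"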

As the program is run once a minute by cron, success isn't logged.

(Much easier to post after realising that Markdown is accepted!)

MartinLiddle · Jul 12, 2021

jerrytaff said:
Finally, prior to the old box dying, S.M.A.R.T. data graphic from Sysmon showed no errors until Jan this year when the offline count went to 8. On 30th April it jumped to over 1000, increasing to 1500 the next day. That was 3 weeks before it started overheating. When it finally went into the reboot loop, according to Sysmon, the values changed to 16 offline and 72 reallocated. They have remained at that level since moving it to the new box.

How likely do you think it is that the disk is on borrowed time, having been stressed by whatever it was subjected to when the old box failed?

If you are saying the offline count is 1500 you really need to run fix-disk and see what happens. If all 1500 become reallocated sectors then that would be a bit worrying.

jerrytaff · Jul 16, 2021

I said it was reported as being 1500 prior to the box failing, but having put it in a different box, without running fix-disk, it has reduced to 16.
