Media List Sort Order

Thanks /df for your comments, very helpful.

The main issue, the sort order, can be fixed by not putting a 0x15 in the Title, since the system seems to treat that as UTF-8 by default. But that means changing the hmt utility.

That's certainly what I hoped for as a solution. Since it has only recently been changed to insert the X'15', I'm hoping that it wouldn't be difficult to undo, and wouldn't cause any other problems. From the comments so far, it seems that all the problems might have a common source, and therefore the fix might be quite easy.

I'm sure we could come up with a Jim script that would retrospectively remove any 0x15 from the Title in a .hmt, and apply that to all the .hmt files in a folder, etc.

I'm sure that's true, but I'm not sure that it's worth the effort, if we fix the underlying problem. It seems to be only me that's affected by it, and I can sort out my own problems by other means. (Apologies for the pun!)

The tag in the ITitle is explained once one remembers that SD and HD have different formats. Is this a standard feature, or related to the CFW EPG decoder?

Certainly the X'15' for HD and X'106937' for SD seems to be pervasive in unmodified recordings. I have to confess that it had never occurred to me that the CFW, rather than the base Humax firmware, might have been creating these prefixes. If it would help, I can reinstall one of my boxes with old (1.02.32) base firmware and see how it encodes SD and HD recordings?

From the spreadsheet, the filename position issue seems to be localised to the period before 2014, and so to whatever OEM firmware version applied then.

To some extent, it depends how we define the issue. The base firmware sets it at 0180. At some stage in the past, in fact no later than 2011, it would appear that the custom firmware changed this to 017F, except for a brief period in 2014 when it temporarily reverted to 0180. 017F has effectively become the custom firmware standard ever since. It would appear that both the base and custom firmware (hmt?) can accommodate either.

So we could, if we wanted uniformity, standardise on 0180. Or we could follow industry standard practice and declare it to be 'working as designed', and claim that the change was made to provide a handy way to differentiate between 'processed' and 'unprocessed' recordings. (Maybe it really was?)

Grantchester (2014) looks to have been recorded before the firmware was upgraded.

I'm a bit confused by this comment. Whilst the first episode of Grantchester was broadcast in 2014, this episode (Series 7 Episode 2) was broadcast last Friday on a system with webif 1.4.9-6. Am I missing something, other than good taste in my choice of recordings?

The more substantive issue, which both you and bottletop have raised, is what to do about the varying encodings of non-ASCII characters. This isn't one that directly affects me, but I'm happy to be involved in any further investigation if that would help (for example the suggested test above).

Thanks again.
 
I'm sure that's true, but I'm not sure that it's worth the effort, if we fix the underlying problem. It seems to be only me that's affected by it, and I can sort out my own problems by other means. (Apologies for the pun!)
Incorrect on several counts. There is a certain pragmatism in having a "helper" utility which just scans .hmt files and corrects them, it is much quicker and less effort than tracking down all the independent sources of the problem. Also, if it is found not to solve the problem (or introduces unexpected side-effects), it is easy to turn off. Yes, fair enough, the gold standard is to clean up everything rather than apply a sticking plaster, but that can continue in the background if anyone has the will. Data from deployment of the sticking plaster will be useful in the clean-up.
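Such a scanner could be as simple as this rough sketch (detection only - fixing would mean shifting the rest of the fixed-length field; the recordings path varies by model, and 666 is decimal for the 0x29A Title offset discussed in this thread):
Code:
# flag any .hmt whose Title field starts with a 0x15 byte
for f in "/media/My Video"/*.hmt; do
    b=$(dd if="$f" bs=1 skip=666 count=1 2>/dev/null | hexdump -v -e '1/1 "%02x"')
    [ "$b" = "15" ] && echo "0x15-prefixed Title: $f"
done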

If the bug affects the sort order, it affects everyone even if nobody else has spotted it (or spotted it but not reported it).
 
As far as I have observed, the Humax software seems to cope with either 0x106937 or 0x15 as a prefix in all strings, except for the 0x29A Title string which does/must not have a prefix as it is automatically UTF-8.
I think the sensible thing to do is for hmt, and therefore the WebIf, to assume that all strings presented as command line input are in UTF-8 and to set the 0x15 prefix in all fields when modifying them (apart from 0x29A of course).
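As a minimal sketch of that convention (illustrative shell only, not the actual hmt code; 'Motörhead' is just a sample title):
Code:
title='Motörhead'                       # arrives from the command line as UTF-8
printf '\025%s' "$title" | hexdump -C   # tagged:   15 4d 6f 74 c3 b6 ...
printf '%s' "$title" | hexdump -C       # untagged: as the 0x29A Title would be stored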
Would this approach agree with everyone else's observations/desires?
 
assume that all strings presented as command line input are in UTF-8
I am concerned whether that is a valid assumption. I have been delving into the detail of UTF-8, but have not found the nub of the matter (I've read statements like "a single-byte ASCII* character is encoded as one byte in UTF-8" – but that obviously can't be true because UTF-8 has to have some way of indicating that it is a single-byte encoding and not a multi-byte encoding). There's lots written but very little seems authoritative, which is (I guess) why this is such a minefield.

* Bearing in mind I use ASCII (incorrectly) here to mean the extended ASCII set.

The question is: is it valid to simply tack a UTF-8 indicator flag in front of any ASCII-encoded string? Do all UTF-8 decoding control flags (by which I mean multi-byte indicators) correspond with ASCII control character codes (which should never appear in a text string)?

As a brief summary (and do not rely on this information for deployment):-

The Problem

The origins of telegraphic communication were (of course) in America, using English, by sending pulses down wires so that pressing typewriter keys at one end of the wire could actuate a print head at the other end. 5 bits of data was just about enough to send the letters of the English alphabet (26), with 5 codes spare as "shifts" (capital-shift, number-shift, punctuation-shift), and maybe not using "00000" because it's hard to detect that there are no pulses!

As technology developed, the teletype standard became 7 bits (ASCII, 128 codes) which was sufficient for all the upper and lower case letters, all the numbers, all the punctuation, and control codes such as CR and LF, without needing shifts... but again, catering for English only. As technology spread across the world, an 8th bit was introduced so that codes 128-255 could be used for accented characters etc, with a different set used in each locality.

Sharing documents between localities meant ensuring everyone was singing from the same code page... otherwise the text looked like garbage. Not too bad while there is a restricted audience for your output and you can tell everybody, but not so clever when text becomes shared worldwide through email and web pages, with the invention of the Internet.

Then there are the oriental languages which use thousands of pictograms instead of a small set of letters. How to accommodate those?

Multiple "solutions" have arisen, again with the problem of ensuring everyone is using the same standard. Standards for communicating the communication standard are embedded in web page data, but there is a crystalisation:

Unicode

Unicode provides a huge number of character codes, using a nominally 32-bit code space. In practice, about 1.1 million code points (up to 0x10FFFF) are usable, of which about one eighth are currently allocated to all the letters, accented letters, pictograms, etc for all the written languages in the world... and a whole raft of emojis (with more being added all the time – no idea why, and there's still room for a whole lot more!).

UTF-8

Allocating 4 bytes of data to every text character quadruples the amount of storage space and transmission bandwidth over plain ASCII, so is clearly inefficient in many situations (where the text is mainly Latin languages). UTF-8 is an encoding for Unicode where the most common characters are represented by one byte, the less common by two, and so on. The result is that text in plain English (for example) will occupy almost the same number of bytes as if it were ASCII.

UTF-8 is the default for HTML5.
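This is easy to demonstrate on any box with hexdump (assuming the terminal is set to UTF-8): unaccented letters stay at one byte, £ and é encode as two bytes each, and € as three.
Code:
printf 'a£é€' | hexdump -C   # 61  c2 a3  c3 a9  e2 82 ac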

ISO-6937

This is an alternative to UTF-8 / Unicode, catering for Latin alphabets only, where accented characters are encoded as two bytes Cx XX – where XX is the character and x is the accent. That makes byte values C1-CF (except C9) unavailable as single-byte characters, and codes 00-7F correspond with the normal 7-bit ASCII codes.
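For example, é is the acute-accent byte followed by the base letter. A one-line illustration (not a full ISO-6937 decoder):
Code:
printf '\302e' | hexdump -C   # c2 65 - accent (0xC2) then "e", two bytes for one character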
 
... Since it has only recently been changed to insert the X'15', I'm hoping that it wouldn't be difficult to undo, or cause any other problems. ...

As observed, the hmt program sets the Title and ITitle from the +settitle=... option value. The change was to start passing the prefix in the option value, resulting in a prefix in both fields. So we have to modify the program, and possibly its option syntax.

I think the sensible thing to do is for hmt, and therefore the WebIf, to assume that all strings presented as command line input are in UTF-8 and to set the 0x15 prefix in all fields when modifying them (apart from 0x29A of course).

An alternative is to invent an option to select the character encoding. But this seems excessive. We do need to identify the fields that know about tagged encodings (ITitle, Synopsis, ...?). For backward compatibility the modified hmt might leave known prefixes of field values passed to the program as options as-is, or convert them to UTF-8 with a 0x15 prefix, or strip any prefixes.
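The "strip any prefixes" option might look something like this hypothetical shell sketch (it only removes the tag bytes; it does not transcode ISO-6937 content to UTF-8):
Code:
# drop a known encoding tag from the front of an option value
strip_prefix() {
    hex=$(printf '%s' "$1" | hexdump -v -e '1/1 "%02x"')
    case $hex in
        15*)     printf '%s' "$1" | tail -c +2 ;;   # 1-byte UTF-8 tag
        106937*) printf '%s' "$1" | tail -c +4 ;;   # 3-byte ISO-6937 tag
        *)       printf '%s' "$1" ;;                # no known tag
    esac
}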

... is it valid to simply tack a UTF-8 indicator flag in front of any ASCII-encoded string?

Tl;dr: yes.

UTF-8 was invented, like almost everything else, by Ken Thompson (and Rob Pike). Unfortunately this was after all the worse ways of representing Unicode and other international character repertoires had already been deployed. UTF-8 allows programs to handle all the characters known to Unicode but correctly places the tax for funny characters (as we are calling them in this thread) on the people who insist on using them, by making their encoding longer, while the English-writing world can carry on as normal.

ASCII == ISO 646 characters encoded as a byte with value 0-127 carry over to UTF-8. All bytes of the 2-4 byte encodings have the top bit set and can't be confused with ASCII.
 
UTF-8 is a variable-width encoding which uses one to four bytes to encode a single character.
It is correct that traditional ASCII (0-127) is encoded as a single byte.

0xxxxxxx: Single byte encoding
110xxxxx 10xxxxxx: Two byte encoding
1110xxxx 10xxxxxx 10xxxxxx: Three byte encoding
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: Four byte encoding

So when reading UTF-8 text a decoder must look at the most significant bit(s) to determine how many bytes to use in forming a particular character.
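For instance, this throwaway snippet classifies the lead byte of the euro sign (anything from 0x80 to 0xBF is a continuation byte, and 0xF8 upwards is invalid):
Code:
lead=$(printf '€' | hexdump -v -e '1/1 "%02x"' | cut -c1-2)   # e2 for the euro sign
case $lead in
    [0-7]?) echo "1-byte sequence" ;;   # 0xxxxxxx
    [cd]?)  echo "2-byte sequence" ;;   # 110xxxxx
    e?)     echo "3-byte sequence" ;;   # 1110xxxx
    f[0-7]) echo "4-byte sequence" ;;   # 11110xxx
    *)      echo "continuation or invalid" ;;
esac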

 
ASCII == ISO 646 characters encoded as a byte with value 0-127 carry over to UTF-8. All bytes of the 2-4 byte encodings have the top bit set and can't be confused with ASCII.
I realise that, but what I meant was the extended-ASCII 8-bit set (whatever that's called). Are we sure no extended-set characters will turn up, or at least none with C-F in the top nybble? Maybe a sanitiser needs to check the content of the string first.

Is it simply the case that the medialist sorter just reads the string plain, and therefore puts UTF-8 strings (beginning 0x15) ahead of non-UTF-8 strings (probably beginning >0x30), or is something more complicated going on?

0xxxxxxx: Single byte encoding
110xxxxx 10xxxxxx: Two byte encoding
1110xxxx 10xxxxxx 10xxxxxx: Three byte encoding
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: Four byte encoding
Thanks, I failed to find that amongst the waffle. By my reckoning then, UTF-8 is capable of representing 128+2048+65536+2097152=2,164,864 code points. I'm not clear why Unicode appears to offer only a subset of the 2^32 code points, and even a subset of 2,164,864 code points... but it's clearly a very complex subject (finding ways to satisfy every form of text worldwide) best left to the relevant committee!
 
I realise that, but what I meant was the extended-ASCII 8-bit set (whatever that's called). Are we sure no extended-set characters will turn up, or at least none with C-F in the top nybble? Maybe a sanitiser needs to check the content of the string first.
It may not be easy to distinguish UTF-8 from ISO-6937 when the string contains either 8-bit single-byte characters or two-byte encodings.

Is it simply the case that the medialist sorter just reads the string plain, and therefore puts UTF-8 strings (beginning 0x15) ahead of non-UTF-8 strings (probably beginning >0x30), ...
William of Occam led me to believe that was the likely explanation, though "untagged" rather than "non-UTF-8".
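The effect is trivially reproducible with a byte-wise sort: 0x15 is less than 'A' (0x41), so the tagged string sorts first (cat -v shows the 0x15 as ^U).
Code:
{ printf '\025Zebra\n'; printf 'Apple\n'; } | LC_ALL=C sort | cat -v
# ^UZebra
# Apple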
 
Incorrect on several counts. There is a certain pragmatism in having a "helper" utility which just scans .hmt files and corrects them, it is much quicker and less effort than tracking down all the independent sources of the problem. Also, if it is found not to solve the problem (or introduces unexpected side-effects), it is easy to turn off. Yes, fair enough, the gold standard is to clean up everything rather than apply a sticking plaster, but that can continue in the background if anyone has the will. Data from deployment of the sticking plaster will be useful in the clean-up.

If the bug affects the sort order, it affects everyone even if nobody else has spotted it (or spotted it but not reported it).

You are right, of course. I was apparently mistaken in assuming that this was a problem of little interest to most people, and was therefore trying to minimise the effort. Sorry.

As to what to do about it, I start from the viewpoint that today's anomaly is either tomorrow's understanding - or tomorrow's bug. Hence my interest in documenting the anomalies in the hmt file. So I wonder "what would it take to make sure that all the CFW-produced or -modified hmt files are consistent with those generated by the base firmware?" We've still got a few 'known unknowns' to resolve, but not many. The obvious worry is about compatibility. But given the in-built tolerance that the system has shown to the variations that we have produced over the years, maybe it's not that big a worry. The other problem is how much we have to do all at once in terms of retrofitting past formats. But you've already addressed that.

Just a thought.

Yes, I think that would be useful if you can manage it.

Will do.
 
I've run some tests on an RMA-ed machine with 1.02.32 (Jan 2013) base software installed. I simultaneously recorded both the SD and HD versions of three programmes on this machine. For two of the programmes, I also recorded the SD and HD versions on a separate machine with the latest CFW installed. I deliberately chose programmes which would highlight the difference between those that had guidance and those that didn't.

The results are in the attached spreadsheet. I've also included the relevant hmt files in the zip, in case anybody wants to look at them directly. Note that the external file names of the hmt files have been modified slightly to make it easier to distinguish between the base and custom, and the SD and HD versions. Otherwise the hmt files themselves are untouched by me or (explicitly) by webif. A couple of observations:

1) They all show the same usage of the string fields and string field prefixes that we've seen elsewhere, specifically the use of X'15' for HD and X'106937' for SD text strings other than the untagged Title at offset 029A, and the X'15'-prefixed channel name at offset 045C.

As to the filename offset, the different offset values were artefacts of the historical OEM firmware. The hmt program was modified to handle this. The 0x180 offset is the new standard.

2) Thanks for clarifying this. That's certainly consistent with these tests. It appears that the base firmware always stores the file name at the 017F offset, whilst a machine with CFW (from some level or other) installed will store it at 0180, even if the recording is not explicitly manipulated by webif functions. Is that right?
 

Attachments

  • Base vs Custom Comparisons.zip
the base firmware always stores the file name at the 017F offset, whilst a machine with CFW (from some level or other) installed will store it at 0180

The linked thread says it was a Humax change between OEM FW 1.02.32 and 1.03.12 (and nowadays no-one should be using a 1.02 version). As CFW doesn't modify the settop program that creates the .hmt files (actually, some hacks are applied dynamically, but not ones that would affect this), .hmt files from fresh recordings are the same for OEM vs CFW.
 
the use of X'15' for HD and X'106937' for SD text strings

You have to wonder if the SD code in the settop program was inherited from earlier code, pre-UTF-8, while the HD code was newly written for the HD models.
 
nowadays no-one should be using a 1.02 version
Not strictly true: AFAIK audio description on StDef doesn't work post-1.02.20. If you rely on AD you might want to stick with it regardless of other things not working.
 
AD works for me on E4 BBT S11E14, HD Fox-T2 1.03.02 CFW 3.13 after enabling it in Preferences>Audio.
 
News to me, not that I use AD. We had plenty of reports of it not working, but none that it was working again (if I had noticed any I would have mentioned it in Things Every... section 1).
 
The linked thread says it was a Humax change between OEM FW 1.02.32 and 1.03.12 (and nowadays no-one should be using a 1.02 version). As CFW doesn't modify the settop program that creates the .hmt files (actually, some hacks are applied dynamically, but not ones that would affect this), .hmt files from fresh recordings are the same for OEM vs CFW.
That's really helpful, thanks. hmt seems to accept a string starting anywhere between offsets 017F and 0181 as a valid filename. Presumably this shields the rest of the CFW from the variability. Though, if so, I'm not sure why so many recordings still use 017F.

You have to wonder if the SD code in the settop program was inherited from earlier code, pre-UTF-8, while the HD code was newly written for the HD models.
Quite possibly. But I also found myself wondering why it needed to tag these fields at all, as opposed to just knowing what decoding to apply to the strings at the relevant position - as it does for the title field. But that question opens up a whole can of worms, which I suspect we'd rather keep closed.

Incorrect on several counts. There is a certain pragmatism in having a "helper" utility which just scans .hmt files and corrects them, it is much quicker and less effort than tracking down all the independent sources of the problem. Also, if it is found not to solve the problem (or introduces unexpected side-effects), it is easy to turn off. Yes, fair enough, the gold standard is to clean up everything rather than apply a sticking plaster, but that can continue in the background if anyone has the will. Data from deployment of the sticking plaster will be useful in the clean-up.

If the bug affects the sort order, it affects everyone even if nobody else has spotted it (or spotted it but not reported it).

Do you think we now know enough to decide how much we want to fix? At one extreme, we can simply stop inserting X'15' into the title field. At the other, we can ensure that all hmt files are consistent with native firmware ones, both for the filename offset problem and the string prefixes. Then there's the separate question of background tools to clean up existing hmt files.
 
AD works for me on E4 BBT S11E14, HD Fox-T2 1.03.02 CFW 3.13 after enabling it in Preferences>Audio.
Are you sure of that? I use two HDRs, one with 1.02.20 and the other with 1.03.12, and the difference in the AD for SD programmes where there is background noise (e.g. action films, background sound effects, background music) is significant.

You can get some AD out with 1.03.12 but it is not fully working, and for many programmes it is not consistently good enough to use.

The AD function includes not only the AD audio track but also an indicator of how much the normal sound track should be reduced so that the AD audio can be clearly heard. This adjustment is dynamic and only occurs when there is sound on the AD track. Unfortunately, after 1.02.20 Humax started to apply the volume level reduction to the AD track instead of the normal track. This means that the AD is never as clear as it should be, and can be totally incomprehensible because its volume is reduced to a level where action scenes and music drown it. It does vary by programme, but for BBC programmes iPlayer is always superior to the Humax HDR with 1.03.12. The non-Humax Freeview recorders, and my native TV, that I have tried are a lot closer to the clarity and usability of iPlayer AD.
 
From an A-B comparison with the normally bypassed Bush DTV, the Humax AD was roughly the same volume as the audio track, whereas the Bush AD was extremely loud, requiring the "relative volume" slider (not available for Humax) to be set at -20 out of -100..+100 for an acceptable experience. Probably "Britain's Lost Masterpieces" wasn't the best test material, a bit short on action.

A project for someone with spare time who cares: find where in the settop program the AD is mixed into the audio and whether there is a relative volume parameter; if so, add a hook (extend nugget?) to patch in a value set in WebIf.
 
Here are some more data points, using this method:
Code:
for f in /media/\[Video\]/*.hmt; do
    hexdump -C -s 0x45C "$f" | head -1   # channel name field (0x45C)
    hexdump -C -s 0x29A "$f" | head -4   # Title field (0x29A, untagged)
    hexdump -C -s 0x516 "$f" | head -4   # ITitle field (0x516)
done
I found no recordings with a prefixed Title. This may be because the box is an HD, rarely used for detectads, etc.

Recordings from HD channels had the 0x15 ITitle prefix, except for BBC FOUR HD "Revolution: New Art for a New World" which had no ITitle prefix. SD channels from HD muxes also seemed to have this prefix, but I can't be sure which those were at the time of recording: e.g. Together "Eel Pie Island Hotel", Forces TV "Flying Through Time", Film4+1 "Joe Strummer: The Future Is".

Recordings from known and possibly other SD channels from SD muxes had the 0x106937 ITitle prefix, except for ITV2+1 "The Pirates! In an Adventure with Scientists" which had no ITitle prefix.

Presumably the items with unprefixed ITitle had been edited at some point before the prefix change was committed.
 