Another computer question...

Can't say as I know the correct terminology for which symbols. I know that this site will translate some of what you call a smiley into a graphic. But if you use the Android soft keypad you can get a different graphic for what you think is the same smiley! ::) and 😀 are supposedly the same.
My understanding is/was that anything graphic makes the message into an MMS.
Mine also.

The soft keyboard on my tablet has a range of "smileys and emotions". Most of them are graphical. But there is a selection of the pre-graphical screen text-only versions. Whether these get translated to a graphical form in a text message I don't know!

Ah. I see you posted the real answer as I was typing!
 
What's the correspondence between 160 plain characters and 70 Unicode characters? Maybe 160 x 7 bits = 70 x 16 bits.
GSM supports 140 x 8 bits encoded using GSM-7 which gives you 160 x 7 bits.
But surely that's not the whole answer, because there might be some preamble to tell the receiving end that it's Unicode, and not all Unicode is 16 bits. As I understand it, Unicode is a superset of ASCII, so wouldn't the plain characters correspond 1:1 in Unicode anyway? Maybe, instead of "Unicode", we should be talking about UTF-16.
Use of the term "Unicode" is just sloppy. It's actually (fixed-width) UCS-2, which is 16 bits. So 140 x 8 / 16 gives you the 70.
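The arithmetic is easy to check. A minimal sketch (the constant and function names are my own, purely for illustration):

```javascript
// One SMS carries a payload of 140 octets.
const PAYLOAD_BITS = 140 * 8; // 1120 bits

// Number of whole characters of a given width that fit in one payload.
function charsPerMessage(bitsPerChar) {
  return Math.floor(PAYLOAD_BITS / bitsPerChar);
}

console.log(charsPerMessage(7));  // GSM-7: prints 160
console.log(charsPerMessage(16)); // UCS-2: prints 70
```

So 160 x 7 and 70 x 16 both come to exactly 1120 bits, which is why the two limits look unrelated but are the same payload.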
 
GSM supports 140 x 8 bits encoded using GSM-7 which gives you 160 x 7 bits.

Use of the term "Unicode" is just sloppy. It's actually (fixed-width) UCS-2, which is 16 bits. So 140 x 8 / 16 gives you the 70.
So I have discovered, as I have been ferreting:

https://www.twilio.com/docs/glossary/what-is-gsm-7-character-encoding
GSM-7 is a character encoding standard which packs the most commonly used letters and symbols in many languages into 7 bits each for usage on GSM networks. As SMS messages are transmitted 140 8-bit octets at a time, GSM-7 encoded SMS messages can carry up to 160 characters...

GSM-7 is the standard alphabet for SMS messages, written up in the standard GSM 03.38. It is always supported on GSM networks. In languages with more than 128 commonly used symbols, GSM-7 is mandated. However, local language support is implemented with shift tables or by changing text encoding to (16-bit) UCS-2 encoding.

The basic character set for GSM-7 can be found here.

For some characters, such as '{' and ']', an escape code is required - so even in a GSM-7 encoded message these characters will be encoded using two characters.

...which confirms my suspicion that the "basic" system uses 7-bit encoding...
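To see what the escape-code remark means for the character count, here is a rough sketch that counts 7-bit units for a piece of text. The two tables are transcribed from my reading of the GSM 03.38 default alphabet and its extension table, so treat them as an assumption to be checked against the standard; the function name is made up.

```javascript
// GSM 03.38 basic character set: one 7-bit unit each (ESC itself omitted).
const GSM7_BASIC =
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
  "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà";

// Extension table: sent as ESC + code, so each costs two 7-bit units.
const GSM7_EXTENDED = "^{}\\[~]|€\f";

// Returns the number of 7-bit units needed, or null if the text
// contains a character that GSM-7 cannot represent at all.
function gsm7Septets(text) {
  let septets = 0;
  for (const ch of text) {
    if (GSM7_BASIC.includes(ch)) {
      septets += 1;
    } else if (GSM7_EXTENDED.includes(ch)) {
      septets += 2;
    } else {
      return null; // would force a switch to UCS-2
    }
  }
  return septets;
}

console.log(gsm7Septets("hello")); // five basic characters
console.log(gsm7Septets("{x}"));   // two escaped characters plus one basic
console.log(gsm7Septets("😀"));    // not in GSM-7
```

The sketch compares characters exactly as typed, so accented letters entered in a decomposed form would be reported as unsupported.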

https://www.twilio.com/docs/glossary/what-is-ucs-2-character-encoding
UCS-2 is a character encoding standard in which characters are represented by a fixed-length 16 bits (2 bytes). It is used as a fallback on many GSM networks when a message cannot be encoded using GSM-7 or when a language requires more than 128 characters to be rendered.

...and my suspicion that 70 is the number of 16-bit characters that will fit in the same number of bits.

I've not yet found the UCS-2 code points for the national flags, but I have found this:

More interesting are the Unicode characters that are not combining characters, but compose in some way in practice anyway. The flag emoji, for example, don’t actually exist in Unicode. The Unicode Consortium didn’t want to be constantly amending a list of national flags as countries popped in and out of existence, so instead they cheated. They added a set of 26 regional indicator symbols, one for each letter of the English alphabet, and to encode a country’s flag you write its two-letter ISO country code with those symbols. So the Canadian flag, 🇨🇦, is actually the two characters U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C and U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A.
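The regional indicator trick is easy to try out. A small sketch (the function name is my own invention); it also shows that each indicator lies above U+FFFF, so it cannot be a single fixed-width 16-bit UCS-2 character. My assumption, not something confirmed in this thread, is that handsets send such characters as UTF-16 surrogate pairs, i.e. two 16-bit units apiece.

```javascript
// U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; B to Z follow in order.
const REGIONAL_INDICATOR_A = 0x1F1E6;

// Build a flag from a two-letter ISO country code.
function flagFromCountryCode(code) {
  const points = [...code.toUpperCase()].map(
    (letter) => REGIONAL_INDICATOR_A + letter.charCodeAt(0) - "A".charCodeAt(0)
  );
  return String.fromCodePoint(...points);
}

const canada = flagFromCountryCode("CA");
console.log(canada);
console.log([...canada].length); // code points: prints 2
console.log(canada.length);      // 16-bit units: prints 4
```

If that assumption holds, a single flag would use up four 16-bit units of a message rather than one.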

Still no answer why 209 ≠ 206 though.

The soft keyboard on my tablet has a range of "smileys and emotions". Most of them are graphical. But there is a selection of the pre-graphical screen text-only versions. Whether these get translated to a graphical form in a text message I don't know!
If I actually text ;-) and then look at what was sent (in the text app on my Android phone), the text smiley is converted into an Android wink icon – but that's not the same as the wink emoji so I guess there's just some display conversion rather than actual conversion.
 
Still no answer why 209 ≠ 206 though.
I guess liberties are being taken with displaying/performing calculations on numbers with different units. For example, does the 4 refer to 4 x 8 bits or 7 bits? Or even 16 bits?
Just like Humax did with their disk space fudges.
 
A single SMS can be up to 160 characters long. Once you go over this length, it is not correct that the second part can be a further 160 chars. There is a 7 byte loss per message (including the first) due to the overhead required to create an address to combine them.

This is the formula:

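var tl = text_length; // number of characters typed so far
var msgs = 0;         // number of SMS messages needed
var remd = 160;       // characters left before another message is needed

if (tl == 0) {
    // empty text: the defaults above stand
} else if (tl <= 160) {
    msgs = 1;
    remd = 160 - tl;
} else {
    msgs = Math.ceil(tl / 153);
    remd = 153 - (tl - ((msgs - 1) * 153));
}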

remd now equals the characters left before another sms message is required
msgs now equals the number of sms message credits required to send the text
 
There is a 7 byte loss per message (including the first) due to the overhead required to create an address to combine them.
Erm...

Do you really mean "7 bytes"? If you do, then that's 8 7-bit characters so your character count per message becomes 152 not 153.

If there are 133 bytes (down from 140 bytes) available per message in concatenated messages, how are UCS-2 messages packed into those? Is it 66 UCS-2 characters per message, or 66½?
 
There is a problem though. Setting up test messages and looking at the message stats:

1 message reads zero characters left at 160 characters (GSM-7);

170 characters produces "136/2" (i.e. 136 characters left out of two messages), so that would be 153 characters per message as suggested above. That means 49 bits lost to administrative functions, not 7 bytes (56 bits).

310 characters produces "149/3", so three messages = 459 characters = 153 x 3
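Those three readings can be checked against the quoted formula by wrapping it in a function. This is only a sketch: the wrapper name and the returned object are my own, the arithmetic is the formula as posted.

```javascript
// The quoted formula, wrapped so it can be called with a character count.
function gsm7Parts(tl) {
  let msgs = 0;   // number of SMS messages needed
  let remd = 160; // characters left before another message is needed
  if (tl > 0 && tl <= 160) {
    msgs = 1;
    remd = 160 - tl;
  } else if (tl > 160) {
    msgs = Math.ceil(tl / 153);
    remd = 153 - (tl - (msgs - 1) * 153);
  }
  return { msgs, remd };
}

for (const tl of [160, 170, 310]) {
  const { msgs, remd } = gsm7Parts(tl);
  console.log(`${tl} characters -> ${remd}/${msgs}`);
}
```

The formula gives 0/1, 136/2 and 149/3, the same as the counter on the phone.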

OK so far. What happens when I introduce a UCS-2 character?

1 message has 69 characters capacity
2 messages have 133 characters capacity
3 messages have 200 characters capacity

200 is not a multiple of 3, 133 is not a multiple of 2... but it works if a UCS-2 character gets split into two bytes and packed as 133 bytes per message: 2 x 66½ = 133, 3 x 66½ = 199½.

BUT 133 bytes = 152 GSM-7 characters. Conclusion: there are 7 bits unused per message when multi-message UCS-2 is used, and another 8 bits unavailable in the 3rd message.

Based on this:

153 x 2 - 111 = 195

66½ x 3 - 4 = 196

Double Yay!!! :cheers:
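For comparison, here is the arithmetic that is usually quoted for concatenated SMS. It is a sketch under an assumption that nothing in this thread confirms: that each part of a multi-part message carries a 6-byte header (the common layout with an 8-bit reference number), leaving 134 bytes for text. The names are my own.

```javascript
const PAYLOAD_OCTETS = 140;
const HEADER_OCTETS = 6; // ASSUMPTION: concatenation header per part

// Characters available across a given number of parts.
function capacity(parts, bitsPerChar) {
  if (parts === 1) {
    return Math.floor((PAYLOAD_OCTETS * 8) / bitsPerChar); // no header needed
  }
  const bitsPerPart = (PAYLOAD_OCTETS - HEADER_OCTETS) * 8; // 1072 bits
  return parts * Math.floor(bitsPerPart / bitsPerChar);
}

console.log(capacity(2, 7) / 2);  // GSM-7 characters per part: prints 153
console.log(capacity(1, 16));     // prints 70
console.log(capacity(2, 16));     // prints 134
console.log(capacity(3, 16));     // prints 201
```

Under that assumption, 1072 bits hold 153 GSM-7 characters with one bit to spare, so the header plus that spare bit would account for the 49 bits. For UCS-2 it would be 67 whole characters per part with no half characters. The totals of 70, 134 and 201 are each one more than the 69, 133 and 200 read off the phone, which would fit if the counter was showing what was left after the one UCS-2 character had already been typed. That last point is a guess.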
 
There is a 7 byte loss per message (including the first) due to the overhead required to create an address to combine them.
No, there's a 7 GSM-7 character loss – 49 bits not 56 bits.

This is the formula:

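var tl = text_length; // number of characters typed so far
var msgs = 0;         // number of SMS messages needed
var remd = 160;       // characters left before another message is needed

if (tl == 0) {
    // empty text: the defaults above stand
} else if (tl <= 160) {
    msgs = 1;
    remd = 160 - tl;
} else {
    msgs = Math.ceil(tl / 153);
    remd = 153 - (tl - ((msgs - 1) * 153));
}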
Where did the formula come from? Is it definitive or empirical?

Here's my empirical version (BASIC – bear in mind this is only a thought experiment and I haven't actually tried to run it, and I'm rusty, so there might be syntax errors – eg I've assumed it is valid to have a label and a variable of the same name!):

Code:
REM INPUTS:
REM CHARS : Current number of characters in message preparation.
REM UCS2  : Flag that the message contains characters outside the
REM         GSM-7 set so that the message must be encoded UCS-2.

REM OUTPUTS:
REM MSGS  : The number of SMS messages required to be concatenated
REM         to accommodate CHARS.
REM REMD  : The spare capacity in the current number of messages,
REM         counted in characters.
REM MMS   : Flag that the message exceeds SMS limits and must be
REM         sent by MMS, so MSGS and REMD are invalid.

IF UCS2 = TRUE THEN GOTO UCS2
IF CHARS > 459 THEN GOTO MMS
IF CHARS > 160 THEN GOTO CONCAT_GSM7

REM         Single message encoded GSM-7
LET MSGS = 1
LET REMD = 160 - CHARS
LET MMS = FALSE
EXIT

CONCAT_GSM7:
REM         Multiple concatenated messages encoded GSM-7
LET MSGS = 1 + INT((CHARS - 1)/153)
LET REMD = (153 * MSGS) - CHARS
LET MMS = FALSE
EXIT

UCS2:
REM         Message contains non-GSM-7 characters
IF CHARS > 200 THEN GOTO MMS
IF CHARS > 69 THEN GOTO CONCAT_UCS2

REM         Single message encoded UCS-2
LET MSGS = 1
LET REMD = 69 - CHARS
LET MMS = FALSE
EXIT

CONCAT_UCS2:
REM         Multiple concatenated messages encoded UCS-2
LET MSGS = 2
LET REMD = 133 - CHARS
IF CHARS > 133 THEN LET MSGS = 3
IF CHARS > 133 THEN LET REMD = 200 - CHARS
LET MMS = FALSE
EXIT

MMS:
REM         Message exceeds SMS limits
LET MSGS = 0
LET REMD = 0
LET MMS = TRUE

END
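For anyone who would rather run it than read it, here is a rough JavaScript rendering of the same logic. It is a sketch, not a tested replacement: the capacities are the empirical figures from my phone (160/306/459 for GSM-7 and 69/133/200 once a UCS-2 character is present), and the function and table names are made up.

```javascript
// Total capacity for 1, 2 and 3 messages, taken from the observed counter.
const CAPACITY = {
  gsm7: [160, 306, 459],
  ucs2: [69, 133, 200],
};

// chars: number of characters in the message being prepared
// ucs2:  true if any character falls outside the GSM-7 set
function smsStats(chars, ucs2) {
  const caps = ucs2 ? CAPACITY.ucs2 : CAPACITY.gsm7;
  const index = caps.findIndex((cap) => chars <= cap);
  if (index === -1) {
    // Exceeds SMS limits: must go by MMS, so msgs and remd are not valid.
    return { msgs: 0, remd: 0, mms: true };
  }
  return { msgs: index + 1, remd: caps[index] - chars, mms: false };
}

console.log(smsStats(170, false));
console.log(smsStats(134, true));
console.log(smsStats(460, false));
```

Using a lookup table instead of the GOTO branches keeps the three-message limit and the odd UCS-2 totals in one place, so they are easy to change if the figures turn out to differ on another phone.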
 
Anybody got an idea on this one?:

Trying to find loose matches between two lists, each list comprising multi-word strings where a "loose match" is when a matching string contains roughly the same words but not necessarily in the same order, and when "same" allows for typos etc.

Ideally, the output would comprise a matrix of probabilities of a match, or several "best" matches with probabilities.

I wondered whether there might be any useful functions in database-land (eg SQLite), which I know next to nothing about.
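To make the idea concrete, this is the sort of thing I have in mind, as a rough sketch only. It splits each string into words, scores every word against its closest word in the other string using edit distance (to allow for typos), and averages the results so that word order does not matter. All function names are made up, and the scores are similarities between 0 and 1, not true probabilities.

```javascript
// Classic edit distance: insertions, deletions and substitutions.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// 1 for identical words, falling towards 0 as more edits are needed.
function wordSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

function tokens(s) {
  return s.toLowerCase().split(/\s+/).filter(Boolean);
}

// Average, over the words of one string, of the best match in the other.
function directed(aWords, bWords) {
  if (aWords.length === 0 || bWords.length === 0) return 0;
  let total = 0;
  for (const w of aWords) {
    total += Math.max(...bWords.map((v) => wordSimilarity(w, v)));
  }
  return total / aWords.length;
}

// Symmetric score, so extra words on either side pull it down.
function looseScore(a, b) {
  const ta = tokens(a);
  const tb = tokens(b);
  return (directed(ta, tb) + directed(tb, ta)) / 2;
}

// One row per string in listA, one column per string in listB.
function scoreMatrix(listA, listB) {
  return listA.map((a) => listB.map((b) => looseScore(a, b)));
}

console.log(scoreMatrix(["red aple", "green pear"], ["apple red", "pear green"]));
```

Picking the highest entry or entries in each row would give the "best" candidates for each string in the first list.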
 
That's a loose match - so loose I had to look it up to get the joke. (My brain's struggling with fuzzy matching!)

I wondered whether there might be any useful functions in database-land (eg SQLite), which I know next to nothing about.
I know very little about SQLite. A quick search suggests there is something. There also appears to be a function for PHP. (I didn't spend long enough to discover whether this requires an SQL database or can be used on a list. I think it's the latter.) I put "fuzzy match" and "fuzzy match PHP" into a search engine (DuckDuckGo) and a list of useful-looking results emerged. (Similar list in Google.)
 
There also appears to be a function for PHP
There is a demo website for that, but it produces nothing when confronted by a sample of my real data (I suspect it contains too many space-delimited "words" in each search string).

Meanwhile, I've done it manually (tedious), though it may well come up again.
 