UTF discussion (split from Dotfile Madness)


Postby fluffrabbit » 10 Jul 2019, 09:25

As of C++20 you'd better write your own Unicode library. In fact, in any version of C and C++ so far you'd better write your own Unicode library since there is pretty much zero support in all standards.

So for Windows you convert from UTF-8 to UTF-16, copy the result into a NUL-terminated sequence of wchar_t, and then pass it to the WinAPI function you need.

Sounds like a good solution. Going from utf8 to utf16/32 is straightforward. Of course you need custom code for that, but it's pretty trivial. And since apparently wide characters actually exist and I didn't imagine them, that's great.
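For anyone who wants to see what that custom code might look like, here is a minimal sketch of a strict UTF-8 to UTF-16 converter. The name `utf8_to_utf16` and the bool-plus-out-parameter shape are my own assumptions, not anything from a standard library. It rejects overlong forms, encoded surrogates, truncated sequences, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16.
// Returns false (and leaves `out` empty) if the input is ill-formed.
bool utf8_to_utf16(const std::string& in, std::u16string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;       // code point being assembled
        std::size_t len = 0;   // sequence length in bytes
        char32_t min_cp = 0;   // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }
        else { out.clear(); return false; }  // stray continuation or bad lead byte

        if (i + len > in.size()) { out.clear(); return false; }  // truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) { out.clear(); return false; }
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out.clear();
            return false;
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return true;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting units can be copied into a std::wstring and passed on via c_str(). The WinAPI function MultiByteToWideChar with CP_UTF8 is the ready-made alternative if you would rather not roll your own.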

That is not full support because NTFS and WinAPI allow unpaired surrogates in filenames which are ill-formed Unicode.

What is this magick of which you speak?
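Some background on the quoted claim: UTF-16 encodes code points above U+FFFF as a high surrogate (U+D800..U+DBFF) followed by a low surrogate (U+DC00..U+DFFF). A surrogate that shows up without its partner is ill-formed, and a filename made of unchecked 16-bit units can contain one. A rough sketch of the check, where `is_well_formed_utf16` is a made-up name for illustration:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every surrogate in `s` is part of a
// correctly ordered high/low pair.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // High surrogate: the next unit must be a low surrogate.
            if (i + 1 == s.size()) return false;
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;
            ++i;  // skip the low half of the pair
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            // Low surrogate with no preceding high surrogate.
            return false;
        }
    }
    return true;
}
```

A string that fails this check cannot be converted to UTF-8 or UTF-32 without losing or replacing something, which is why such filenames are awkward to round-trip.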

But I'd say this doesn't matter for games because you can't even print an ill-formed string.

No idea. If you can display Unicode text in your game (stb, SDL, FreeType, etc.) then you can display Unicode text in your game.
fluffrabbit
 
Posts: 557
Joined: 11 Apr 2019, 11:17

Re: Dotfile Madness

Postby Lyberta » 10 Jul 2019, 13:20

Deleted.
Last edited by Lyberta on 01 Oct 2021, 09:59, edited 1 time in total.
Lyberta
 
Posts: 765
Joined: 19 Jun 2013, 10:45

Re: Dotfile Madness

Postby fluffrabbit » 10 Jul 2019, 20:12

Gotcha. Most common Earth languages appear in the 16-bit version, I believe called the Great Common Language Table or something similar.

My Unicode displayer can only display utf32 codepoints as translated from utf8, so there are no weird fuzzy wuzzies. From what I have assumed over the years, utf16 is weird legacy crap, but you can well-form it just by ignoring the other 16+ bits of languages such as neolithic proto-Chinese and Klingon.
fluffrabbit
 
Posts: 557
Joined: 11 Apr 2019, 11:17

Re: Dotfile Madness

Postby Lyberta » 11 Jul 2019, 10:25

Deleted.
Last edited by Lyberta on 01 Oct 2021, 10:00, edited 1 time in total.
Lyberta
 
Posts: 765
Joined: 19 Jun 2013, 10:45

Re: Dotfile Madness

Postby fluffrabbit » 11 Jul 2019, 13:49

FGD is BMP AF.

Prefer the term scalar value because only scalar values can be in well-formed UTF-32.

I do not understand. So far, there is only UTF-32. You don't got bigger codes than that. UTF-16 is legacy crap so that is excluded from most discussions. So is there a difference between a "UTF-32 character", a "codepoint", a "code point", or a "scalar"? At this point in time, I think not.
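For what it's worth, the Unicode Standard does draw a line here: a code point is any value from U+0000 to U+10FFFF, while a scalar value is a code point outside the surrogate range U+D800..U+DFFF. A 32-bit unit can hold values that are neither. A small sketch of the distinction, with function names chosen for illustration only:

```cpp
#include <string>

// Any value in the Unicode codespace, U+0000..U+10FFFF.
constexpr bool is_code_point(char32_t c) { return c <= 0x10FFFF; }

// Code points reserved for UTF-16 surrogates, U+D800..U+DFFF.
constexpr bool is_surrogate(char32_t c) { return c >= 0xD800 && c <= 0xDFFF; }

// A scalar value is a code point that is not a surrogate.
constexpr bool is_scalar_value(char32_t c) {
    return is_code_point(c) && !is_surrogate(c);
}

// Well-formed UTF-32 contains only scalar values.
bool is_well_formed_utf32(const std::u32string& s) {
    for (char32_t c : s) {
        if (!is_scalar_value(c)) return false;
    }
    return true;
}
```

So U+D800 is a code point but not a scalar value, and 0x110000 fits in 32 bits but is not a code point at all.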

Again, surrogates themselves occupy 16 bit so you have to be careful not to put them carelessly.

I'm assuming that's a UTF-16 problem. UTF-32 has a reserved character range for "vendors" or somesuch I think. I dunno. Standard codepoints/scalars are standard... Betcha can't break a trivial/naive UTF-32 displayer. AFAIK the weird corner cases were sanded away with the introduction of UTF-32, though perhaps that may be a dangerous assumption.
fluffrabbit
 
Posts: 557
Joined: 11 Apr 2019, 11:17

Re: Dotfile Madness

Postby Lyberta » 11 Jul 2019, 14:31

Deleted.
Last edited by Lyberta on 01 Oct 2021, 10:00, edited 1 time in total.
Lyberta
 
Posts: 765
Joined: 19 Jun 2013, 10:45

Re: Dotfile Madness

Postby fluffrabbit » 11 Jul 2019, 17:06

Clearly surrogates are an important part of the Unicode standard. Eventually I suppose Unicode will add a free-for-all bit that instructs the renderer to treat the code point/scalar as a bitmap and render little 8x8 characters or whatever. Always expanding to add new symbols.

You clearly have a deep understanding of this, and I see Unicode more as a tool that doesn't quite mesh with my priorities, even though I pay it lip service in my programs and recognize its importance.

A really naive displayer would do the cell thing. My displayer does horizontal kerning using a scaling factor based on a completely arbitrary magic number calculation that looks good on English language fonts. The diacritic trick results in exactly the output you describe, though judging by all the missing character boxes it's likely due to the lack of goodies in IBM Plex Serif Regular. The emoji situation is even worse because most emoji are packed as separate "emoji fonts" which lack ASCII characters, so you need some kind of multi-font setup or manual concatenation of fonts, which is like mixing Pepsi and Coke if they're from different vendors. Ugh.
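To illustrate the "cell thing" versus the diacritic case: "e" followed by U+0301 (combining acute accent) is two scalar values but should occupy one cell. The sketch below is deliberately crude and assumes that only the Combining Diacritical Marks block (U+0300..U+036F) is zero-width. The function names are invented for this example, and real text needs proper segmentation from a library such as ICU.

```cpp
#include <cstddef>
#include <string>

// Assumption: only U+0300..U+036F is treated as a combining mark.
// Real text needs the full Unicode property tables.
constexpr bool is_combining_mark(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

// The naive approach: one cell per scalar value.
std::size_t cells_naive(const std::u32string& s) { return s.size(); }

// One cell per base character; a combining mark stacks on the previous
// cell. A mark with no base before it still gets a cell of its own.
std::size_t cells_with_marks(const std::u32string& s) {
    std::size_t cells = 0;
    for (char32_t c : s) {
        if (!is_combining_mark(c) || cells == 0) ++cells;
    }
    return cells;
}
```

With the input "e" plus U+0301, the naive count is 2 and the mark-aware count is 1, which is the gap between a cell-per-code-point displayer and what a reader expects to see.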
fluffrabbit
 
Posts: 557
Joined: 11 Apr 2019, 11:17

Re: UTF discussion (split from Dotfile Madness)

Postby sago007 » 15 Jul 2019, 22:14

That is not full support because NTFS and WinAPI allow unpaired surrogates in filenames which are ill-formed Unicode.

The Windows API has backward compatibility for programs written for UCS-2. Linux file systems also allow filenames that are not valid UTF-8, because filenames there are just byte sequences and other encodings are allowed.
Correcting the filenames when changing encoding is left as an exercise to the reader.

I do suffer from the surrogate problem in my professional life because there is a very popular Java web service framework that cannot encode UTF-8 correctly.
sago007
 
Posts: 11
Joined: 16 Jul 2017, 10:06

Re: UTF discussion (split from Dotfile Madness)

Postby GunChleoc » 31 Jul 2019, 22:06

I use http://site.icu-project.org/ for segmentation of Unicode strings. It also has lots of nifty locale implementation stuff that I haven't dug into yet.
GunChleoc
 
Posts: 502
Joined: 20 Sep 2012, 22:45
