UTF discussion (split from Dotfile Madness)

Postby fluffrabbit » 10 Jul 2019, 09:25

As of C++20, you'd better write your own Unicode library. In fact, in any version of C and C++ so far you'd better write your own, since there is pretty much zero support in any of the standards.

So for Windows you convert from UTF-8 to UTF-16, copy the result into a NUL-terminated sequence of wchar_t, and then feed it to the WinAPI function you need.

Sounds like a good solution. Going from utf8 to utf16/32 is straightforward. Of course you need custom code for that, but it's pretty trivial. And since apparently wide characters actually exist and I didn't imagine them, that's great.
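
For the record, the "trivial custom code" could look something like this sketch (the name DecodeUtf8 is made up, and overlong-form checks are skipped, so this is not a fully validating decoder):

{l Code}: {l Select All Code}
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch of a hand-rolled UTF-8 to UTF-32 decoder.
std::u32string DecodeUtf8(const std::string& in)
{
   std::u32string out;
   for (std::size_t i = 0; i < in.size();)
   {
      const auto b = static_cast<unsigned char>(in[i]);
      char32_t cp;
      std::size_t len;
      if (b < 0x80)              { cp = b;        len = 1; } // ASCII
      else if ((b >> 5) == 0x06) { cp = b & 0x1F; len = 2; } // 110xxxxx
      else if ((b >> 4) == 0x0E) { cp = b & 0x0F; len = 3; } // 1110xxxx
      else if ((b >> 3) == 0x1E) { cp = b & 0x07; len = 4; } // 11110xxx
      else throw std::runtime_error{"Invalid UTF-8 lead byte."};
      if (i + len > in.size())
      {
         throw std::runtime_error{"Truncated UTF-8 sequence."};
      }
      for (std::size_t j = 1; j < len; ++j)
      {
         const auto c = static_cast<unsigned char>(in[i + j]);
         if ((c >> 6) != 0x02) // continuation bytes are 10xxxxxx
         {
            throw std::runtime_error{"Invalid UTF-8 continuation byte."};
         }
         cp = (cp << 6) | (c & 0x3F);
      }
      // Surrogates and values above 10FFFF are ill-formed in any encoding form.
      if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
      {
         throw std::runtime_error{"Ill-formed scalar value."};
      }
      out.push_back(cp);
      i += len;
   }
   return out;
}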

That is not full support because NTFS and WinAPI allow unpaired surrogates in filenames, which is ill-formed Unicode.

What is this magick of which you speak?

But I'd say this doesn't matter for games because you can't even print an ill-formed string.

No idea. If you can display Unicode text in your game (stb, SDL, Freetype, etc.) then you can display Unicode text in your game.

Re: Dotfile Madness

Postby Lyberta » 10 Jul 2019, 13:20

fluffrabbit {l Wrote}:
That is not full support because NTFS and WinAPI allow unpaired surrogates in filenames, which is ill-formed Unicode.

What is this magick of which you speak?


A Unicode code point is any value between 0 and 10FFFF. A Unicode scalar value is any Unicode code point except the surrogates. Surrogates are used in UTF-16 to encode scalar values greater than FFFF; it takes 2 surrogates to encode one such scalar value. Well-formed Unicode only contains Unicode scalar values, so encoding surrogate code points in UTF-8 or UTF-32, or having unpaired surrogates in UTF-16, makes the text ill-formed.
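
A sketch of that encoding step, if it helps (the function is mine, not from any standard library):

{l Code}: {l Select All Code}
#include <stdexcept>
#include <utility>

// Sketch: encode one scalar value above FFFF as a UTF-16 surrogate pair.
// Per the Unicode standard: subtract 0x10000, then split the remaining
// 20 bits into two 10-bit halves.
std::pair<char16_t, char16_t> EncodeSurrogatePair(char32_t scalar)
{
   if (scalar < 0x10000 || scalar > 0x10FFFF)
   {
      throw std::domain_error{"Not a supplementary-plane scalar value."};
   }
   scalar -= 0x10000;
   const auto high = static_cast<char16_t>(0xD800 + (scalar >> 10));  // lead surrogate
   const auto low = static_cast<char16_t>(0xDC00 + (scalar & 0x3FF)); // trail surrogate
   return {high, low};
}

// Example: U+1F3F3 (waving white flag) becomes the pair {0xD83C, 0xDFF3}.
// Either unit appearing alone in a UTF-16 stream is an unpaired surrogate.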

No idea. If you can display Unicode text in your game (stb, SDL, Freetype, etc.) then you can display Unicode text in your game.


Well, if we say that ill-formed Unicode is not really Unicode, then a library can print well-formed Unicode but throw an exception or print a replacement character when there is an ill-formed sequence in the string.

Re: Dotfile Madness

Postby fluffrabbit » 10 Jul 2019, 20:12

Gotcha. Most common Earth languages appear in the 16-bit version, I believe called the Great Common Language Table or something similar.

My Unicode displayer can only display utf32 codepoints as translated from utf8, so there are no weird fuzzy wuzzies. From what I have assumed over the years, utf16 is weird legacy crap, but you can well-form it just by ignoring the other 16+ bits of languages such as neolithic proto-Chinese and Klingon.

Re: Dotfile Madness

Postby Lyberta » 11 Jul 2019, 10:25

fluffrabbit {l Wrote}:Gotcha. Most common Earth languages appear in the 16-bit version, I believe called the Great Common Language Table or something similar.


It's called the Basic Multilingual Plane, and the most common language - emoji - mostly sits outside of it.

OH LOL. I wanted to put rainbow flag emoji I copied from here and got this:


General Error
SQL ERROR [ mysqli ]

Incorrect string value: '\xF0\x9F\x8F\xB3\xEF\xB8...' for column 'post_text' at row 1 [1366]

An SQL error occurred while fetching this page. Please contact the Board Administrator if this problem persists.

Please notify the board administrator or webmaster: hagish@schattenkind.net


See what you get when you don't support Unicode properly?

fluffrabbit {l Wrote}:My Unicode displayer can only display utf32 codepoints as translated from utf8, so there are no weird fuzzy wuzzies.


Again, the word is "code point" (notice the space), and it's unhelpful here. Prefer the term scalar value, because only scalar values can appear in well-formed UTF-32.

fluffrabbit {l Wrote}:From what I have assumed over the years, utf16 is weird legacy crap, but you can well-form it just by ignoring the other 16+ bits of languages such as neolithic proto-Chinese and Klingon.


Again, surrogates themselves occupy 16 bits, so you have to be careful not to emit them carelessly. And if you ignore scalar values outside of the BMP, you become a homophobe like this forum, which doesn't allow me to display my LGBT pride flag.

Re: Dotfile Madness

Postby fluffrabbit » 11 Jul 2019, 13:49

FGD is BMP AF.

Prefer the term scalar value, because only scalar values can appear in well-formed UTF-32.

I do not understand. So far, there is only UTF-32; there are no bigger codes than that. UTF-16 is legacy crap, so that is excluded from most discussions. So is there a difference between a "UTF-32 character", a "codepoint", a "code point", or a "scalar"? At this point in time, I think not.

Again, surrogates themselves occupy 16 bits, so you have to be careful not to emit them carelessly.

I'm assuming that's a UTF-16 problem. UTF-32 has a reserved character range for "vendors" or somesuch, I think. I dunno. Standard scalars are standard... Betcha can't break a trivial/naive UTF-32 displayer. AFAIK the weird corner cases were sanded away with the introduction of UTF-32, though perhaps that may be a dangerous assumption.

Re: Dotfile Madness

Postby Lyberta » 11 Jul 2019, 14:31

fluffrabbit {l Wrote}:I do not understand. So far, there is only UTF-32; there are no bigger codes than that. UTF-16 is legacy crap, so that is excluded from most discussions. So is there a difference between a "UTF-32 character", a "codepoint", a "code point", or a "scalar"? At this point in time, I think not.


There is really no such thing as a "character" in Unicode. "Codepoint" is an incorrect spelling of "code point". There is also the "code unit", which is the smallest unit of data in an encoding form. I should probably write an article that clarifies these things. In UTF-32, code units occupy 32 bits, so in C they can be represented as uint32_t or char32_t; I guess char32_t makes the most sense there. In C++ I prefer strong types. Here are some snippets from my Unicode library.

Perhaps some code might help.

{l Code}: {l Select All Code}
constexpr UTF32CodeUnit::UTF32CodeUnit(value_type value)
   : m_value{value}
{
   if (((m_value >= 0xD800) && (m_value <= 0xDFFF)) || (m_value > 0x10FFFF))
   {
      throw std::domain_error{"UTF32CodeUnit::UTF32CodeUnit: Invalid value."};
   }
}


This is the constructor of the UTF32CodeUnit class. value_type is char32_t. As you can see, I throw an exception if a user tries to construct a UTF-32 code unit with an illegal value. Values between D800 and DFFF are surrogates, which are not allowed in UTF-32; values above 10FFFF are illegal in all encoding forms for obvious reasons.

Now, another class:

{l Code}: {l Select All Code}
constexpr CodePoint::CodePoint(value_type value)
   : m_value{value}
{
   if (m_value > 0x10FFFF)
   {
      throw std::domain_error{"CodePoint::CodePoint: Invalid value."};
   }
}


This is the constructor of the CodePoint class. As you can see, I allow surrogates here because surrogates are legal code points. For that reason, I don't really use the CodePoint class in my code; it is provided for convenience.

Next:

{l Code}: {l Select All Code}
constexpr ScalarValue::ScalarValue(value_type value)
   : m_value{value}
{
   if (((m_value >= 0xD800) && (m_value <= 0xDFFF)) || (m_value > 0x10FFFF))
   {
      throw std::domain_error{"ScalarValue::ScalarValue: Invalid value."};
   }
}


This is the constructor of the ScalarValue class. Notice how its code is identical to the UTF32CodeUnit ctor. I could have merged the two classes together, but that would make the rest of the library more complicated. I prefer to separate the code unit level from the scalar value level in all cases.

By the way, all three classes use char32_t as the raw value because the domain of raw values is the same. The reason to make different classes is to preserve semantics and to make the obvious bug of assigning one to another a compile-time error.


fluffrabbit {l Wrote}:I'm assuming that's a UTF-16 problem. UTF-32 has a reserved character range for "vendors" or somesuch, I think. I dunno. Standard scalars are standard...


That is a logic problem. You can encode a surrogate code point as "UTF-8", but then it is no longer UTF-8 but WTF-8 - a different encoding that is not part of the Unicode standard. Surrogates and the Private Use Area are different code point ranges.
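
To illustrate with a sketch (the helper is made up): applying the normal 3-byte UTF-8 bit pattern to a lone surrogate produces bytes that a conforming UTF-8 decoder must reject.

{l Code}: {l Select All Code}
#include <array>
#include <cstdint>

// Sketch: apply the 3-byte UTF-8 bit pattern to any code point in
// [0x800, 0xFFFF], including surrogates. For U+D800..U+DFFF the
// result is WTF-8, not UTF-8.
constexpr std::array<std::uint8_t, 3> EncodeThreeBytes(char32_t cp)
{
   return {
      static_cast<std::uint8_t>(0xE0 | (cp >> 12)),         // 1110xxxx
      static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)), // 10xxxxxx
      static_cast<std::uint8_t>(0x80 | (cp & 0x3F))         // 10xxxxxx
   };
}

// EncodeThreeBytes(0xD800) yields ED A0 80 - exactly the byte sequence
// a conforming UTF-8 decoder is required to reject.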

fluffrabbit {l Wrote}:Betcha can't break a trivial/naive UTF-32 displayer. AFAIK the weird corner cases were sanded away with the introduction of UTF-32, though perhaps that may be a dangerous assumption.


I can break a naive algorithm with "combining characters" (terrible name, I know). For example:

T̲̙͖̪͔̈͊̂͑ͅḦ̸̞̱́͑̓̒̓͟͝I͜͢S̗̹͏̞̳̱̄̍̑͗ ̧̩͕̪͓ͤ̀̓̿͟I̙̻̰̮̻ͩ̑̆̎͜S̝̅ ̖͕̦͌̇̂͛̓̓̓A̵̧̡͓̯̤͛ͤ̊͆ ̱̜̦͇̱͌ͤ͋ͦ̾L̛̹̖̦͗̂̾ͣ͏̴O̷̴̴̮̯̞͇̅́͝Ṫ̶̢͚͋̑ͮ̑ͧ͡ ̶̶̡̩̑̑͒ͫͦͅO͏̯̗̂͂̿́ͤͬ͢F̡̧̣̯̜ͩ̄ͮ͋́ ̢̦̙ͯ̾̉̔͡͝͡D̻̩͍̹̩ͩ̔̃ͥ͟İ̦͚̯͉͔̑͂̿͋A̧̯̭͖ͮ̿̇̓͊͟C̸̥̪̞̙̩͆̾̿͐R̢̜͔̥͓̽͑ͣͮ͜I̸̲ͩ̉̏̏ͣ͂̌͟T̨̪̝̱͓̒̀̈́͒ͥI̖͍ͧ͑̌ͣ̆͋̕͡C̡̨͚̻̗͊̀͆̌͢S̶͍̰̬̦ͧ͆̀ͫ͠ ̊͘͏̛͚͎̾I̸̵͙̭̾ͫ̾̈̅͗N̵̨̤̪̥̋̾̾̋͑ ̡̩̥̖͇̂̔̿̽͡B͈̫̦̘̤̝͒̈́ͮ̇M̬͈̻̖͕̯͜͏̴͔P̴̙̦͔̝̩ͭͩ͟ͅ

This is actually an output of a test program for my Unicode library:

{l Code}: {l Select All Code}
/// \file
/// \brief Example program which adds random amount of diacritics to each scalar
/// value of a given string.
/// \author Lyberta
/// \copyright GNU GPLv3 or any later version.

#include <cstdint>
#include <iostream>
#include <iterator>
#include <string>
#include <experimental/random>

#include <ftz/Unicode/ScalarValueSequence.h>

int main(int argc, char* argv[])
try
{
   if (argc < 2)
   {
      std::cerr << "Usage: " << argv[0] << " <input string>\n";
      return 1;
   }
   ftz::Unicode::ScalarValueSequence sequence{std::string{argv[1]}};
   auto scalar_value = std::begin(sequence);
   while (scalar_value != std::end(sequence))
   {
      ++scalar_value;
      auto diacritics = std::experimental::randint(1, 112);
      for (std::uint8_t i = 0; i < diacritics; ++i)
      {
         scalar_value = sequence.insert(scalar_value,
            ftz::Unicode::ScalarValue{static_cast<char32_t>(
            std::experimental::randint(0x300, 0x36F))});
         ++scalar_value;
      }
   }
   std::cout << sequence.GetBuffer() << '\n';
}
catch (std::exception& e)
{
   std::cerr << "Exception caught: " << e.what() << '\n';
   return 1;
}
catch (...)
{
   std::cerr << "Unknown exception.\n";
   return 1;
}


Here I append a random number of scalar values from the Combining Diacritical Marks block after each scalar value of the given string. Notice that their raw values are in the [300, 36F] range, well inside even the BMP. A naive displayer would probably try to allocate an individual cell for each scalar value, which would produce a very wide but mostly empty render, and will probably break all width calculations down the road.
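
A crude mitigation for this particular demo could look like the following sketch. It only knows about the Combining Diacritical Marks block, so any other combining mark still breaks it; real rendering needs grapheme cluster segmentation.

{l Code}: {l Select All Code}
#include <cstddef>
#include <string_view>

// Sketch: count display cells, giving zero advance to scalar values from
// the Combining Diacritical Marks block [U+0300, U+036F] so they stack
// on the preceding cell.
std::size_t NaiveCellCount(std::u32string_view text)
{
   std::size_t cells = 0;
   for (char32_t sv : text)
   {
      if (sv >= 0x300 && sv <= 0x36F)
      {
         continue; // combining mark: rendered over the previous cell
      }
      ++cells;
   }
   return cells;
}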

Re: Dotfile Madness

Postby fluffrabbit » 11 Jul 2019, 17:06

Clearly surrogates are an important part of the Unicode standard. Eventually I suppose Unicode will add a free-for-all bit that instructs the renderer to treat the code point/scalar as a bitmap and render little 8x8 characters or whatever. Always expanding to add new symbols.

You clearly have a deep understanding of this, and I see Unicode more as a tool that doesn't quite mesh with my priorities, even though I pay it lip service in my programs and recognize its importance.

A really naive displayer would do the cell thing. My displayer does horizontal kerning using a scaling factor based on a completely arbitrary magic number calculation that looks good on English language fonts. The diacritic trick results in exactly the output you describe, though judging by all the missing character boxes it's likely due to the lack of goodies in IBM Plex Serif Regular. The emoji situation is even worse because most emoji are packed as separate "emoji fonts" which lack ASCII characters, so you need some kind of multi-font setup or manual concatenation of fonts, which is like mixing Pepsi and Coke if they're from different vendors. Ugh.

Re: UTF discussion (split from Dotfile Madness)

Postby sago007 » 15 Jul 2019, 22:14

That is not full support because NTFS and WinAPI allow unpaired surrogates in filenames, which is ill-formed Unicode.

The Windows API has backward compatibility for programs written for UCS-2. Linux file systems also allow invalid UTF-8 bytes because they allow other encodings.
Correcting the filenames when changing encoding is left as an exercise for the reader.

I do suffer from the surrogate problem in my professional life because there is a very popular Java web service framework that cannot encode UTF-8 correctly.

Re: UTF discussion (split from Dotfile Madness)

Postby GunChleoc » 31 Jul 2019, 22:06

I use http://site.icu-project.org/ for segmentation of Unicode strings. It also has lots of nifty locale implementation stuff that I haven't dug into yet.
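
Counting grapheme clusters with it looks roughly like this (a sketch, assuming ICU 59 or later where UChar is char16_t; error handling trimmed):

{l Code}: {l Select All Code}
#include <iostream>
#include <memory>

#include <unicode/brkiter.h>
#include <unicode/unistr.h>

int main()
{
   UErrorCode status = U_ZERO_ERROR;
   std::unique_ptr<icu::BreakIterator> iter{icu::BreakIterator::createCharacterInstance(
      icu::Locale::getDefault(), status)};
   if (U_FAILURE(status))
   {
      std::cerr << "Failed to create break iterator.\n";
      return 1;
   }
   // "e" followed by two combining diacritics: three scalar values,
   // one grapheme cluster.
   icu::UnicodeString text(u"e\u0301\u0302");
   iter->setText(text);
   iter->first();
   int32_t clusters = 0;
   while (iter->next() != icu::BreakIterator::DONE)
   {
      ++clusters;
   }
   std::cout << "Grapheme clusters: " << clusters << '\n'; // prints 1
}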
