
Aw man. I was using "WTF-8" to mean "double UTF-8", as I described most recently at [1]. Double UTF-8 is that unintentionally popular encoding where someone takes UTF-8, accidentally decodes it as their favorite single-byte encoding such as Windows-1252, then encodes those characters as UTF-8.

[1] http://weblog.luminoso.com/2015/05/21/ftfy-fixes-text-for-you-...

It was such a perfect abbreviation, but now I probably shouldn't use it, as it would be confused with Simon Sapin's WTF-8, which people would actually use on purpose.


> ÃÆ'ÂÆ'‚ÃÆ'‚ the future of publishing at W3C

That is an amazing example.

It's not even "double UTF-8", it's UTF-8 six times (including the one to get it on the Web), it's been decoded as Latin-1 twice and Windows-1252 three times, and at the end there's a non-breaking space that's been converted to a space. All to represent what originated as a single non-breaking space anyway.

Which makes me happy that my module solves it.

    >>> from ftfy.fixes import fix_encoding_and_explain
    >>> fix_encoding_and_explain("ÃÆ'ÂÆ'‚ÃÆ'‚ the future of publishing at W3C")
    ('\xa0the future of publishing at W3C',
     [('encode', 'sloppy-windows-1252', 0),
      ('transcode', 'restore_byte_a0', 2),
      ('decode', 'utf-8-variants', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'sloppy-windows-1252', 0),
      ('decode', 'utf-8', 0),
      ('encode', 'latin-1', 0),
      ('decode', 'utf-8', 0)])


Neato! I wrote a shitty version of 50% of that two years ago, when I was tasked with uncooking a bunch of data in a MySQL database as part of a larger migration to UTF-8. I hadn't done that much pencil-and-paper bit manipulation since I was 13.


Awesome module! I wonder if anyone else had ever managed to reverse-engineer that tweet before.


I love this.

    The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE",
    "RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE"
    in this memo are to be interpreted as described in [RFC2119].


You really want to call this WTF(8)? Is it April 1st today? Am I the only one that thought this article is about a new funny project that is called "what the fuck" encoding, like when somebody announced he had written a to_nil gem https://github.com/mrThe/to_nil ;) Sorry but I can't stop laughing.


This is intentional. I wish we didn't have to do stuff like this, but we do and that's the "what the fuck". All because the Unicode Committee in 1989 really wanted 16 bits to be enough for everybody, and of course it wasn't.


The mistake is older than that. Wide character encodings in general are just hopelessly flawed.

WinNT, Java and a lot more software use wide character encodings UCS2/UTF-16(/UTF-32?). And it was added to C89/C++ (wchar_t). WinNT actually predates the Unicode standard by a year or so. http://en.wikipedia.org/wiki/Wide_character , http://en.wikipedia.org/wiki/Windows_NT#Development

Converting between UTF-8 and UTF-16 is wasteful, though often necessary.

> wide characters are a hugely flawed idea [parent post]

I know. Back in the early nineties they thought otherwise and were proud that they used it, in hindsight. But nowadays UTF-8 is usually the better choice (except for maybe some Asian and exotic later-added languages that may require more space with UTF-8) - I am not saying UTF-16 would be a better option then, there are certain other encodings for special cases.

And as the linked article explains, UTF-16 is a huge mess of complexity with back-dated validation rules that had to be added because it stopped being a wide-character encoding when the new code points were added. UTF-16, when implemented correctly, is actually significantly more complicated to get right than UTF-8.

UTF-32/UCS-4 is quite simple, though obviously it imposes a 4x penalty on bytes used. I don't know of anything that uses it in practice, though surely something does.

Again: wide characters are a hugely flawed idea.

Sure, go to 32 bits per character. But it cannot be said to be "simple" and will not let you make the assumption that 1 integer = 1 glyph.

Namely it won't save you from the following problems:

    * Precomposed vs multi-codepoint diacritics (Do you write á with
      one 32-bit char or with two? If it's Unicode the answer is both)
    * Variation selectors (see also Han unification)
    * Bidi, RTL and LTR embedding chars
And possibly others I don't know about. I feel like I am learning of these dragons all the time.

I almost like that UTF-16 and more so UTF-8 break the "one character, one glyph" rule, because it gets you in the mindset that this is bogus. Because in Unicode it is most decidedly bogus, even if you switch to UCS-4 in a vain attempt to avoid such problems. Unicode just isn't simple any way you slice it, so you might as well shove the complexity in everybody's face and have them confront it early on.
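A minimal Python sketch of the first point in that list (not from the original comment): the same user-perceived character can be one codepoint or two, so even a fixed-width 32-bit view does not give you "one integer = one glyph".

    import unicodedata

    precomposed = "\u00e1"    # á as a single codepoint (U+00E1)
    decomposed = "a\u0301"    # 'a' followed by COMBINING ACUTE ACCENT (U+0301)

    print(precomposed == decomposed)              # False: different codepoint sequences
    print(len(precomposed), len(decomposed))      # 1 2   -- codepoint counts differ
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True after normalization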

You can't use that for storage.

> The mapping between negative numbers and graphemes in this form is not guaranteed constant, even between strings in the same process.


What's your storage requirement that's not adequately solved by the existing encoding schemes?


What are you suggesting, store strings in UTF8 then "normalize" them into this baroque format whenever you load/save them purely so that offsets correspond to character clusters? Doesn't seem worth the overhead to my eyes.

In-memory string representation rarely corresponds to on-disk representation.

Various programming languages (Java, C#, Objective-C, JavaScript, ...) as well as some well-known libraries (ICU, Windows API, Qt) use UTF-16 internally. How much data do you have lying around that's UTF-16?

Sure, more recently, Go and Rust have decided to go with UTF-8, but that's far from common, and it does have some drawbacks compared to the Perl6 (NFG) or Python3 (latin-1, UCS-2, UCS-4 as appropriate) model if you have to do actual processing instead of just passing opaque strings around.

Also note that you have to go through a normalization step anyway if you don't want to be tripped up by having multiple ways to represent a single grapheme.

NFG enables O(N) algorithms for character level operations.

The overhead is entirely wasted on code that does no character level operations.

For code that does do some character level operations, avoiding quadratic behavior may pay off handsomely.

i think linux/mac systems default to UCS-4, certainly the libc implementations of wcs* do.

i agree its a flawed idea though. 4 billion characters seems like enough for now, but i'd guess UTF-32 will need extending to 64 too... and really how about decoupling the size from the data entirely? it works well enough in the general case of /every type of data we know about/ that i'm pretty sure this specialised use case is not very special.

The Unixish C runtimes of the world use a 4-byte wchar_t. I'm not aware of anything in "Linux" that actually stores or operates on 4-byte character strings. Obviously some software somewhere must, but the overwhelming majority of text processing on your linux box is done in UTF-8.

That's not remotely comparable to the situation in Windows, where file names are stored on disk in a 16 bit not-quite-wide-character encoding, etc... And it's leaked into firmware. GPT partition names and UEFI variables are 16 bit despite never once being used to store anything but ASCII, etc... All that software is, broadly, incompatible and buggy (and of questionable security) when faced with new code points.

We don't even have 4 billion characters possible now. The Unicode range is only 0-10FFFF, and UTF-16 can't represent any more than that. So UTF-32 is restricted to that range too, despite what 32 bits would allow, never mind 64.

But we don't seem to be running out -- Planes 3-13 are completely unassigned so far, covering 30000-DFFFF. That's about 65% of the Unicode range completely untouched, and planes 1, 2, and 14 still have big gaps too.

> But we don't seem to be running out

The issue isn't the quantity of unassigned codepoints, it's how many private use ones are available, only 137,000 of them. Publicly available private use schemes such as ConScript are fast filling up this space, mainly by encoding block characters in the same way Unicode encodes Korean Hangul, i.e. by using a formula over a small set of base components to generate all the block characters.

My own surrogate scheme, UTF-88, implemented in Go at https://github.com/gavingroovygrover/utf88 , expands the number of UTF-8 codepoints to 2 billion as originally specified by using the top 75% of the private use codepoints as 2nd tier surrogates. This scheme can easily be fitted on top of UTF-16 instead. I've taken the liberty in this scheme of making 16 planes (0x10 to 0x1F) available as private use; the rest are unassigned.

I created this scheme to help in using a formulaic method to generate a commonly used subset of the CJK characters, perhaps in the codepoints which would be 6 bytes under UTF-8. It would be more difficult than the Hangul scheme because CJK characters are built recursively. If successful, I'd look at pitching the UTF-88 surrogation scheme for UTF-16 and having UTF-8 and UTF-32 officially extended to 2 billion characters.


NFG uses the negative numbers down to about -2 billion as an implementation-internal private use area to temporarily store graphemes. Enables fast character-based manipulation of strings in Perl 6. Though such negative-numbered codepoints could only be used for private use in data interchange between third parties if UTF-32 was used, because neither UTF-8 (even pre-2003) nor UTF-16 could encode them.


Yes. sizeof(wchar_t) is 2 on Windows and 4 on Unix-like systems, so wchar_t is pretty much useless. That's why C11 added char16_t and char32_t.


I'm wondering how common the "mistake" of storing UTF-16 values in wchar_t on Unix-like systems is? I know I thought I had my code carefully basing whether it was UTF-16 or UTF-32 on the size of wchar_t, only to discover that one of the supposedly portable libraries I used had UTF-16 no matter how big wchar_t was.


Oh ok it's intentional. Thx for explaining the choice of the name. Not just because of the name itself but also by explaining the reason behind the choice, you managed to get my attention. I will try to find out more about this problem, because I guess that as a developer this might have some impact on my work sooner or later and therefore I should at least be aware of it.

to_nil is actually a pretty important function! Completely trivial, obviously, but it demonstrates that there's a canonical way to map every value in Ruby to nil. This is essentially the defining feature of nil, in a sense.

With typing the interest here would be more clear, of course, since it would be more apparent that nil inhabits every type.


The main motivator for this was Servo's DOM, although it ended up getting deployed first in Rust to deal with Windows paths. We haven't determined whether we'll need to use WTF-8 throughout Servo—it may depend on how document.write() is used in the wild.

So we're going to see this on web sites. Oh, joy.

It's time for browsers to start saying no to really bad HTML. When a browser detects a major error, it should put an error bar across the top of the page, with something like "This page may display improperly due to errors in the page source (click for details)". Start doing that for serious errors such as Javascript code aborts, security errors, and malformed UTF-8. Then extend that to pages where the character encoding is ambiguous, and stop trying to guess character encoding.

The HTML5 spec formally defines consistent handling for many errors. That's OK, there's a spec. Stop there. Don't try to outguess new kinds of errors.

No. This is an internal implementation detail, not to be used on the Web.

As to draconian error handling, that's what XHTML is about and why it failed. Just define a somewhat sensible behavior for every input, no matter how ugly.


Yes, that bug is the best place to start. We've future proofed the architecture for Windows, but there is no direct work on it that I'm aware of.


What does the DOM do when it receives a surrogate half from Javascript? I thought that the DOM APIs (e.g. createTextNode, innerHTML setter, setAttribute, HTMLInputElement.value setter, document.write) would all strip out the lone surrogate code units?


In current browsers they'll happily pass around lone surrogates. Nothing special happens to them (vs. any other UTF-16 code unit) till they reach the layout layer (where they obviously cannot be drawn).


I found this through https://news.ycombinator.com/item?id=9609955 -- I find it fascinating the solutions that people come up with to deal with other people's problems without breaking correct code. Rust uses WTF-8 to interact with Windows' UCS2/UTF-16 hybrid, and from a quick look I'm hopeful that Rust's story around handling Unicode should be much nicer than (say) Python or Java.


Have you looked at Python 3 yet? I'm using Python 3 in production for an internationalized website and my experience has been that it handles Unicode pretty well.

Not that great of a read. Stuff like:

> I have been told multiple times now that my point of view is wrong and I don't understand beginners, or that the "text model" has been changed and my request makes no sense.

"The text model has changed" is a perfectly legitimate reason to refuse ideas consequent with the previous text model and inconsistent with the current model. Keeping a coherent, consistent model of your text is a pretty important function of curating a language. I of Python's greatest strengths is that they don't but pile on random features, and keeping old crufty features from previous versions would amount to the aforementioned affair. To dismiss this reasoning is extremely shortsighted.


Many people who prefer Python3's way of handling Unicode are aware of these arguments. It isn't a position based on ignorance.


Hey, never meant to imply otherwise. In fact, even people who have issues with the py3 way often agree that it's still better than 2's.


Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse. Good examples for that are paths and anything that relates to local IO when your locale is C.

> Python 3 doesn't handle Unicode any better than Python 2, it just made it the default string. In all other aspects the situation has stayed as bad as it was in Python 2 or has gotten significantly worse.

Maybe this has been your experience, but it hasn't been mine. Using Python 3 was the single best decision I've made in developing a multilingual website (we support English/German/Spanish). There's not a ton of local IO, but I've upgraded all my personal projects to Python 3.

Your complaint, and the complaint of the OP, seems to be basically, "It's different and I have to change my code, therefore it's bad."

My complaint is not that I have to change my code. My complaint is that Python 3 is an attempt at breaking as little compatibility with Python 2 as possible while making Unicode "easy" to use. They failed to achieve both goals.

Now we have a Python 3 that's incompatible with Python 2 but provides almost no significant benefit, solves none of the big well known problems and introduces quite a few new problems.


I have to disagree, I think using Unicode in Python 3 is currently easier than in any language I've used. It certainly isn't perfect, but it's better than the alternatives. I certainly have spent very little time struggling with it.


That is not quite true, in the sense that more of the standard library has been made unicode-aware, and implicit conversions between unicode and bytestrings have been removed. So if you're working in either domain you get a coherent view, the problem being when you're interacting with systems or concepts which straddle the divide or (even worse) may be in either domain depending on the platform. Filesystem paths is the latter, it's text on OSX and Windows — though possibly ill-formed in Windows — but it's bag-o-bytes in most unices. There Python 2 is only "better" in that issues will probably fly under the radar if you don't prod things too much.

There is no coherent view at all. Bytes still have methods like .upper() that make no sense at all in that context, while unicode strings with these methods are broken because these are locale dependent operations and there is no appropriate API. You can also index, slice and iterate over strings, all operations that you really shouldn't do unless you really know what you are doing. The API in no way indicates that doing any of these things is a problem.

Python 2's handling of paths is not good because there is no good abstraction over different operating systems, treating them as byte strings is a sane lowest common denominator though.

Python 3 pretends that paths can be represented as unicode strings on all OSes, that's not true. That is held up with a very leaky abstraction and means that Python code that treats paths as unicode strings and not as paths-that-happen-to-be-unicode-but-really-arent is broken. Most people aren't aware of that at all and it's definitely surprising.

On top of that, implicit coercions have been replaced with implicit broken guessing of encodings, for example when opening files.

When you say "strings" are you referring to strings or bytes? Why shouldn't you slice or index them? It seems like those operations make sense in either case but I'm sure I'm missing something.

On the guessing encodings when opening files, that's not really a problem. The caller should specify the encoding manually, ideally. If you don't know the encoding of the file, how can you decode it? You could still open it as raw bytes if required.
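A minimal sketch of that advice in Python (the file name is made up for illustration): pass the encoding explicitly instead of relying on the locale default, or fall back to binary mode when the encoding is unknown.

    # Explicit encoding: no dependence on the locale default
    with open("notes.txt", encoding="utf-8") as f:
        text = f.read()

    # Unknown encoding: read raw bytes and decide later
    with open("notes.txt", "rb") as f:
        raw = f.read()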

I used strings to mean both. Byte strings can be sliced and indexed no problem because a byte as such is something you may actually want to deal with.

Slicing or indexing into unicode strings is a problem because it's not clear what unicode strings are strings of. You can look at unicode strings from different perspectives and see a sequence of codepoints or a sequence of characters, both can be reasonable depending on what you want to do. Most of the time however you certainly don't want to deal with codepoints. Python however only gives you a codepoint-level perspective.
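A small sketch of how codepoint-level slicing can bite (the decomposed string is chosen purely for illustration):

    s = "a\u0308bc"      # "äbc" with the umlaut as a separate combining codepoint
    print(len(s))        # 4 codepoints, though a user sees 3 characters
    print(s[:1])         # "a" -- the slice cut the combining mark off its base letter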

Guessing encodings when opening files is a problem precisely because - as you mentioned - the caller should specify the encoding, not just sometimes but always. Guessing an encoding based on the locale or the content of the file should be the exception and something the caller does explicitly.

It slices by codepoints? That's just silly, so we've gone through this whole unicode everywhere process so we can stop thinking about the underlying implementation details but the api forces you to have to deal with them anyway.

Fortunately it's not something I deal with often but thanks for the info, will stop me getting caught out later.


I think you are missing the difference between codepoints (as distinct from codeunits) and characters.

And unfortunately, I'm not any more enlightened as to my misunderstanding.

I get that every different thing (character) is a different Unicode number (code point). To store / transmit these you need some standard (encoding) for writing them down as a sequence of bytes (code units, well depending on the encoding each code unit is made up of different numbers of bytes).

How is any of that in conflict with my original points? Or is some of my above understanding wrong.

I know you have a policy of not replying to people so maybe someone else could step in and clear up my confusion.


Codepoints and characters are not equivalent. A character can consist of one or more codepoints. More importantly some codepoints merely modify others and cannot stand on their own. That means if you slice or index into a unicode string, you might get an "invalid" unicode string back. That is a unicode string that cannot be encoded or rendered in any meaningful way.

Right, ok. I remember something about this - ü can be represented either by a single code point or by the letter 'u' followed by the combining modifier.

As the user of unicode I don't really care about that. If I slice characters I expect a slice of characters. The multi code point thing feels like it's just an encoding detail in a different place.

I guess you need some operations to get to those details if you need. Man, what was the drive behind adding that extra complexity to life?!

Cheers for explaining. That was the piece I was missing.

bytes.upper is the Right Thing when you are dealing with ASCII-based formats. It also has the advantage of breaking in less random ways than unicode.upper.

And I mean, I can't really think of any cross-locale requirements fulfilled by unicode.upper (maybe case-insensitive matching, but then you also want to do lots of other filtering).
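A short Python illustration of that contrast (not from the original thread): bytes.upper only touches ASCII letters, while str.upper applies Unicode case mappings that can even change the length of the string.

    print(b"content-length".upper())              # b'CONTENT-LENGTH' -- only ASCII letters affected
    print("straße".upper())                       # 'STRASSE' -- Unicode uppercasing can change length
    print(len("straße"), len("straße".upper()))   # 6 7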

> There Python 2 is only "better" in that issues will probably fly under the radar if you don't prod things too much.

Ah yes, the JavaScript solution.


Well, Python 3's unicode support is much more complete. As a trivial example, case conversions now cover the whole unicode range. This holds pretty consistently - Python 2's `unicode` was incomplete.

> It is unclear whether unpaired surrogate byte sequences are supposed to be well-formed in CESU-8.

According to the Unicode Technical Report #26 that defines CESU-8[1], CESU-8 is a Compatibility Encoding Scheme for UTF-16 ("CESU"). In fact, the way the encoding is defined, the source data must be represented in UTF-16 prior to converting to CESU-8. Since UTF-16 cannot represent unpaired surrogates, I think it's safe to say that CESU-8 cannot represent them either.

[1] http://www.unicode.org/reports/tr26/

From the article:

> UTF-16 is designed to represent any Unicode text, but it can not represent a surrogate code point pair since the corresponding surrogate 16-bit code unit pairs would instead represent a supplementary code point. Therefore, the concept of Unicode scalar value was introduced and Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

This is all gibberish to me. Can someone explain this in layman's terms?

People used to think 16 bits would be enough for anyone. It wasn't, so UTF-16 was designed as a variable-length, backwards-compatible replacement for UCS-2.

Characters outside the Basic Multilingual Plane (BMP) are encoded as a pair of 16-bit code units. The numeric values of these code units denote codepoints that lie themselves within the BMP. While these values can be represented in UTF-8 and UTF-32, they cannot be represented in UTF-16. Because we want our encoding schemes to be equivalent, the Unicode code space contains a hole where these so-called surrogates lie.

Because not everyone gets Unicode right, real-world data may contain unpaired surrogates, and WTF-8 is an extension of UTF-8 that handles such data gracefully.
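A small Python sketch of that hole in practice: a lone surrogate is a codepoint value but not a Unicode scalar value, so strict UTF-8 refuses it; Python's 'surrogatepass' handler produces the three-byte form that WTF-8 also uses for unpaired surrogates.

    lone = "\ud800"                       # an unpaired surrogate code point

    try:
        lone.encode("utf-8")              # strict UTF-8 rejects surrogates
    except UnicodeEncodeError as e:
        print("UTF-8 refuses:", e.reason)

    print(lone.encode("utf-8", "surrogatepass"))   # b'\xed\xa0\x80'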

I understand that for efficiency we want this to be as fast as possible. Simple compression can take care of the wastefulness of using excessive space to encode text - so it really just leaves efficiency.

If I was to make a first attempt at a variable length, but well defined backwards compatible encoding scheme, I would use something like the number of bits up to (and including) the first 0 bit as defining the number of bytes used for this character. So,

> 0xxxxxxx, 1 byte
> 10xxxxxx, 2 bytes
> 110xxxxx, 3 bytes

We would never run out of codepoints, and legacy applications can simply ignore codepoints they don't understand. We would only waste 1 bit per byte, which seems reasonable given just how many issues encodings usually present. Why wouldn't this work, apart from already existing applications that do not know how to do this?

That's roughly how UTF-8 works, with some tweaks to make it self-synchronizing. (That is, you can jump to the middle of a stream and find the next code point by looking at no more than 4 bytes.)

As to running out of code points, we're limited by UTF-16 (up to U+10FFFF). Both UTF-32 and UTF-8 unchanged could go up to 32 bits.
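For reference, a quick Python check of how UTF-8's actual lead-byte scheme plays out (the sample characters are arbitrary):

    for ch in ["A", "é", "€", "😀"]:
        encoded = ch.encode("utf-8")
        # The first byte's leading 1-bits give the sequence length (0xxxxxxx = 1 byte,
        # 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4); continuation bytes look like 10xxxxxx.
        print(ch, len(encoded), [f"{b:08b}" for b in encoded])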


Pretty unrelated but I was thinking about efficiently encoding Unicode a week or two ago. I think there might be some value in a fixed length encoding but UTF-32 seems a bit wasteful. With Unicode requiring 21 (20.09) bits per code point, packing 3 code points into 64 bits seems an obvious idea. But would it be worth the hassle, for example as internal encoding in an operating system? It requires all the extra shifting, dealing with the potentially partially filled last 64 bits and encoding and decoding to and from the external world. Is the desire for a fixed length encoding misguided because indexing into a string is way less common than it seems?
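A toy sketch of the packing idea being floated here (purely illustrative, not an existing format): three 21-bit code points per 64-bit word, with exactly the kind of shifting the comment worries about.

    def pack3(a, b, c):
        # Pack three code points (each < 2**21) into one 64-bit integer.
        return (a << 42) | (b << 21) | c

    def unpack3(word):
        mask = (1 << 21) - 1
        return (word >> 42) & mask, (word >> 21) & mask, word & mask

    cps = [ord(ch) for ch in "a€😀"]
    word = pack3(*cps)
    assert [chr(cp) for cp in unpack3(word)] == ["a", "€", "😀"]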

When you use an encoding based on integral bytes, you can use the hardware-accelerated and often parallelized "memcpy" bulk byte moving hardware features to manipulate your strings.

But inserting a codepoint with your approach would require all downstream bits to be shifted within and across bytes, something that would be a much bigger computational burden. It's unlikely that anyone would consider saddling themselves with that for a mere 25% space savings over the dead-simple and memcpy-able UTF-32.

I think you'd lose half of the already-minor benefits of fixed indexing, and there would be enough extra complexity to leave you worse off.

In addition, there's a 95% chance you're not dealing with enough text for UTF-32 to hurt. If you're in the other 5%, then a packing scheme that's 1/3 more efficient is still going to hurt. There's no good use case.

Coding for variable-width takes more effort, but it gives you a better result. You can split strings appropriate to the use. Sometimes that's code points, but more often it's probably characters or bytes.

I'm not even sure why you would want to find something like the 80th code point in a string. It's rare enough to not be a top priority.


Yes. For example, this allows the Rust standard library to convert &str (UTF-8) to &std::ffi::OsStr (WTF-8 on Windows) without converting or even copying data.


An interesting possible application for this is JSON parsers. If JSON strings contain unpaired surrogate code points, they could either throw an error or encode as WTF-8. I bet some JSON parsers think they are converting to UTF-8, but are really converting to GUTF-8.
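For instance, a Python sketch of that scenario: the standard json module happily decodes a lone surrogate escape, and what the parser does with the resulting string determines whether it ends up with an error, strict UTF-8, or WTF-8/GUTF-8-style bytes.

    import json

    s = json.loads('"\\ud800 oops"')   # a lone surrogate escape is accepted without error
    print(repr(s))                     # '\ud800 oops'

    try:
        s.encode("utf-8")              # ...but the result is not encodable as strict UTF-8
    except UnicodeEncodeError:
        print("not valid UTF-8 -- emitting bytes here needs WTF-8/GUTF-8 or an error")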

The name is unserious but the project is very serious, its author has responded to a few comments and linked to a presentation of his on the subject[0]. It's an extension of UTF-8 used to bridge UTF-8 and UCS2-plus-surrogates: while UTF8 is the modern encoding you have to interact with legacy systems, for UNIX's bags of bytes you may be able to assume UTF8 (possibly ill-formed) but a number of other legacy systems used UCS2 and added visible surrogates (rather than proper UTF-16) afterwards.

Windows and NTFS, Java, UEFI, Javascript all work with UCS2-plus-surrogates. Having to interact with those systems from a UTF8-encoded world is an issue because they don't guarantee well-formed UTF-16, they might contain unpaired surrogates which can't be decoded to a codepoint allowed in UTF-8 or UTF-32 (neither allows unpaired surrogates, for obvious reasons).

WTF8 extends UTF8 with unpaired surrogates (and unpaired surrogates only, paired surrogates from valid UTF16 are decoded and re-encoded to a proper UTF8-valid codepoint) which allows interaction with legacy UCS2 systems.

WTF8 exists solely as an internal encoding (in-memory representation), but it's very useful there. It was initially created for Servo (which may need it to have a UTF8 internal representation yet properly interact with javascript), but turned out to first be a boon to Rust's OS/filesystem APIs on Windows.

[0] http://exyr.org/2015/!!Con_WTF-8/slides.pdf

> WTF8 exists solely as an internal encoding (in-memory representation)

Today.

Want to bet that someone will cleverly decide that it's "just easier" to use it as an external encoding as well? This kind of cat always gets out of the bag eventually.


Better WTF8 than invalid UCS2-plus-surrogates. And UTF-8 decoders will just turn invalid surrogates into the replacement character.


I thought he was tackling the other problem, which is that you often find web pages that have both UTF-8 codepoints and single bytes encoded as ISO-latin-1 or Windows-1252

The nature of unicode is that there's always a problem you didn't (but should) know existed.

And because of this global confusion, everybody of import ends up implementing something that somehow does something moronic - so then everyone else has yet another problem they didn't know existed and they all fall into a self-harming spiral of depravity.


Some time ago, I made some ASCII art to illustrate the various steps where things can go wrong:

    [user-perceived characters]
                ^
                |
                v
    [grapheme clusters] <-> [characters]
                ^                   ^
                |                   |
                v                   v
            [glyphs]           [codepoints] <-> [code units] <-> [bytes]


So basically it goes wrong when someone assumes that any two of the above are "the same thing". It's often implicit.

That's certainly one important source of errors. An obvious case would be treating UTF-32 as a fixed-width encoding, which is bad because you might end up cutting grapheme clusters in half, and you can easily forget about normalization if you think about it that way.

Then, it's possible to make mistakes when converting between representations, eg getting endianness wrong.

Some issues are more subtle: In principle, the decision what should be considered a single character may depend on the language, nevermind the debate about Han unification - but as far as I'm concerned, that's a WONTFIX.

Let me see if I have this straight. My understanding is that WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.

Below is all the background I had to learn about to understand the motivation/details.

UCS-2 was designed as a 16-bit fixed-width encoding. When it became clear that 64k code points wasn't enough for Unicode, UTF-16 was invented to deal with the fact that UCS-2 was assumed to be fixed-width, but no longer could be.

The solution they settled on is weird, but has some useful properties. Basically they took a couple code point ranges that hadn't been assigned yet and allocated them to a "Unicode within Unicode" coding scheme. This scheme encodes (1 big code point) -> (2 small code points). The small code points will fit in UTF-16 "code units" (this is our name for each two-byte unit in UTF-16). And for some more terminology, "big code points" are called "supplementary code points", and "small code points" are called "BMP code points."
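A small Python sketch of that (1 big code point) -> (2 small code points) mapping, using the arithmetic from the UTF-16 definition:

    def to_surrogate_pair(cp):
        # Map a supplementary code point (U+10000..U+10FFFF) to its surrogate pair.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000                   # 20 bits remain
        high = 0xD800 + (v >> 10)          # top 10 bits -> high (lead) surrogate
        low = 0xDC00 + (v & 0x3FF)         # bottom 10 bits -> low (trail) surrogate
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F4A9)])   # ['0xd83d', '0xdca9']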

The weird thing about this scheme is that we bothered to make the "2 small code points" (known as a "surrogate" pair) into real Unicode code points. A more normal thing would be to say that UTF-16 code units are totally separate from Unicode code points, and that UTF-16 code units have no meaning outside of UTF-16. A number like 0xd801 could have a code unit meaning as part of a UTF-16 surrogate pair, and also be a totally unrelated Unicode code point.

But the one nice property of the way they did this is that they didn't break existing software. Existing software assumed that every UCS-2 character was also a code point. These systems could be updated to UTF-16 while preserving this assumption.

Unfortunately it made everything else more complicated. Because now:

- UTF-16 can be ill-formed if it has any surrogate code units that don't pair properly.

- we have to figure out what to do when these surrogate code points — code points whose only purpose is to help UTF-16 break out of its 64k limit — occur outside of UTF-16.

This becomes particularly complicated when converting UTF-16 -> UTF-8. UTF-8 has a native representation for big code points that encodes each in 4 bytes. But since surrogate code points are real code points, you could imagine an alternative UTF-8 encoding for big code points: make a UTF-16 surrogate pair, then UTF-8 encode the two code points of the surrogate pair (hey, they are real code points!) into UTF-8. But UTF-8 disallows this and only allows the canonical, 4-byte encoding.

If you feel this is unjust and UTF-8 should be allowed to encode surrogate code points if it feels like it, then you might like Generalized UTF-8, which is exactly like UTF-8 except this is allowed. It's easier to convert from UTF-16, because you don't need any specialized logic to recognize and handle surrogate pairs. You still need this logic to go in the other direction though (GUTF-8 -> UTF-16), since GUTF-8 can have big code points that you'd need to encode into surrogate pairs for UTF-16.

If you like Generalized UTF-8, except that you always want to use surrogate pairs for big code points, and you want to totally disallow the UTF-8-native 4-byte sequence for them, you might like CESU-8, which does this. This makes both directions of CESU-8 <-> UTF-16 easy, because neither conversion requires special handling of surrogate pairs.
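A Python sketch of the contrast between the two representations of a big code point (the CESU-8-style bytes are produced here by encoding the surrogate pair with 'surrogatepass', since Python has no built-in CESU-8 codec):

    ch = "\U0001F4A9"                    # a supplementary code point

    utf8 = ch.encode("utf-8")            # canonical 4-byte UTF-8 form
    high, low = 0xD83D, 0xDCA9           # its UTF-16 surrogate pair
    cesu8_style = (chr(high) + chr(low)).encode("utf-8", "surrogatepass")   # 6 bytes

    print(utf8)          # b'\xf0\x9f\x92\xa9'
    print(cesu8_style)   # b'\xed\xa0\xbd\xed\xb2\xa9'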

A nice property of GUTF-8 is that it can round-trip any UTF-16 sequence, even if it's ill-formed (has unpaired surrogate code points). It's pretty easy to get ill-formed UTF-16, because many UTF-16-based APIs don't enforce well-formedness.

But both GUTF-8 and CESU-8 have the drawback that they are not UTF-8 compatible. UTF-8-based software isn't generally expected to decode surrogate pairs — surrogates are supposed to be a UTF-16-only peculiarity. Most UTF-8-based software expects that once it performs UTF-8 decoding, the resulting code points are real code points ("Unicode scalar values", which make up "Unicode text"), not surrogate code points.

So basically what WTF-8 says is: encode all code points as their real code point, never as a surrogate pair (like UTF-8, unlike GUTF-8 and CESU-8). However, if the input UTF-16 was ill-formed and contained an unpaired surrogate code point, then you may encode that code point directly with UTF-8 (like GUTF-8, not allowed in UTF-8).

So WTF-8 is identical to UTF-8 for all valid UTF-16 input, but it can also round-trip invalid UTF-16. That is the ultimate goal.

By the way, one thing that was slightly unclear to me in the doc. In section 4.2 (https://simonsapin.github.io/wtf-8/#encoding-ill-formed-utf-...):

> If, on the other hand, the input contains a surrogate code point pair, the conversion will be incorrect and the resulting sequence will not represent the original code points.

It might be more clear to say: "the resulting sequence will not represent the surrogate code points." It might be by some fluke that the user actually intends the UTF-16 to interpret the surrogate sequence that was in the input. And this isn't really lossy, since (AFAIK) the surrogate code points exist for the sole purpose of representing surrogate pairs.

The more interesting case here, which isn't mentioned at all, is that the input contains unpaired surrogate code points. That is the case where the UTF-16 will actually end up being ill-formed.


The encoding that was designed to be fixed-width is called UCS-2. UTF-16 is its variable-length successor.

hmmm... wait... UCS-2 is just a broken UTF-16?!?!

I thought it was a distinct encoding and all related issues were largely imaginary provided you /just/ handle things right...

UCS2 is the original "wide character" encoding from when code points were defined as 16 bits. When codepoints were extended to 21 bits, UTF-16 was created as a variable-width encoding compatible with UCS2 (so UCS2-encoded data is valid UTF-16).

Sadly, systems which had previously opted for fixed-width UCS2, exposed that detail as part of a binary layer and wouldn't break compatibility, couldn't keep their internal storage at 16 bit code units and move the external API to 32.

What they did instead was keep their API exposing 16 bit code units and declare it was UTF16, except most of them didn't bother validating anything so they're really exposing UCS2-with-surrogates (not even surrogate pairs since they don't validate the data). And that's how you find lone surrogates traveling through the stars without their mate and shit's all fucked up.

The given history of UTF-16 and UTF-8 is a bit muddled.

> UTF-16 was redefined to be ill-formed if it contains unpaired surrogate 16-bit code units.

This is incorrect. UTF-16 did not exist until Unicode 2.0, which was the version of the standard that introduced surrogate code points. UCS-2 was the 16-bit encoding that predated it, and UTF-16 was designed as a replacement for UCS-2 in order to handle supplementary characters properly.

> UTF-8 was similarly redefined to be ill-formed if it contains surrogate byte sequences.

Not really true either. UTF-8 became part of the Unicode standard with Unicode 2.0, and so incorporated surrogate code point handling. UTF-8 was originally created in 1992, long before Unicode 2.0, and at the time was based on UCS. I'm not actually sure it's relevant to talk about UTF-8 prior to its inclusion in the Unicode standard, but even then, encoding the code point range D800-DFFF was not allowed, for the same reason it was really not allowed in UCS-2, which is that this code point range was unallocated (it was in fact part of the Special Zone, which I am unable to find an actual definition for in the scanned dead-tree Unicode 1.0 book, but I haven't read it cover-to-cover). The distinction is that it was not considered "ill-formed" to encode those code points, so it was perfectly legal to receive UCS-2 that encoded those values, process it, and re-transmit it (as it's legal to process and retransmit text streams that represent characters unknown to the process; the assumption is the process that originally encoded them understood the characters). So technically yes, UTF-8 changed from its original definition based on UCS to one that explicitly considered encoding D800-DFFF as ill-formed, but UTF-8 as it has existed in the Unicode Standard has always considered it ill-formed.

> Unicode text was restricted to not contain any surrogate code point. (This was presumably deemed simpler than only restricting pairs.)

This is a bit of an odd parenthetical. Regardless of encoding, it's never legal to emit a text stream that contains surrogate code points, as these points have been explicitly reserved for the use of UTF-16. The UTF-8 and UTF-32 encodings explicitly consider attempts to encode these code points as ill-formed, but there's no reason to ever permit it in the first place as it's a violation of the Unicode conformance rules to do so. Because there is no process that can possibly have encoded those code points in the first place while conforming to the Unicode standard, there is no reason for any process to try to interpret those code points when consuming a Unicode encoding. Allowing them would just be a potential security hazard (which is the same rationale for treating non-shortest-form UTF-8 encodings as ill-formed). It has nothing to do with simplicity.


Source: https://news.ycombinator.com/item?id=9611710
