45

It seems to me that Unicode is the "final" character encoding. I cannot imagine anything else replacing it at this point. I'm frankly confused about why UTF-16 and UTF-32 etc. exist at all, not to mention all the non-Unicode character encodings (unless for legacy purposes).

In my system, I've hardcoded UTF-8 as the one and only supported character encoding for my database, my source code files, and any data I create or import to my system. My system internally works solely in UTF-8. I cannot imagine ever needing to change this, for any reason.

Is there a reason I should expect this to change at some point? Will UTF-8 ever become "obsolete" and replaced by "UniversalCode-128" or something, which also includes the alphabets of later discovered nearby galaxies' civilizations?

Timone

9 Answers

36

UTF-8 might not last forever, but you probably don't have to worry too much.

Two universal truths:

  • We can't predict the future.
  • Nothing lasts forever, especially in software.

But that doesn't mean the benefit of (trying to) future-proof your code always outweighs the cost.

Is UTF-8 likely to become obsolete any time soon?

I would say no. UTF-8 is quite common, which makes it harder to replace. Unicode also still has quite a bit of empty space, meaning there isn't likely to be a pressing need to replace it soon. Between 2010 and 2020, fewer than 40k characters were added. At that rate, it would take about 240 years to use up the remaining ~1 million unallocated code points. That is a lot faster than I had imagined, but it is still quite a while away, and assuming the rate stays constant is quite an assumption.

It also doesn't seem like there would be a need to replace it due to a fundamental flaw in the encoding. With other kinds of standards or technologies there may be a security issue that could be exploited, but that doesn't seem likely for a character encoding, which only tells you how characters are stored.

I speculate that if a need to replace it arises, it will be due to inefficiencies or constraints in new technology. Someone could develop some new piece of technology that rethinks how data is stored or loaded, which might make UTF-8 less than ideal or unusable. But there would still be plenty of systems without that technology for quite a few years.

Note that I didn't ask "are we likely to see a new character encoding any time soon". Anyone can create a new standard, but that doesn't mean it will be widely adopted or that it will replace existing standards.

How bad would it be for you if there's a new standard?

Probably not that bad.

Even if there is a new standard that's widely adopted, your system will likely keep functioning for the foreseeable future with little to no changes. There are a lot of legacy systems out there.

If your system doesn't support the new encoding, you may have some issues with the user or other systems trying to send you data you don't support. But your system could still use UTF-8 internally, even if this means you don't support some characters (which might not be good, but it won't necessarily break your system).

Also, if it were to be replaced for a reason other than running out of space (which, as noted above, doesn't seem likely any time soon), UTF-8 could likely be extended to include any characters in the new encoding, meaning you could convert from one encoding to the other where required and UTF-8 would still be usable.

Unicode versus Unicode?

The difference between UTF-8, UTF-16 and UTF-32 seems minor when compared to other (non-Unicode) encodings. They all support the same characters, so it shouldn't be a huge issue if one replaces the other.

If another one of those were to become the widely adopted one, it would probably be trivial to convert between them where required and continue to use UTF-8 everywhere else.

Bernhard Barker
19

When it comes to software, the future always means needing to handle more data: bigger files, and more of them in a shorter period of time. How does UTF-8 processing scale in those situations?

UTF-8 uses a variable number of bytes per character. This saves a lot of space if your text is ASCII plus the occasional emoji or accented letter. But a drawback of a variable-length encoding is that jumping to an arbitrary character position scales linearly with the size of the document. A fixed-width encoding like UTF-32 uses more space, but jumping to a position in the document is constant time. Depending on the size of the document and the speed of the medium you're reading it from, linear-time versus constant-time seeking can make a huge difference in the performance of your application. It is better to be able to trade off space for time, or the reverse, as the situation demands.
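
To make the difference concrete, here is a minimal Python sketch (my own illustration, not part of the answer): finding the byte offset of the nth character requires walking the UTF-8 data, while in a fixed-width encoding like UTF-32 it is a single multiplication.

```python
def utf8_offset_of(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point in well-formed UTF-8: an O(n) walk."""
    offset = 0
    for _ in range(n):
        lead = data[offset]
        if lead < 0x80:        # 0xxxxxxx -> 1-byte sequence (ASCII)
            offset += 1
        elif lead < 0xE0:      # 110xxxxx -> 2 bytes
            offset += 2
        elif lead < 0xF0:      # 1110xxxx -> 3 bytes
            offset += 3
        else:                  # 11110xxx -> 4 bytes
            offset += 4
    return offset

def utf32_offset_of(n: int) -> int:
    """Byte offset of the n-th code point in UTF-32: O(1) arithmetic."""
    return 4 * n

text = "naïve 😀 café"
data = text.encode("utf-8")
# The 7th character (index 6) is the emoji; locating it means scanning in UTF-8 ...
assert data[utf8_offset_of(data, 6):].decode("utf-8")[0] == text[6]
# ... but is a plain multiplication in UTF-32.
assert utf32_offset_of(6) == 24
```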

Kyle Jones
10

UTF-8 might not last forever, but if you permit the long (5- and 6-byte) UTF-8 forms again, it will outlast all other encodings that exist today. I have heard it projected that we will eventually run out of code points that UTF-16 can represent, necessitating abandoning UTF-16. UTF-8, in its long form, can go all the way to 0x7FFFFFFF.

Table from Wikipedia:

Bytes  First cp   Last cp    Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
1      U+0000     U+007F     0xxxxxxx
2      U+0080     U+07FF     110xxxxx  10xxxxxx
3      U+0800     U+FFFF     1110xxxx  10xxxxxx  10xxxxxx
4      U+10000    U+1FFFFF   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
5      U+200000   U+3FFFFFF  111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
6      U+4000000  U+7FFFFFFF 1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
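
As a minimal sketch of my own (not part of the answer), the scheme in the table can be implemented directly; note that modern decoders cap sequences at 4 bytes / U+10FFFF and reject the longer forms.

```python
def encode_long_utf8(cp: int) -> bytes:
    """Encode a code point with the original 1-to-6-byte UTF-8 scheme (up to 0x7FFFFFFF)."""
    if cp < 0x80:
        return bytes([cp])
    # (sequence length, lead-byte prefix, payload bits carried by the lead byte)
    for length, lead, lead_bits in ((2, 0xC0, 5), (3, 0xE0, 4), (4, 0xF0, 3),
                                    (5, 0xF8, 2), (6, 0xFC, 1)):
        if cp < (1 << (lead_bits + 6 * (length - 1))):
            out = bytearray(length)
            for i in range(length - 1, 0, -1):
                out[i] = 0x80 | (cp & 0x3F)    # 10xxxxxx continuation bytes
                cp >>= 6
            out[0] = lead | cp                  # lead byte with the remaining bits
            return bytes(out)
    raise ValueError("code point out of range for 6-byte UTF-8")

assert encode_long_utf8(ord("é")) == "é".encode("utf-8")             # 2-byte form
assert encode_long_utf8(0x7FFFFFFF) == b"\xfd\xbf\xbf\xbf\xbf\xbf"   # 6-byte maximum
```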

There's some debate about how to extend this should the need ever arise. Do we use 1111111x for 0x7FFFFFFF to 0xFFFFFFFF and say larger code points cannot occur, or do we let 11111110 begin a 7-byte sequence carrying 36 bits of code point and permit 11111111 to begin an 8-byte sequence?

Editorial: I do not mind that this answer is mildly controversial. The whole point is that UTF-8 and UTF-32 are more future-proof than any other well-known encoding.

Joshua
9

UTF-8 is an elegant hack to remain backward compatible with ASCII and trivially compatible with Latin-1, which were both widely entrenched when Unicode started to take hold. UTF-8 can be extended further and still remain backward compatible with itself, by adding 5- and 6-byte encodings. So if Unicode decides it needs a few more bits to represent its character repertoire, there will be some hitches to make sure your programs get updated, but your existing data should be just fine. (Just as UTF-16 is backward compatible with UCS-2.) UTF-8 is deeply entrenched, so if it ever becomes obsolete, the new encoding system will almost certainly be backward compatible with UTF-8. Your existing data won't need to be converted, just as your ASCII documents are still perfectly good today.
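
A quick illustration of those compatibility properties (my own sketch, not the answer's): ASCII bytes are already valid UTF-8, and the first 256 Unicode code points line up with Latin-1, so converting Latin-1 data is a purely mechanical re-encoding.

```python
ascii_text = b"plain old ASCII"
# Every ASCII byte is a valid one-byte UTF-8 sequence, so the bytes are identical.
assert ascii_text.decode("ascii") == ascii_text.decode("utf-8")
assert ascii_text.decode("ascii").encode("utf-8") == ascii_text

latin1_bytes = bytes([0xE9, 0xFC])   # 'é', 'ü' in Latin-1
# Latin-1 bytes above 0x7F are *not* valid UTF-8, but each byte value equals
# its Unicode code point, so conversion is a simple re-encoding step.
assert [ord(c) for c in latin1_bytes.decode("latin-1")] == [0xE9, 0xFC]
assert latin1_bytes.decode("latin-1").encode("utf-8") == b"\xc3\xa9\xc3\xbc"
```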

How could UTF-8 become obsolete? It seems like Unicode has so much room for expansion that running out is almost unimaginable. On the other hand...

  • Unicode did run out of space once before, when it was a 16-bit system. I believe the limits of 16 bits were an impetus for Han Unification, which combined substantially similar characters from several East Asian languages.
  • Our idea of text being a serial stream of code points could evolve to the point that the Unicode approach isn't sufficient or scalable. Current best practice is to split content from styling almost completely. But it's hard to get that separation right (see CSS and evolving markup languages). It's not too hard to imagine at least some styling creeping back into the textual representation. Depending on how that's done, it could have a massive multiplicative effect on the scale of Unicode.

    In fact, some of this has already happened. Han Unification largely works, but to render a multilingual document properly, you need to know which spans of CJK symbols are Chinese, Japanese, or Korean, because while the general shapes of the unified symbols are the same and the concepts they represent align, they generally should be drawn with language-specific fonts. If you have just the text and not the styling, it's impossible for a machine to know which strings are from which language. So Unicode has a way to add language tags to get it right (just as you need some special characters to handle some bidi edge cases). This is arguably styling (or, at least, mark-up) embedded directly in the text. And those tags are not widely supported.

  • Emoji. I was surprised that Unicode adopted emoji (beyond a handful to preserve legacy documents). In my mind, it doesn't seem to fit what Unicode set out to do, but the consortium's membership includes smartphone makers. There are many emoji, and the number is growing at a quick pace. Combining characters are used to style your smiley by setting the character's gender, hair color, skin tone, occupation, etc. Emoji are becoming a generative script.

  • Icons. Now that we have a wider range of device resolutions and some extensions to font technology (thanks to emoji), software is turning to fonts for clean resolution-independent icon rendering. Unicode has recognized a couple hundred wing dings, so why not icons? If they can be assigned a semantic meaning (e.g., "SAVE ICON") instead of a descriptive name (e.g., "FLOPPY DISC ICON"), all the better. And if they start to fold in some styling information (e.g., "SAVE ICON", "DISABLED SAVE ICON", "PRESSED SAVE ICON", ...) we could see a massive number of these becoming standardized.

  • Private use. Currently, private use areas are used for icons (as above), for corporate logos, and sometimes even for original names (which I've heard is or was a trend in Japan). Documents with private use code points have semantic gaps and are inherently tied to styling information (custom fonts). I wouldn't be surprised if Unicode eventually starts to allocate dedicated code points to corporate logos, and/or we'll see styling slither into our text documents.

  • Aliens. This won't happen for a long time, but it's easy to imagine alien languages being written in ways that cannot be represented as a linear stream of code points. What if the alien's script cannot be divorced from styling information? What if they have a generative writing system that cannot be reproduced with finite sets of glyphs, combining marks, and shaping rules?

7

Your question appears to slightly conflate two related concepts (as people often do):

  1. Unicode is a standard, whose primary part is a "coded character set" - a list of "code points", and a lot of metadata around them, attempting to catalogue all the world's writing systems. It has a defined "code space" of the numbers 0 to 10FFFF (hexadecimal) inclusive (most of which has not yet been filled with actual defined code points).
  2. UTF-8, UTF-16, and various other "encoding schemes" are ways of storing and transmitting Unicode code points. They can all represent all the code points, present and future, that the Unicode code space can theoretically hold.
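
For example (a Python sketch of my own), the same code point is one abstract Unicode value but different bytes under each encoding scheme, and all of them round-trip losslessly.

```python
s = "€"  # U+20AC EURO SIGN, a single Unicode code point

print(s.encode("utf-8"))      # bytes E2 82 AC    (3 bytes)
print(s.encode("utf-16-be"))  # bytes 20 AC       (2 bytes)
print(s.encode("utf-32-be"))  # bytes 00 00 20 AC (4 bytes)

# All three schemes represent the same code point and convert freely between each other.
assert s.encode("utf-8").decode("utf-8") == s.encode("utf-32-be").decode("utf-32-be") == s
```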

When you talk about "something which also includes the alphabets of later discovered nearby galaxies' civilizations", you are implicitly talking about superseding or extending the Unicode character set itself.

If your system can "only" represent the code points of Unicode as we know it today (regardless of how it represents them), it would need to be upgraded if you needed to store these extra alphabets. It's impossible to say what that would involve; the new system might be cleverly designed to allow easy upgrades, or it might be that we adopt a system from Alpha Centauri and all Unicode text needs careful re-processing into their system. At that point, whether you picked UTF-8 or UTF-EBCDIC to store your Unicode would feel like an irrelevant detail.

If Unicode is not superseded or extended, any system capable of storing all Unicode code points will remain capable of storing them. So the theoretical limitations of UTF-8 in particular are not to do with what it can store, but how convenient it is to work with.

Currently, UTF-8 is the most popular encoding scheme, for various reasons - it has backwards-compatibility with ASCII, is compact when storing text containing mostly Latin characters, and works in multiples of 8 bits. Consequently, there are many tools for working with it - the virtuous cycle of standardisation. However, a new encoding scheme might become popular due to changes in common requirements - for instance, given an extremely "wide" memory, you could allocate a fixed width for each grapheme (i.e. even wider than the 32 bits required to fix the width of each code point).

If that happened, we can see what the upgrade would look like - you would need to convert your UTF-8 text to and from this encoding scheme in order to use tools built for it, which might be slow. But if you were still representing Unicode code points, such a transform is guaranteed to be possible without losing any data in either direction.

IMSoP
5

I'm frankly confused about why UTF-16 and UTF-32 etc. exist at all

UTF-16 exists because Unicode was originally supposed to be a fixed-width 16-bit encoding, and many systems were designed during that era and later needed to be retrofitted to support more characters. These aren't niche systems or systems on their way out; they are major current technologies like Windows, .NET, Java, and Qt.
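
As an illustration (my own Python sketch, not from the answer), a code point outside the original 16-bit range needs two UTF-16 code units, a surrogate pair, which is exactly the retrofit described above.

```python
ch = "😀"                            # U+1F600, beyond the original 16-bit design
utf16 = ch.encode("utf-16-be")
assert len(utf16) == 4               # two 16-bit code units, not one
assert utf16 == b"\xd8\x3d\xde\x00"  # high surrogate D83D + low surrogate DE00
```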

UTF-32 exists because some people think it's easier if each code point is stored in a fixed-size unit. IMO this advantage is largely illusory, as there is not a 1:1 mapping between Unicode code points and what users would call characters (for example, most users would say that "Spın̈al Tap" has 10 characters, but it requires 11 Unicode code points to represent), yet it nevertheless persists as a perceived advantage.
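
To see the mismatch concretely (my own Python sketch), the "n̈" in the example is two code points, an 'n' plus a combining diaeresis, so even fixed-width UTF-32 does not give one unit per user-perceived character.

```python
import unicodedata

s = "Spın\u0308al Tap"               # 'n' followed by U+0308 COMBINING DIAERESIS
assert len(s) == 11                                      # 11 code points ...
assert len(s.encode("utf-32-be")) == 11 * 4              # ... and 11 fixed-width UTF-32 units
assert unicodedata.name(s[4]) == "COMBINING DIAERESIS"   # the extra code point behind 'n̈'
```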

I don't think either of these encodings is going to go away any time soon. So if your system interacts widely with other systems, you are likely to end up dealing with other Unicode encodings sooner or later.

Will UTF-8 ever become "obsolete" and replaced by "UniversalCode-128" or something, which also includes the alphabets of later discovered nearby galaxies' civilizations?

If we ever establish meaningful contact with intelligent alien life then some decisions would need to be made as to how to represent their languages on our computers and vice-versa. That could eventually mean switching away from computing standards as we know them today to a new set of interplanetary standards.

I think realistically, though, it's incredibly unlikely that will happen. IMO, even if alien life exists and even if we discover it, it would be impractical to establish meaningful communications without faster-than-light communication and/or travel, and that would mean breaking physics as we know it.

Assuming we don't establish contact with aliens, and assuming we keep using computers that resemble those we use today, it seems unlikely that our text representation systems will change radically. It's possible that the code point space will be expanded at some point, but I think it's more likely that greater use will be made of combining characters, variation selectors, etc., so that new languages can be represented with fewer code point allocations.

Peter Green
2

There is a theoretical possibility that over a million code points might not be enough. This is made less likely by the fact that characters can be composed from more than one code point, so we could easily reserve one of the 17 planes for "intergalactic languages", where the first code point specifies one of ~65,000 languages and the second a character in that language.
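
A rough sketch (entirely hypothetical, purely to show the arithmetic behind the suggestion): using the code points of one reserved 65,536-slot plane pairwise gives about 4.3 billion language/character combinations.

```python
# Hypothetical scheme: one reserved plane, its code points used pairwise,
# first = language index, second = character index within that language.
PLANE_START = 0xF0000   # arbitrary illustrative choice of plane
PLANE_SIZE = 0x10000

def encode_pair(language: int, character: int) -> str:
    """Two code points from the reserved plane stand for one foreign character."""
    assert 0 <= language < PLANE_SIZE and 0 <= character < PLANE_SIZE
    return chr(PLANE_START + language) + chr(PLANE_START + character)

def decode_pair(pair: str) -> tuple[int, int]:
    return ord(pair[0]) - PLANE_START, ord(pair[1]) - PLANE_START

assert decode_pair(encode_pair(42, 1234)) == (42, 1234)
print(PLANE_SIZE * PLANE_SIZE)   # 4,294,967,296 possible language/character pairs
```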

We might extend UTF-8 to 5-byte characters, but that would break lots of current code that correctly expects at most 4 bytes.
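
For instance (my own sketch), today's decoders reject the old 5-byte lead bytes outright, which is exactly the breakage being described.

```python
# 0xF8 begins a 5-byte sequence in the original UTF-8 scheme, but modern
# decoders cap sequences at 4 bytes / U+10FFFF and refuse it.
five_byte_seq = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
try:
    five_byte_seq.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err)   # 'utf-8' codec ... invalid start byte
```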

Why more than 65,536 code points? That turned out to be not enough once more and more languages were added and the Chinese and Japanese character repertoires became more complete. Using multiple code points to extend the character set as I suggested would be a desperate measure, and even a few non-terrestrial civilisations would likely not require it. Doing this without need would be very wrong. There are emoji consisting of many code points, but there is a good reason for that.

gnasher729
1

1. Unicode is the standard in all fields

Unicode is the unwreckable standard, and multi-byte UTF-8, with its ASCII subset, is for most purposes (like HTML) the most compact encoding, even for Asian scripts when mingled with plain Latin script.

Two-byte UTF-16 has a fixed-size advantage: any 256-byte block from a file forms 128 UTF-16 code units, whereas a UTF-8 block boundary could cut a multi-byte sequence in half. However, UTF-16 is a historical error: Unicode grew beyond the 16-bit range, and many Unicode code points now need two UTF-16 code units. So its fixed-size advantage is moot.

UTF-32, four bytes per code point, is natural, though Unicode is still within the 3-byte range and will be for some time. So it is guaranteed to waste at least ¼ of the space, and as much as ¾ for plain Latin script.

UTF-8, UTF-16 and UTF-32 do not really compete. In Java, for example, char is a UTF-16 code unit, while String literals are stored in the .class file as (modified) UTF-8. The latest Java versions even allow a String, whose text is Unicode, to be stored internally in, say, ISO-8859-1 when possible.

UTF-8 will be the main Unicode Transformation Format for text files.

2. Unicode has flaws

Unicode might be the Esperanto of encodings (with clever features), but it is not without flaws. The main one is that there can be different code point sequences for what is in principle the same text; there is no single canonical form in use. So é can be one code point, or two: e followed by a combining acute accent. Java offers conversion between the forms via java.text.Normalizer.
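
To illustrate (a sketch of my own using Python's unicodedata, which plays the same role here as java.text.Normalizer): the composed and decomposed forms look identical on screen but compare unequal until they are normalized.

```python
import unicodedata

composed   = "\u00e9"     # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed                           # different code point sequences
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```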

Another (minor) flaw: under Windows one can detect that a file's content is not in UTF-8 by reading it, but without reading the content that is not feasible if the encoding is a priori unknown. That would hold for any other universal encoding too.

These flaws do not imply a future demise of Unicode, though it will not be without glitches. There might come a time when a canonical form of Unicode becomes obligatory, requiring a conversion of existing UTF-8 to, say, a "UTF-8C".

3. Chaotic Changes Possible

  • An "UTF-24" could be more politically correct, as with UTF-8 Asian scripts have a serious disadvantage.
  • A redesign of Unicode itself seems academically interesting, and could find its proponents, people favoring something new.

This is counter-balanced by the sheer amount of existing UTF-8 data: XML in general, JSON, Linux's general use of UTF-8, and Windows' multiple single-byte code pages (which make UTF-8/UTF-16 the lingua franca for portable text in many applications).

Conclusion

There is no reason to fear a demise of UTF-8.

I was one of the earlier adopters of UTF-8 in programming, and I now keep my projects in UTF-8.

Joop Eggen
1

UTF-8 is an elegant way to encode a large range of numbers with a variable number of 8-bit bytes. As long as we don't need more characters than it can represent (unlikely unless the people of year 3000 write entirely with emojis that don't exist yet), there really isn't much reason to switch to another encoding. There is far too much momentum in English-centric computing to warrant an encoding that prioritizes other languages...

...That is, unless we encounter intelligent life and start having to integrate our information systems with theirs. All bets are off at that point. There's no guarantee they chose 8-bit clusters as their primary computing data unit. There's also no guarantee they're using binary or electricity as their primary means of computing. But even if they also used 8-bit bytes with 1 mapped to high voltage and 0 mapped to low, the probability that they created the exact same symbols and corresponding bit encodings is so low that calling it astronomically unlikely wouldn't cut it.

At that point, there will inevitably be a long negotiation process to develop standard codepoints, hopefully with some ability to bring in more intelligent species' languages later. During this time, there will be dozens of competing standards and the growing pains of changing encodings like those that characterized the 90s and early 00s. After a couple decades, humans and aliens will have it figured out and produce a standard that encodes both species' symbols without undue preferential treatment. A decade or two later, most new software will use that encoding.

Maybe then, we'll finally get rid of the wealth of obsolete ASCII control codes and reassign uppercase letters to higher codepoints in order to make room for alien letters.

Even still, there's always a chance that UTF-8 will still work as a multibyte encoding after the addition of an alien race or two. The main difference is that it won't correspond to the same Unicode assignments. We could also potentially keep our own encodings and then have translation layers between them. Anything can happen with aliens.

Beefster