+ 31

Utf

What is the difference between UTF-8 and UTF-16?

27th Sep 2020, 12:19 PM

JAY • ≫

17 Answers

+ 40

Part 1

Both UTF-8 and UTF-16 are variable-length encodings. UTF-8 works in 8-bit code units, so a character takes 1 to 4 bytes. UTF-16 works in 16-bit code units, so a character takes 2 or 4 bytes.

Main UTF-8 pros:

Basic ASCII characters (digits, unaccented Latin letters, etc.) occupy one byte each, identical to their US-ASCII representation. Every US-ASCII string is therefore valid UTF-8, which provides decent backwards compatibility in many cases.

No null bytes appear unless the text itself contains U+0000, so null-terminated strings keep working. This adds a great deal of backwards compatibility too.

UTF-8 is independent of byte order, so you don't have to worry about big-endian / little-endian issues.

Main UTF-8 cons:

Many common characters have different lengths, so indexing by code point and counting code points require scanning the text, which is much slower than simple arithmetic on a fixed-width encoding.

Even though byte order doesn't matter, UTF-8 text sometimes still starts with a BOM (byte order mark), which serves only to signal that the text is encoded in UTF-8.
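For anyone who wants to check these points, here is a minimal Python sketch. It is only an illustration and not part of the answer above; the sample characters and variable names are arbitrary choices.

```python
# Sample characters: 'A', e-acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# UTF-8 uses between 1 and 4 bytes per character.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "Hello 123"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# No zero byte appears unless the text itself contains U+0000.
mixed_text = "Hello, \u043c\u0438\u0440, \u4e16\u754c"
has_null = 0 in mixed_text.encode("utf-8")
print(has_null)  # False

# The optional UTF-8 BOM is the three bytes EF BB BF.
# Python's "utf-8-sig" codec writes it when encoding.
bom_bytes = "A".encode("utf-8-sig")
print(bom_bytes)  # b'\xef\xbb\xbfA'
```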

27th Sep 2020, 12:34 PM

Raj Srivastava

+ 25

Part 2

Main UTF-16 pros:

BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory), and most Japanese, can be represented with 2 bytes. This speeds up indexing and counting code points as long as the text contains no supplementary characters.

Even if the text has supplementary characters, they are still represented by pairs of 16-bit values (surrogate pairs). The total byte length is therefore always divisible by two, which allows a 16-bit char to be used as the primitive component of the string.

Main UTF-16 cons:

US-ASCII strings contain lots of null bytes, which means byte-oriented null-terminated strings don't work and a lot of memory is wasted on ASCII-heavy text.

Treating it as a fixed-length encoding "mostly works" in many common scenarios (especially in the US, the EU, countries with Cyrillic alphabets, Israel, Arab countries, Iran, and many others), which often leads to broken support for the text where that assumption doesn't hold.
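The surrogate-pair and null-byte points can be seen in a few lines of Python. This is a rough sketch rather than anything from the answer itself; `utf16_units` is a hypothetical helper name, and the sample strings are arbitrary.

```python
def utf16_units(text):
    """Return how many 16-bit code units UTF-16 needs for this text."""
    return len(text.encode("utf-16-le")) // 2

bmp_text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # six Cyrillic letters, all in the BMP
astral_text = "a\U0001F600"                        # 'a' plus one supplementary character

# Inside the BMP, code point count and code unit count agree.
print(len(bmp_text), utf16_units(bmp_text))        # 6 6

# One supplementary character takes two code units (a surrogate pair).
print(len(astral_text), utf16_units(astral_text))  # 2 3

# The surrogate pair for U+1F600 is D83D DE00.
pair_hex = "\U0001F600".encode("utf-16-be").hex()
print(pair_hex)  # d83dde00

# ASCII text in UTF-16: every second byte is zero.
ascii_utf16 = "Hi".encode("utf-16-le")
print(ascii_utf16)  # b'H\x00i\x00'
```

Code that indexes UTF-16 by code unit and assumes one unit per character would split `astral_text` in the middle of the pair, which is the "mostly works" trap.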

27th Sep 2020, 12:35 PM

Raj Srivastava

+ 19

Last words: In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because it has no BE/LE issue, it is ASCII-compatible, and null-termination often comes in handy. Not totally my words. Happy Coding </>
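To make the BE/LE point concrete, here is a small Python sketch (an illustration only, with an arbitrary sample string). The same text gives two different UTF-16 byte sequences depending on byte order, while UTF-8 has only one.

```python
import codecs

text = "A\u20ac"  # 'A' followed by the euro sign

# UTF-16 depends on byte order: the two bytes of each code unit are swapped.
le_hex = text.encode("utf-16-le").hex()
be_hex = text.encode("utf-16-be").hex()
print(le_hex)  # 4100ac20
print(be_hex)  # 004120ac

# UTF-8 has a single byte sequence regardless of platform.
utf8_hex = text.encode("utf-8").hex()
print(utf8_hex)  # 41e282ac

# With a BOM in front, the generic "utf-16" decoder can detect the byte order.
with_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
round_trip = with_bom.decode("utf-16")
print(round_trip == text)  # True
```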

27th Sep 2020, 12:35 PM

Raj Srivastava

+ 9

It is the encoding of a file: the mapping (input -> binary -> screen) that lets your laptop display text in different human languages or programming languages. For more, see https://en.m.wikipedia.org/wiki/Comparison_of_Unicode_encodings

27th Sep 2020, 12:27 PM

Ananiya Jemberu

+ 6

UTF-8 = variable length, in 8-bit units. UTF-16 = variable length, in 16-bit units. I hope this is helpful and easy to understand 😃😃👼

27th Sep 2020, 2:18 PM

Rishbabh Sharma

+ 5

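UTF-8 is identical to ASCII for the values from 0 to 127: each is stored as the same single byte. For higher values, the ranges below describe Unicode code point numbers, not UTF-8 bytes. Code points 128 to 159 are control codes rather than printable characters. Code points 160 to 255 are the same characters as in ANSI (Windows-1252) and ISO 8859-1, but UTF-8 stores each of them as two bytes. From code point 256 onward, Unicode continues with more than 100 000 different characters, which UTF-8 stores as two to four bytes each.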
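One detail worth checking here: these ranges refer to Unicode code point numbers, not to the bytes UTF-8 writes. A minimal Python sketch, using code point 233 (é) as an arbitrary example:

```python
ch = chr(233)  # U+00E9, the same character as value 233 in ISO 8859-1

# ISO 8859-1 stores it as the single byte E9.
latin1_bytes = ch.encode("latin-1")
print(latin1_bytes)  # b'\xe9'

# UTF-8 stores the same code point as two bytes.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)  # b'\xc3\xa9'

# Below 128, the ASCII and UTF-8 bytes really are identical.
ascii_match = chr(65).encode("ascii") == chr(65).encode("utf-8")
print(ascii_match)  # True
```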

28th Sep 2020, 4:14 PM

Michael Victor