What are UTF-8 and UTF-16? Working with Unicode encodings

UTF-8 and UTF-16 are the two most commonly used encodings for Unicode characters. Unicode defines a large character repertoire (about 1.1 million code points in theory, of which roughly 145k are assigned as of Unicode 14.0), which raises the question of how to encode all these characters. UTF-8 and UTF-16 are two of the encodings that Unicode defines, and the most popular ones today.

UTF-8 is a variable-length encoding that encodes each character in 1 to 4 bytes, with the standard ASCII repertoire encoded in 1 byte per character. This makes UTF-8 compact, but it is also a relatively complex encoding. UTF-16 is less complex and encodes most Unicode characters (and pretty much all in practical use today) in 2 bytes, with the others encoded in 4 bytes. This means UTF-16 takes up more space in most cases, but it is easier to encode and decode.

Since Unicode is very popular today, a lot of tooling has built-in support for Unicode and some of its encodings. In addition, there are standalone tools that can be used to inspect files and to convert them. We demonstrate two such tools, the Unix "od" and "iconv" commands, which let us take a close look at a demo file and convert it between the two encodings.

Additional Resources:
🎥 What is Unicode? How does it work and how do you use it? - https://www.youtube.com/watch?v=ngr0SIrfz6M
👉 Wikipedia: UTF-8 - https://en.wikipedia.org/wiki/UTF-8
👉 Wikipedia: UTF-16 - https://en.wikipedia.org/wiki/UTF-16
👉 Wikipedia: Unicode - https://en.wikipedia.org/wiki/Unicode
👉 Unicode Consortium - https://home.unicode.org/

00:00 Introduction
00:23 UTF-8 and UTF-16 are Text Encodings
00:55 Character Sets
01:26 Unicode as the universal character repertoire
02:27 UTF-8
03:25 UTF-16
04:09 Demo time: Starting with a demo file
04:50 od as a tool for dumping files
05:46 iconv for converting files
07:24 Summary
08:50 Wrap-up
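
The byte counts described above (1 to 4 bytes in UTF-8, 2 or 4 bytes in UTF-16) can be checked with a short sketch. This is not the demo from the video: the sample characters and the helper names `hex_bytes`, `describe`, and `convert_utf8_to_utf16` are assumptions chosen for illustration, and Python's built-in codecs stand in for "od" and "iconv". UTF-16 is shown in big-endian form without a byte order mark to keep the dumps easy to read.

```python
# Sample characters picked for illustration (not the video's demo file):
# an ASCII letter, a Latin-1 letter, the euro sign, and an emoji outside the BMP.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def hex_bytes(data):
    """Format bytes as space-separated hex, similar in spirit to an od hex dump."""
    return " ".join(f"{b:02x}" for b in data)


def describe(ch):
    """Return the UTF-8 and UTF-16 (big-endian, no BOM) bytes of one character as hex."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    return hex_bytes(utf8), hex_bytes(utf16)


def convert_utf8_to_utf16(data):
    """Decode UTF-8 bytes and re-encode them as UTF-16BE.

    Roughly the job a converter such as iconv performs when asked to go
    from UTF-8 to UTF-16BE.
    """
    return data.decode("utf-8").encode("utf-16-be")


if __name__ == "__main__":
    for ch in SAMPLES:
        u8, u16 = describe(ch)
        print(f"U+{ord(ch):04X}  UTF-8: {u8:<12} UTF-16: {u16}")
```

The ASCII letter takes one byte in UTF-8 and two in UTF-16, while the emoji takes four bytes in both encodings, since UTF-16 represents it as a surrogate pair. On a Unix system, a comparable conversion would typically look like `iconv -f UTF-8 -t UTF-16BE`, though the exact encoding names accepted can vary between iconv implementations.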