Is UTF-8 an encoding or a character set?

I thought that the name of the character set was "Unicode" and that "UTF-8" was the name of a particular encoding of the Unicode character set, but I often see the terms "encoding" and "charset" used interchangeably when referring to UTF-8.

For example,

<meta charset="UTF-8">

vs

<?xml version="1.0" encoding="UTF-8" ?>

Is UTF-8 an encoding or a character set?

UTF-8 is an encoding, and that is the term used in the RFC that defines it, which is quoted below.


I often see the terms "encoding" and "charset" used interchangeably

Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that covered only the characters in that alphabet. Thus, the terms encoding and charset were often conflated, but they mean different things.
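To make the pre-Unicode situation concrete, here is a minimal sketch using Python's built-in codecs. It assumes ISO-8859-5 and ISO-8859-7 as representative single-alphabet encodings (my choice of examples, not something the answer names) and decodes the same byte under each:

```python
# One byte value, interpreted under two legacy single-alphabet encodings.
raw = bytes([0xE1])

cyrillic = raw.decode("iso8859_5")  # ISO-8859-5 covers Cyrillic
greek = raw.decode("iso8859_7")     # ISO-8859-7 covers Greek

# The same byte yields a different character in each encoding.
print(f"U+{ord(cyrillic):04X}", cyrillic)
print(f"U+{ord(greek):04X}", greek)
```

Because each of these encodings came bundled with its own small repertoire of characters, naming the encoding also named the character set, which is probably how the two terms got tangled.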

Now though, Unicode is usually the only character set you need to worry about, since it contains characters for most written languages you'll have to deal with, except Klingon.

† - Alphabet: a kind of character set where characters correspond directly to sounds in a spoken language.


A character set is a mapping from integers (code points) to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set whose code points are integers that fit in 21 bits (0 to 0x10FFFF), each mapped to an abstract character. The Unicode Consortium's glossary describes it thus:

Unicode

  1. The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org .
  2. A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.
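The "integers to characters" view of a character set can be poked at from Python's standard library. This is only an illustrative sketch; the variable names are mine:

```python
import sys
import unicodedata

# The largest Unicode code point is U+10FFFF, which fits in 21 bits.
max_code_point = sys.maxunicode
bits_needed = max_code_point.bit_length()

# The character set maps an integer to an abstract, named character.
name_of_65 = unicodedata.name(chr(65))
name_of_euro = unicodedata.name(chr(0x20AC))

print(hex(max_code_point), bits_needed)
print(name_of_65)
print(name_of_euro)
```

Nothing here involves bytes yet: `chr` and `unicodedata.name` work purely at the level of code points.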

An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps between strings of bytes (8-bit integers) and strings of code points (21-bit integers). The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629.

The originally proposed encodings of the UCS, however, were not compatible with many current applications and protocols, and this has led to the development of UTF-8
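For a feel of what the mapping from code points to bytes looks like, here is a hand-rolled sketch of the bit layout that RFC 3629 describes. The function `utf8_encode_code_point` is a hypothetical helper written for illustration; in real code you would just call `str.encode("utf-8")`:

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point using the RFC 3629 bit layout (illustrative only)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample code point per encoded length (1, 2, 3 and 4 bytes).
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
encoded = [utf8_encode_code_point(cp) for cp in samples]

# Cross-check against Python's built-in codec.
matches_builtin = all(
    utf8_encode_code_point(cp) == chr(cp).encode("utf-8") for cp in samples
)
print(encoded, matches_builtin)
```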

UTF-8 is an encoding, in the sense that it encodes a sequence of abstract integers (the Unicode codepoints, which indicate abstract characters) into a sequence of bytes. (Through Unicode spectacles, you could say that a 'character set' such as ISO-8859-1 is also a table-driven 'encoding', in the sense that it encodes a small number of codepoints as bytes, but this is verging towards an abuse of terminology, and probably isn't very helpful.)
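The parenthetical about ISO-8859-1 can be checked with a small sketch, assuming Python's `iso8859_1` codec as a stand-in for the character set in question:

```python
# Every byte value 0..255 decodes under ISO-8859-1 to the code point
# with the same number.
decoded = bytes(range(256)).decode("iso8859_1")
identity = [ord(ch) for ch in decoded] == list(range(256))

# Code points outside that small range have no ISO-8859-1 byte at all.
try:
    "\u20ac".encode("iso8859_1")  # EURO SIGN, U+20AC
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(identity, euro_fits)
```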

The sequence of integers is (in some fundamental sense) the 'unicode string', but in order to save these on a disk or send them over a network, you need to encode them as a sequence of bytes. UTF-8 is one way of doing that, UTF-16 is another: one unicode string will be represented as two different streams of bytes if it's encoded in the two different ways.
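A short sketch of that last point, using an arbitrary three-character string of my own choosing and Python's built-in `utf-8` and `utf-16-le` codecs:

```python
text = "A\u00e9\u20ac"  # 'A', 'é', '€'

# The 'unicode string' as a sequence of integers.
code_points = [ord(ch) for ch in text]

# Two encodings of the same sequence give two different byte streams.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

print([hex(cp) for cp in code_points])
print(as_utf8.hex(" "))
print(as_utf16.hex(" "))
```

Decoding either byte stream with the matching codec gets back the same sequence of codepoints.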


There are multiple fine answers here, but just yesterday I spent some time trying to boil this issue down to some minimal size, so this provides a happy opportunity to reuse that text:

Joel Spolsky's article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is quite good, I think. It's (surely) been mentioned here before, but it bears repeating. I think it's not completely minimal, though.

On the couple of occasions when I've had to explain 'unicode' to a colleague, it's been the notion of the abstract Unicode codepoints that's turned out to be the key to the illumination. The structure of my successful explanations has been something like this:

  • The Unicode Consortium has (with much agonising and negotiation) managed to give a number to a large fraction of the characters in use. These numbers are (jargon) called 'codepoints'.

  • 'The Letter A' has a codepoint, and this is independent of fonts. Thus 'A' and 'a' have different codepoints, but roman, bold, italic, serif, sans serif (et very much cetera) are not distinguished. Japanese kanji have codepoints, and even tengwar and Klingon characters have them, though only through unofficial assignments in the Private Use Area (this gets attention).

  • A 'unicode string' is (conceptually) a sequence of codepoints. This is a sequence of mathematical integers. It does not make sense to ask whether these are bytes, 2-byte or 4-byte words; the sequence has nothing to do with computers.

  • If, however, you want to send that sequence of integers to someone, or save it on a computer disk, you have to do something to encode it. You could also write down the sequence of numbers on a piece of paper, but let's specialise to computers at this point. If you want to store or send this on a computer, you have to transform these integers into a sequence of bytes. There are multiple procedures for doing that, and each of these procedures is named an 'encoding'. One of these 'encodings' is UTF-8.

  • When you 'read a Unicode file', you are starting with a sequence of bytes, on disk, and conceptually ending up with a sequence of integers. If the 'unicode file' is indicated, somehow, to be encoded in UTF-8, then you have to decode that sequence of bytes to get the sequence of integers, using the algorithm defined in RFC 3629. All of the subsequent operations on the 'unicode string' are defined in terms of the sequence of codepoints, and the fact that it started off, on disk, as 'UTF-8' is forgotten.
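The steps in the list above can be walked through in a few lines of Python. This is a sketch under my own assumptions: the codepoints are arbitrary, and the file name `example.txt` inside a temporary directory is made up for the demonstration:

```python
import os
import tempfile

# Start from an abstract sequence of integers (codepoints).
code_points = [0x48, 0x69, 0x20, 0x3A9]  # 'H', 'i', ' ', GREEK CAPITAL OMEGA
text = "".join(chr(cp) for cp in code_points)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")  # hypothetical file name
    # Saving to disk forces a choice of encoding: integers become bytes.
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
    # Reading starts from bytes again.
    with open(path, "rb") as f:
        raw = f.read()

# Decoding recovers the sequence of integers; after this point the
# fact that the bytes were UTF-8 no longer matters.
decoded = raw.decode("utf-8")
recovered = [ord(ch) for ch in decoded]

print(raw.hex(" "))
print(recovered == code_points)
```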

The Unicode Standard calls it an encoding form or an encoding scheme. Unicode has a single set of characters (known as the Unicode character set, or Universal Character Set), and all the UTF encoding forms and encoding schemes can encode all the characters in that set.

As happens with many other terms, programmers seem to have a tendency to misappropriate terms here and there, and this is just one more instance of that.

UTF-8 is an encoding. Encodings are, however, often called character sets, and many protocols therefore use the parameter name charset for a parameter that specifies the character encoding. As such, charset is just an identifier.
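As an illustration of charset being just a label, here is a sketch that pulls the `charset` parameter out of a MIME-style `Content-Type` header with Python's standard `email` package and uses it to pick a decoder. The header value and the body bytes are invented for the example:

```python
import codecs
from email.message import Message

# A header of the kind used by HTTP and email (made-up example value).
msg = Message()
msg["Content-Type"] = 'text/html; charset="UTF-8"'

# The parameter is called 'charset', but its value names an encoding.
label = msg.get_content_charset()
codec = codecs.lookup(label)

# The label is only used to select the byte-to-codepoint decoder.
body = b"caf\xc3\xa9".decode(codec.name)
print(label, codec.name, body)
```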

In the sources that define it, UTF-8 is named an encoding, not a charset, period.

However, it was defined by the Unicode Standard primarily to encode the Unicode charset. Just check what the UTF acronym means: Unicode Transformation Format. It even gives some backward compatibility with an earlier charset, ASCII. So from a practical point of view, it would be very unusual to use UTF-8 to encode a charset other than Unicode.

This might be the root of the inaccurate use of UTF-8 as a charset in some contexts.
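The ASCII compatibility mentioned above is easy to see in a quick sketch (the sample strings are arbitrary):

```python
ascii_text = "plain ASCII text"

# Pure-ASCII text produces identical bytes under both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character is encoded using only bytes >= 0x80, so it can
# never be mistaken for an ASCII character in the byte stream.
non_ascii = "\u00e9".encode("utf-8")
all_high = all(b >= 0x80 for b in non_ascii)

print(same_bytes, non_ascii.hex(" "), all_high)
```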
