
If <meta charset="utf-8"> means that JavaScript is using UTF-8 encoding instead of UTF-16

Original post: 2018-07-23 22:28:08. Tags: javascript, html, encoding, utf-8, character-encoding

I have been trying to understand why the need for encoding/decoding to UTF-8 comes up all over the place in JavaScript land, and learned that JavaScript uses UTF-16 encoding.

Let's talk about Javascript string encoding

So I'm assuming that's why a library such as utf8.js exists, to convert between UTF-16 and UTF-8.

But then at the end he provides some insights:

Encoding in Node is extremely confusing, and difficult to get right. It helps, though, when you realize that Javascript string types will always be encoded as UTF-16, and most of the other places strings in RAM interact with sockets, files, or byte arrays, the string gets re-encoded as UTF-8.

This is all massively inefficient, of course. Most strings are representable as UTF-8, and using two bytes to represent their characters means you are using more memory than you need to, as well as paying an O(n) tax to re-encode the string any time you encounter a HTTP or filesystem boundary.

That reminded me of the <meta charset="utf-8"> in the HTML <head>, which I never really thought too much about, other than "you need this to get text working properly".

Now I'm wondering, and this is what the question is about, whether that <meta charset="utf-8"> tag tells JavaScript to use UTF-8 encoding. That would mean that when you create strings in JavaScript, they would be UTF-8 encoded rather than UTF-16. Or, if I'm wrong there, what exactly is it doing? If it is telling JavaScript to use UTF-8 encoding instead of UTF-16 (which I guess would be considered the "default"), then you wouldn't need to pay that O(n) tax for conversions between UTF-8 and UTF-16, which would mean a performance improvement. I'm wondering whether I am understanding this correctly and, if not, what I am missing.

3 answers

Charset in meta

The <meta charset="utf-8"> tag tells HTML (less sloppily: the HTML parser) that the encoding of the page is UTF-8.
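As a rough illustration of what "the encoding of the page" means to a parser, the sketch below decodes the same two bytes under two different encoding labels with the standard `TextDecoder` API. The byte values and variable names are made up for this example, and a real HTML parser does far more than this; the point is only that the declared encoding decides how bytes become text.

```javascript
// Two bytes as they might arrive over the network (hypothetical example data).
const bytes = new Uint8Array([0xc3, 0xa9]);

// Read as UTF-8, the two bytes form a single character, U+00E9 ("é").
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// Read as UTF-16LE, the same two bytes form one different character, U+A9C3.
const asUtf16 = new TextDecoder("utf-16le").decode(bytes);

console.log(asUtf8.codePointAt(0).toString(16));
console.log(asUtf16.codePointAt(0).toString(16));
```

Either way the result is an ordinary JavaScript string; only the interpretation of the incoming bytes differs.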

JS does not have a built-in facility to switch the encoding of its strings: at the language level a string is always a sequence of UTF-16 code units.
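A small sketch of what "always UTF-16" looks like from inside a script, assuming an environment that provides the standard `TextEncoder` (browsers and current Node versions do). The sample string is arbitrary.

```javascript
// One ASCII letter plus one emoji (U+1F600) from outside the Basic Multilingual Plane.
const text = "a\u{1F600}";

// length counts UTF-16 code units: 1 for "a", 2 for the emoji's surrogate pair.
const codeUnits = text.length;

// Spreading a string iterates by code point, so the emoji counts once.
const codePoints = [...text].length;

// Getting UTF-8 is an explicit conversion that yields bytes, not a string.
const utf8Bytes = new TextEncoder().encode(text);

console.log(codeUnits, codePoints, utf8Bytes.length);
```

No page-level setting changes these numbers; UTF-8 only appears once the string is turned into bytes.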

Asymptotic bounds

I do not think that encoding conversions carry an asymptotic penalty. Whenever this kind of encoding change is due, there already is an O(n) operation: reading/writing the data stream. So any fixed number of operations on each octet would still be O(n). An encoding change requires local knowledge only, i.e. a look-ahead window of fixed length, and can thus be incorporated into the stream read/write code at a cost of O(1) per octet.
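To make the "fixed look-ahead" argument concrete, here is a hand-written single-pass converter from UTF-16 code units to UTF-8 bytes. `utf16ToUtf8` is a hypothetical helper for illustration only; in practice one would call `TextEncoder`, and nothing here claims to mirror how any engine implements it.

```javascript
// Encode a JavaScript string (UTF-16 code units) as UTF-8 bytes in one pass.
function utf16ToUtf8(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);

    // Look ahead at most one code unit to combine a surrogate pair.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++; // the low surrogate has been consumed
      }
    }

    // A lone surrogate has no UTF-8 form; substitute U+FFFD.
    if (cp >= 0xd800 && cp <= 0xdfff) cp = 0xfffd;

    // Emit 1 to 4 bytes depending on the size of the code point.
    if (cp < 0x80) {
      out.push(cp);
    } else if (cp < 0x800) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return Uint8Array.from(out);
}

console.log(Array.from(utf16ToUtf8("h\u00e9llo")));
```

Each code unit is visited once and the loop never looks more than one unit ahead, so the work per unit is bounded by a constant.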

You could argue that the space penalty is O(n), though if the string has to be stored in some standard encoding anyway (i.e. without compression), the move to UTF-16 means a factor of 2 at most, thus staying within the O(n) bound.

Constant factors

Even if the concern is minimizing the constant factors hidden in the O(n) notation, encoding changes have only a modest impact, in the time domain at least. Writing/reading a UTF-16 stream as UTF-8 means, for most (Western) textual data, skipping every second octet / inserting null octets. That performance hit pales in comparison with the overhead and latency of interfacing with a socket or the file system.
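The "skip every second octet" remark can be checked by hand for ASCII text. The sketch below serializes a string's code units as little-endian bytes and compares them with the UTF-8 bytes; `utf16leBytes` is a hypothetical helper written out so the byte layout is visible, and the sample word is arbitrary.

```javascript
// Serialize a string's UTF-16 code units as little-endian bytes (low octet first).
function utf16leBytes(str) {
  const out = new Uint8Array(str.length * 2);
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    out[2 * i] = unit & 0xff; // low octet
    out[2 * i + 1] = unit >> 8; // high octet, zero for ASCII
  }
  return out;
}

const sample = "meta";
const utf16 = utf16leBytes(sample);
const utf8 = new TextEncoder().encode(sample);

// For ASCII-only text, keeping every second octet of UTF-16LE gives the UTF-8 bytes.
const everyOther = utf16.filter((_, i) => i % 2 === 0);

console.log(Array.from(utf16));
console.log(Array.from(utf8));
console.log(Array.from(everyOther));
```

This shortcut holds only for the ASCII range; any other character needs the bit shuffling that a real encoder performs.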

Storage is different, of course, though storage is comparatively cheap today and the upper bound of 2 still holds. The move from 32 to 64 bits has a higher memory impact with respect to number representations and pointers.

JavaScript uses UTF-16

HTML5 uses UTF-8

Your meta tag setting applies to the encoding of the HTML document. It is often treated as optional because HTML5 documents are expected to be UTF-8, though declaring it saves the browser from having to guess. It has nothing to do with JavaScript's string encoding, however, and does not change or affect JavaScript, except that it tells the browser to decode your page as UTF-8.

Most modern JavaScript engines do read and decode UTF-8 scripts, HTML markup, and page text into UTF-16. But for speed and other reasons, they often store the ASCII range (English letters and digits) in its native form, that is, as one byte per character, just as UTF-8 or your web page does. It's not a hard and fast rule. So HTML tags read and stored by JavaScript in, say, Chrome's V8 engine might still be stored as one byte per character, not as UTF-16.

What happens under the covers of these scripting engines, with most ASCII characters stored as single bytes as in UTF-8, isn't something you should worry about. You only run into issues when streaming characters from the higher Unicode "planes". The UTF-16 characteristics of JavaScript storage and encoding can vary, I have read. In my opinion, it's not something most web developers need to worry about until they get into languages written in higher Unicode ranges and character-level manipulation in JavaScript. That is where Node and many open source engines have struggled with decoding and encoding UTF-8 and UTF-16, because of their reliance on JavaScript engines.

Again, because everything is moving towards UTF-8 now (where 1 to 4 bytes are used to encode the complete Unicode character set, versus UTF-16, which starts at 2 bytes per character and goes up to 4), you will see JavaScript handle all that decoding of UTF-8 into UTF-16 and back out as a pretty seamless process, with lots of contingencies in place.
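The byte counts mentioned above can be measured directly. This is a minimal sketch with four arbitrarily chosen sample characters, assuming the standard `TextEncoder` is available; the UTF-16 size is derived from the string's code unit count at 2 bytes per unit.

```javascript
// Sample characters: A (ASCII), é (U+00E9), € (U+20AC), and an emoji (U+1F600).
const samples = ["A", "\u00e9", "\u20ac", "\u{1F600}"];
const encoder = new TextEncoder();

const sizes = samples.map((ch) => ({
  char: ch,
  utf8Bytes: encoder.encode(ch).length, // 1 to 4 bytes per character
  utf16Bytes: ch.length * 2, // 2 bytes per code unit, so 2 or 4 per character
}));

for (const row of sizes) {
  console.log(row.char, row.utf8Bytes, row.utf16Bytes);
}
```

For the euro sign UTF-16 is actually the smaller of the two, so which encoding is more compact depends on the text.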

BTW, the way browsers figure out the encoding of your JavaScript files is as follows. The browser first looks at the MIME type ("Content-Type") and charset in the HTTP header coming from the server to see what the web page's files should be decoded from. As mentioned, that's almost always UTF-8 now in HTML5. If it cannot determine the encoding that way, it next checks your <script> tag and its type and charset attributes to see whether an encoding has been set for your JavaScript source file. In most cases those are missing. Lastly, it checks the web page's meta tag charset, which is usually UTF-8; if HTML5 is used, it assumes UTF-8. There is also the "byte order mark" on the script file, which likely indicates UTF-8. Even if it's encoded in ASCII or, say, Latin-1, those characters map directly onto Unicode code points anyway. Once the encoding is known, the engine decodes the bytes and re-encodes them into its own 2-byte units, as mentioned above.

At the end of the day the engines do a good job of negotiating all this for you.

Re 'meta charset="utf-8"': another sign of how sloppy the standards bodies that build the web can be. This has nothing whatsoever to do with character sets. It's about encodings of glyphs. A character set is more closely related to an alphabet or a language than to an encoding. HTML got it as wrong as you can get.