
What is difference between UTF-8 and HTML entities?


UTF-8 is an encoding scheme that maps Unicode code points to sequences of bytes.

HTML entities provide a way to express many characters using only a standard (usually ASCII) character space. They also keep those characters human readable when UTF-8 is not available.

The main purpose of HTML entities today is to make sure text that looks like HTML renders as text. For example, the less-than and greater-than signs ( &lt; or &gt; ), when placed in a certain order (i.e. <text>), can accidentally render as HTML when the intent was for them to render as text.

Think of UTF-8 as a lossless, self-synchronizing way to map a list of natural numbers to a bytestream: you can get the natural numbers back (lossless), and if you land 'in the middle' of the stream you can still find the start of the next number (self-synchronizing).

Each natural number just happens to represent a 'character'.
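A minimal sketch of both properties, assuming Python's built-in UTF-8 codec. The helper name `resync` is made up for this illustration; it relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def resync(data: bytes, start: int) -> int:
    """Return the first index >= start where a new UTF-8 sequence begins."""
    i = start
    # Continuation bytes always look like 10xxxxxx, so skip over them.
    while i < len(data) and (data[i] & 0b1100_0000) == 0b1000_0000:
        i += 1
    return i


text = "a€b"                       # natural numbers 97, 8364, 98
stream = text.encode("utf-8")      # five bytes: 61 E2 82 AC 62

# Lossless: decoding gives the same numbers back.
code_points = [ord(ch) for ch in stream.decode("utf-8")]

# Self-synchronizing: start in the middle of the 3-byte sequence for '€'
# (index 2) and skip forward to the next sequence boundary.
next_start = resync(stream, 2)
tail = stream[next_start:].decode("utf-8")
```

Here the damaged '€' is lost, but everything after it decodes normally.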

HTML entities are a way to represent those same natural numbers in a form like &#127; , which stands for the natural number 127, in Unicode the DEL character.

In UTF-8 that's the bytestream: 0111 1111

Once you go above 127 it takes more than one octet; 128, for example, becomes: 1100 0010 1000 0000 .

Two DEL chars in a row become 0111 1111 0111 1111 . UTF-8 is designed so that it is always possible to retrieve the original list of 'Unicode scalar values' from the bytestream, even though a bytestream of, for instance, 4 octets can map back to anywhere between 1 and 4 such scalar values. UTF-8 is thus 'variable length', as they call it.
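To check bit patterns like these yourself, here is a small sketch using Python's built-in codec. The helper `utf8_bits` is a name invented for this example, not part of any library.

```python
def utf8_bits(scalar: int) -> str:
    """Format the UTF-8 encoding of one scalar value as groups of 4 bits."""
    encoded = chr(scalar).encode("utf-8")
    return " ".join(f"{(byte >> 4):04b} {(byte & 0x0F):04b}" for byte in encoded)


bits_del = utf8_bits(127)   # one octet:  0111 1111
bits_128 = utf8_bits(128)   # two octets: 1100 0010 1000 0000

# The same number of octets can hold different numbers of scalar values.
four_dels = bytes([0x7F] * 4).decode("utf-8")   # 4 octets, 4 scalar values
one_emoji = chr(0x1F600).encode("utf-8")        # 4 octets, 1 scalar value
```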

The "A" you see here on screen is not actually stored as "A" in the computer, it's rather a sequence of 1's and 0's. A character set or encoding specifies a way to encode characters in such a way. The ASCII character set only includes a handful of characters it can encode, almost exclusively limited to characters of the English language. But for historical reasons and technical limitations of the time, it used to be the character set of the internet (very early on).

Both UTF-8 and HTML entities can be used to encode characters that are not part of ASCII. HTML entities achieve this by giving a special meaning to special sequences of characters. Using them you can encode characters not covered by ASCII using only ASCII characters. UTF-8 (Unicode) does the same by simply extending the character set to include more characters. HTML entities are only "valid" in an environment that bothers to decode them, which is usually a browser. UTF-8 characters are universal in any application that supports the character set.

Text containing only characters covered by ASCII:

Price: $20 (UTF-8)
Price: $20 (ASCII with HTML entities)

Text containing European characters not covered by ASCII:

Beträge: 20€ (UTF-8)
Betr&auml;ge: 20&euro; (ASCII with HTML entities)

Text containing Asian characters, most certainly not covered by ASCII:

値段:二千円 (UTF-8)
&#x5024;&#x6BB5;&#xFF1A;&#x4E8C;&#x5343;&#x5186; (ASCII with HTML entities)
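The two representations can be converted into each other. The following is only a sketch, assuming Python's standard `html` module and the built-in `xmlcharrefreplace` error handler, which writes non-ASCII characters as decimal numeric references rather than named entities such as &auml; .

```python
import html

entity_text = "Betr&auml;ge: 20&euro;"

# Decode the entities into real Unicode characters.
decoded = html.unescape(entity_text)

# Store the text as UTF-8: 'ä' takes 2 bytes and '€' takes 3.
utf8_bytes = decoded.encode("utf-8")

# Go the other way: pure ASCII, non-ASCII characters as numeric references.
ascii_refs = decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Hexadecimal references decode the same way.
japanese = html.unescape("&#x5024;&#x6BB5;&#xFF1A;&#x4E8C;&#x5343;&#x5186;")
```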

The problem with UTF-8 is that the client needs to understand UTF-8. For the last decade or so this has been of no concern though, as all modern computers and browsers have no problem understanding UTF-8. UTF-8 (Unicode) can encode virtually all characters in use today on this planet (with minor exceptions). Using it you can work with text "as-is". It should absolutely be the preferred encoding to save text in.

The problem with HTML entities is that normal characters take on a special meaning. When writing &auml; , it takes on the special meaning of "ä". If you actually intend to write "&auml;", you need to double encode the sequence as &amp;auml; .
HTML entities are also notoriously unreadable. You do not want to use them to encode "special" characters in normal text. In this capacity they're a kludge bolted onto an inadequate character set. Use Unicode instead.
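The double-encoding step can be illustrated with Python's standard `html` module (an assumption for this sketch; any HTML escaping routine should behave similarly for the ampersand).

```python
import html

literal = "&auml;"                     # the six characters you want readers to see

as_entity = html.unescape(literal)     # a decoder turns it into 'ä'
escaped = html.escape(literal)         # escape the ampersand first
round_trip = html.unescape(escaped)    # now the decoder gives the literal back
```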

The important use of HTML entities that is independent of the character set used is to separate HTML markup from text. HTML as well gives special meaning to special character sequences. <b>text</b> is a normal sequence of characters, but it has a special meaning for HTML parsers. If you intended to just write "<b>text</b>", you will need to encode it as &lt;b&gt;text&lt;/b&gt; , so the HTML parser doesn't mistake it for HTML tags.
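A short sketch of that separation, assuming Python's `html.escape`; the variable names and the surrounding `<p>` wrapper are made up for the example.

```python
import html

user_text = "<b>text</b>"              # meant to be shown literally

# Escaped, the markup characters become entities and render as plain text.
safe = html.escape(user_text)
page_fragment = f"<p>{safe}</p>"

# Unescaped, a browser would treat the input as real <b> tags.
unsafe_fragment = f"<p>{user_text}</p>"
```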

A ton. HTML entities are primarily intended to escape HTML markup so that it can be displayed as text inside HTML (so what you display does not get mixed up with the markup you output). For instance, &gt; outputs a >, while > closes a tag. While you can produce full Unicode with HTML entities, it is very inefficient and downright ugly.

UTF-8 is a multi-byte encoding for Unicode, which covers how to encode characters outside of the classic US ASCII code page without resorting to switching code pages and attempting to mix code pages. A single code point (think of it as a character, though that is not truly correct) takes between 1 and 4 bytes of data (the original design allowed up to 6, but the current standard stops at 4). It can represent any character in and outside of the basic multilingual plane (BMP), such as accented characters, East Asian characters, as well as Celtic tree writing (Ogham) amongst other character sets.
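The byte count per code point is easy to inspect. This is just a sketch using Python's built-in codec; the sample characters are my own picks, not from the answer.

```python
samples = {
    "LATIN CAPITAL A": "A",              # U+0041, ASCII
    "A WITH DIAERESIS": "\u00e4",        # U+00E4, accented Latin
    "OGHAM LETTER BEITH": "\u1681",      # U+1681, inside the BMP
    "CJK IDEOGRAPH": "\u5024",           # U+5024, inside the BMP
    "GOTHIC LETTER HWAIR": "\U00010348", # U+10348, outside the BMP
}

# Number of UTF-8 bytes needed for each single code point.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

# Even the highest Unicode code point fits in 4 bytes.
max_length = len(chr(0x10FFFF).encode("utf-8"))
```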

UTF-8 is an encoding; htmlentities is a function (in PHP) for making user input safe to display on the page, so that HTML tags are not added directly to the markup. See the manual.


 