
What is the difference between UTF-8 and HTML entities?

Original: 2010-05-18 18:51:18 7 5 php/ utf-8/ html-entities


5 answers

UTF-8 is an encoding scheme: it maps Unicode code points to sequences of bytes.

HTML entities provide a way to express many characters using only a standard (usually ASCII) character set. They also keep those characters readable when UTF-8 is not available.

The main purpose of HTML entities today is to make sure text that looks like HTML renders as text. For example, the less-than and greater-than signs (< and >), when placed in a certain order (e.g. <text>), can accidentally be parsed as HTML when the intent was for them to render as text. Writing them as &lt; and &gt; avoids that.
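As a rough illustration of that escaping step, here is a minimal sketch in Python (not the PHP the question is tagged with). It uses the standard library's `html.escape` and `html.unescape`; the sample string and variable names are made up for the example.

```python
import html

# Text that merely looks like markup; the intent is to display it literally.
user_text = "if a <text> b & c > d"

# html.escape replaces &, < and > (and quotes, by default) with entities,
# so an HTML parser will treat the result as plain text.
escaped = html.escape(user_text)
print(escaped)

# html.unescape reverses the substitution, which is roughly what a browser
# does when it renders the escaped text.
round_trip = html.unescape(escaped)
print(round_trip == user_text)
```

Note that the escaping is independent of the character encoding: the same substitution is needed whether the page is served as ASCII or UTF-8.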

Think of UTF-8 as a lossless, self-synchronizing mapping from a list of natural numbers to a bytestream: you can get the original numbers back (lossless), and if you land in the middle of the stream you can still find the start of the next number (self-synchronizing).

Each natural number just happens to represent a 'character'.

HTML entities are a way to represent those same natural numbers in text. For example, &#127; stands for the natural number 127, which in Unicode is the DEL character.

In UTF-8 that's the bytestream: 0111 1111

Once you go above 127 it takes more than one octet, so 128 becomes: 1100 0010 1000 0000.

Two DEL characters in a row become 0111 1111 0111 1111. UTF-8 is designed so that it is always possible to recover the original list of Unicode scalar values from the bytestream, even though a bytestream of, for instance, 4 octets can map back to anywhere between 1 and 4 scalar values. UTF-8 is therefore called a 'variable-length' encoding.
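The bit patterns and the self-synchronizing property can be checked with a short Python sketch. The helper names `utf8_bits` and `next_boundary` are hypothetical, introduced only for this illustration; the only library call relied on is `str.encode("utf-8")`.

```python
def utf8_bits(code_point: int) -> str:
    """Return the UTF-8 encoding of one code point as space-separated octets in binary."""
    return " ".join(f"{byte:08b}" for byte in chr(code_point).encode("utf-8"))


def next_boundary(data: bytes, pos: int) -> int:
    """Skip continuation bytes (10xxxxxx) to find where the next character starts."""
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos


print(utf8_bits(127))     # one octet
print(utf8_bits(128))     # two octets
print(utf8_bits(0x20AC))  # euro sign, three octets

# Landing in the middle of a multi-byte character is recoverable:
stream = "a\u20acb".encode("utf-8")  # bytes 61 E2 82 AC 62
start = next_boundary(stream, 2)     # index 2 is inside the euro sign
print(stream[start:].decode("utf-8"))
```

Every continuation byte starts with the bits 10, so a decoder dropped at an arbitrary offset only has to skip those bytes to reach the next character boundary.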

The "A" you see here on screen is not actually stored as "A" in the computer; it is stored as a sequence of 1s and 0s. A character set or encoding specifies how characters are mapped to such sequences. The ASCII character set can only encode 128 characters, almost exclusively limited to those of the English language. But for historical reasons and technical limitations of the time, it used to be the character set of the internet (very early on).

Both UTF-8 and HTML entities can be used to encode characters that are not part of ASCII. HTML entities achieve this by giving a special meaning to special sequences of characters. Using it you can encode characters not covered by ASCII using only ASCII characters. UTF-8 (Unicode) does the same by simply extending the character set to include more characters. HTML entities are only "valid" in an environment where you bother to decode them, which is usually a browser. UTF-8 characters are universal in any application that supports the character set.

Text containing only characters covered by ASCII:

Price: $20 (UTF-8)
Price: $20 (ASCII with HTML entities)

Text containing European characters not covered by ASCII:

Beträge: 20€ (UTF-8)
Betr&auml;ge: 20&euro; (ASCII with HTML entities)

Text containing Asian characters, most certainly not covered by ASCII:

値段：二千円 (UTF-8)
&#20516;&#27573;&#65306;&#20108;&#21315;&#20870; (ASCII with HTML entities)
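The two representations in the examples above can be reproduced with a small Python sketch. The helper name `to_ascii_entities` is hypothetical; it relies on the standard `xmlcharrefreplace` error handler, which emits decimal numeric references (such as `&#228;`) rather than named entities (such as `&auml;`), and it does not escape `&`, `<` or `>`.

```python
import html


def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


for sample in ("Price: $20", "Beträge: 20€", "値段：二千円"):
    as_entities = to_ascii_entities(sample)
    utf8_size = len(sample.encode("utf-8"))
    print(f"{utf8_size} bytes as UTF-8, {len(as_entities)} bytes as ASCII entities: {as_entities}")
    # Decoding the entities gives back the original text.
    assert html.unescape(as_entities) == sample
```

The pure-ASCII sample is identical in both forms, while the entity form grows quickly once most characters fall outside ASCII.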

The problem with UTF-8 is that the client needs to understand UTF-8. For the last decade or so this has been of no concern though, as all modern computers and browsers have no problem understanding UTF-8. UTF-8 (Unicode) can encode virtually all characters in use today on this planet (with minor exceptions). Using it you can work with text "as-is". It should absolutely be the preferred encoding to save text in.

The problem with HTML entities is that normal characters take on a special meaning. When you write &auml;, it takes on the special meaning of "ä". If you actually intend to write the literal text "&auml;", you need to double encode the sequence as &amp;auml;.
HTML entities are also notoriously unreadable. You do not want to use them to encode "special" characters in normal text. In this capacity they're a kludge bolted onto an inadequate character set. Use Unicode instead.
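A minimal Python sketch of the double-encoding issue, using the standard library's `html.escape` and `html.unescape` (the variable names are only for illustration):

```python
import html

# The entity itself decodes to the character it names.
decoded_once = html.unescape("&auml;")
print(decoded_once)

# To show the literal six characters "&auml;" on a page, the ampersand
# has to be escaped first, giving "&amp;auml;".
literal_markup = html.escape("&auml;")
print(literal_markup)

# Decoding happens in a single pass, so the result is the literal text,
# not the character ä.
decoded_literal = html.unescape(literal_markup)
print(decoded_literal)
```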

The important use of HTML entities that is independent of the character set used is to separate HTML markup from text. HTML, too, gives special meaning to special character sequences. A sequence such as <p>text</p> is a normal sequence of characters, but it has a special meaning for HTML parsers. If you intended to just write "<p>text</p>", you will need to encode it as &lt;p&gt;text&lt;/p&gt;, so the HTML parser doesn't mistake it for HTML tags.

A ton. HTML entities are primarily intended to escape HTML markup so it can be displayed in HTML (so that what is displayed does not get mixed up with the markup itself). For instance, &gt; outputs a >, while > closes a tag. While you can produce full Unicode with HTML entities, it is very inefficient and downright ugly.

UTF-8 is a multi-byte encoding for Unicode, which covers how to represent characters outside of the classic US-ASCII code page without resorting to switching or mixing code pages. A single code point (think of it as a character, though that is not truly correct) can take up to 4 bytes of data (the original design allowed up to 6, but RFC 3629 limits UTF-8 to 4). It can represent any character in and outside of the Basic Multilingual Plane (BMP), such as accented characters, East Asian characters, and Celtic tree writing (Ogham), among other scripts.
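The byte counts can be checked with a short Python sketch. The helper name `utf8_len` and the sample characters are chosen only for illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single code point."""
    return len(ch.encode("utf-8"))


samples = [
    ("A", "ASCII letter"),
    ("\u00e4", "accented Latin letter (a with diaeresis)"),
    ("\u1681", "Ogham letter, inside the BMP"),
    ("\u5024", "East Asian character, inside the BMP"),
    ("\U0001F600", "emoji, outside the BMP"),
    ("\U0010FFFF", "highest Unicode code point"),
]

for ch, description in samples:
    print(f"U+{ord(ch):04X} {description}: {utf8_len(ch)} byte(s)")
```

Even the highest code point, U+10FFFF, fits in 4 bytes, which is why modern UTF-8 never needs the 5- and 6-byte forms of the original design.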

UTF-8 is an encoding; htmlentities is a PHP function for making user input safe to display on the page, so that HTML tags in the input are not added directly to the markup. See the PHP manual.


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.
