
utf-8 and htmlentities in RSS feeds

I'm writing some RSS feeds in PHP and struggling with character-encoding issues. Should I call utf8_encode() before or after htmlentities()? For example, I've got both ampersands and Chinese characters in a description element, and I'm not sure which of these is proper:

$output = utf8_encode(htmlentities($source)); or
$output = htmlentities(utf8_encode($source));

And why?

It's important to pass the character set to the htmlentities function, as the default is ISO-8859-1 in PHP versions before 5.4:

utf8_encode(htmlentities($source, ENT_COMPAT, 'utf-8'));

You should apply htmlentities first so that utf8_encode can then encode the entities properly.

(EDIT: Based on the comments, I have changed my earlier opinion that the order didn't matter. This code is tested and works well.)

First: The utf8_encode function converts from ISO 8859-1 to UTF-8. So you only need this function, if your input encoding/charset is ISO 8859-1. But why don't you use UTF-8 in the first place?
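To illustrate why this matters, here is a minimal sketch of what happens when the ISO 8859-1 to UTF-8 conversion is applied once and then a second time to data that is already UTF-8. The helper name latin1_to_utf8 is made up for this example; it uses mb_convert_encoding (assuming the mbstring extension is loaded), which performs the same conversion as utf8_encode, a function deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8.
// This is the same conversion utf8_encode() performs.
// Assumes the mbstring extension is available.
function latin1_to_utf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";                 // "café" in ISO-8859-1: é is the single byte E9
$utf8   = latin1_to_utf8($latin1);   // é becomes the two bytes C3 A9

// A second conversion treats C3 and A9 as two separate Latin-1
// characters and re-encodes each one, which yields mojibake.
$double = latin1_to_utf8($utf8);

echo bin2hex($utf8), "\n";    // 636166c3a9
echo bin2hex($double), "\n";  // 636166c383c2a9
```

If the input is already UTF-8 (for example Chinese text from a UTF-8 database), the conversion should be skipped entirely.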

Second: You don't need htmlentities. You just need htmlspecialchars to replace the special characters with character references. htmlentities would replace too many characters, including ones that can be encoded directly in UTF-8. It is important to use the ENT_QUOTES quote style so that single quotes are replaced as well.

So my proposal:

// if your input encoding is ISO 8859-1
htmlspecialchars(utf8_encode($string), ENT_QUOTES)

// if your input encoding is UTF-8
htmlspecialchars($string, ENT_QUOTES, 'UTF-8')
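As a rough comparison of the two functions, the sketch below runs both on the same UTF-8 string (the sample text is invented, and the script file itself is assumed to be saved as UTF-8). htmlentities emits the named entity &raquo;, which is defined in HTML but not in XML, so a strict XML parser reading the feed would reject it unless a DTD declares it.

```php
<?php
// Sample input, assumed to be UTF-8 already.
$source = "News & Updates » 中文 'quoted'";

// htmlspecialchars() only rewrites &, <, >, " and (with ENT_QUOTES) '.
$special = htmlspecialchars($source, ENT_QUOTES, 'UTF-8');

// htmlentities() additionally rewrites » as the HTML named entity &raquo;.
$entities = htmlentities($source, ENT_QUOTES, 'UTF-8');

echo $special, "\n";   // News &amp; Updates » 中文 &#039;quoted&#039;
echo $entities, "\n";  // News &amp; Updates &raquo; 中文 &#039;quoted&#039;
```

In both cases the Chinese characters pass through untouched, since they have no named entity.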

Don't use htmlentities()!

Simply use UTF-8 characters. Just make sure you declare the encoding of the feed in the HTTP headers (Content-Type: application/xml; charset=UTF-8) or, failing that, in the feed itself using <?xml version="1.0" encoding="UTF-8"?> on the first line.
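A minimal sketch of that approach, assuming all input strings are already UTF-8. The function name render_feed, the sample text, and the example.com link are placeholders for illustration, and the channel is deliberately stripped down rather than a complete feed.

```php
<?php
// Hypothetical helper: build a tiny RSS 2.0 document from UTF-8 strings.
function render_feed(string $title, string $description): string
{
    // Escape only the XML-special characters; everything else stays raw UTF-8.
    $esc = function (string $s): string {
        return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
    };

    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<rss version="2.0"><channel>'
        . '<title>' . $esc($title) . '</title>'
        . '<link>https://example.com/</link>'   // placeholder URL
        . '<description>' . $esc($description) . '</description>'
        . '</channel></rss>';
}

$xml = render_feed('News & Updates', '中文 & more');

// When serving over HTTP, declare the charset before any output is sent.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: application/xml; charset=UTF-8');
}
echo $xml, "\n";
```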

It might be easier to forget htmlentities and use a CDATA section. It works for the title element, which doesn't seem to support encoded HTML characters in Firefox's RSS viewer:

<title><![CDATA[News & Updates  " > » ☂ ☺ ☹ ☃  Test!]]></title>
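One caveat with CDATA is that the text must not contain the terminator ]]> itself. A possible way to handle that, sketched here with a made-up helper named cdata, is to split the terminator across two adjacent CDATA sections. The input is assumed to be UTF-8 and the feed declared as UTF-8.

```php
<?php
// Hypothetical helper: wrap UTF-8 text in a CDATA section.
function cdata(string $text): string
{
    // "]]>" would end the section early, so close the section after "]]"
    // and reopen a new one before ">".
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

echo '<title>', cdata('News & Updates " > » ☺ Test!'), '</title>', "\n";
echo cdata('a]]>b'), "\n";  // <![CDATA[a]]]]><![CDATA[>b]]>
```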

You want to do $output = htmlentities(utf8_encode($source));. This is because you want to convert your international characters into proper UTF-8 first, and then have ampersands (and possibly some of the UTF-8 characters as well) turned into HTML entities. If you do the entities first, then some of the international characters may not be handled properly.

If none of your international characters are going to be changed by utf8_encode, then it doesn't matter which order you call them in.

After much trial & error, I finally found a way to properly display a string from a utf8-encoded database value, through an xml file, to an html page:

$output = '<![CDATA['.utf8_encode(htmlentities($string)).']]>';

I hope this helps someone.
