
charset detection, meta vs header

We recently ran into some trouble when trying to determine the correct encoding of a page. We encountered a page with the following setup:

Response header:

Content-Type:text/html; charset=GBK

Meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The actual content is in GBK, and modern browsers are smart enough to use the right encoding for this page.

But for a crawler (using curl), we are forced to pick one charset value over the other. So my question is: is taking the header charset over the meta charset the normal thing to do?

(Most content-based encoding detection algorithms we have tried are shaky at best. As long as one declared charset is more reliable than the other, we prefer using a declared charset over anything from our own encoding detection.)

Is taking the header charset over the meta charset the normal thing to do?

Yes. See the specification.

The HTTP headers are checked at step 4. The meta tag isn't examined until step 5 (if it appears early enough in the file) or step 9 (otherwise).
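For a crawler, that precedence could be sketched roughly as follows. This is a minimal illustration, not the specification's algorithm: the function names (`charset_from_header`, `charset_from_meta`, `pick_charset`), the simple regular expressions, the 1024-byte scan window, and the `utf-8` fallback are all assumptions made for the example.

```python
import re
from typing import Optional

# Hypothetical helper patterns: a loose match for "charset=..." in a
# Content-Type header value, and the same inside a <meta> tag in raw bytes.
_HEADER_RE = re.compile(r'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)
_META_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Return the charset declared in a Content-Type header value, if any."""
    if not content_type:
        return None
    match = _HEADER_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(body: bytes, scan_limit: int = 1024) -> Optional[str]:
    """Return the charset declared in a <meta> tag near the start of the body."""
    # Only the beginning of the document is scanned (scan_limit is an assumption).
    match = _META_RE.search(body[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


def pick_charset(content_type: Optional[str], body: bytes,
                 default: str = "utf-8") -> str:
    """Prefer the header charset, then the meta charset, then a default."""
    return charset_from_header(content_type) or charset_from_meta(body) or default


if __name__ == "__main__":
    page = (
        b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8" /></head><body>'
        + "中文".encode("gbk")
        + b"</body></html>"
    )
    charset = pick_charset("text/html; charset=GBK", page)
    print(charset)
    print(page.decode(charset))
```

With the page from the question, the header value wins, so the body is decoded as GBK even though the meta tag claims UTF-8. The meta tag is only consulted when the header carries no charset.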

