I'm building a small parser that scrapes web pages and logs the data on them. One of the things to log is the title of forum posts. I'm using an XML parser to look through the DOM and get this information, and I'm storing it like this:
// Strip out the post's title
$title = $page->find('a[rel=bookmark]', 0);
$title = htmlspecialchars_decode(html_entity_decode(trim($title->plaintext)));
This works for the most part, but some posts contain HTML character codes such as &#8211;, which is an en dash (–). How would I go about converting these character codes back into their original characters?
Thanks.
Use html_entity_decode. Here's a quick example.
$string = "hyphenated&#8211;words";
$new = html_entity_decode($string);
echo $new;
You should see...
hyphenated–words
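The documentation is your friend: html_entity_decode takes optional flags and encoding arguments, so pass both explicitly (YOUR_ENCODING stands for the character set you want, such as 'UTF-8'):
html_entity_decode(trim($title->plaintext), ENT_XHTML, YOUR_ENCODING);
                                            ^^^^^^^^^^^^^^^^^^^^^^^^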
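As a rough sketch of how the flags and encoding arguments fit together, the snippet below decodes a scraped title with both passed explicitly. The helper name decode_title, the ENT_QUOTES | ENT_HTML5 flags, and the UTF-8 target are assumptions for illustration, not something from the question; substitute whatever encoding your pages actually use.

```php
<?php
// Hypothetical helper (the name is illustrative): decode entities in a scraped title.
function decode_title(string $raw): string
{
    // ENT_QUOTES also decodes &quot; and &#039;; ENT_HTML5 selects the HTML5 entity table.
    // 'UTF-8' is an assumed target encoding.
    return html_entity_decode(trim($raw), ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_title(" hyphenated&#8211;words "), PHP_EOL;
```

Passing the encoding explicitly avoids depending on the default, which has differed between PHP versions.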
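This might help:
<?php
// Convert special characters to HTML entities. This is the encoding direction;
// pass the table through array_flip() to map entities back to characters.
function clean_up($str){
    // Remove backslash escapes
    $str = stripslashes($str);
    // Replace each character that has a named entity with that entity
    $str = strtr($str, get_html_translation_table(HTML_ENTITIES));
    // Map Windows-1252 punctuation bytes to numeric entities (assumes single-byte input)
    $str = str_replace(array("\x82", "\x84", "\x85", "\x91", "\x92", "\x93", "\x94", "\x95", "\x96", "\x97"), array("&#8218;", "&#8222;", "&#8230;", "&#8216;", "&#8217;", "&#8220;", "&#8221;", "&#8226;", "&#8211;", "&#8212;"), $str);
    return $str;
}
?>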
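The table returned by get_html_translation_table maps characters to entities, so using it to decode requires flipping it first. Here is a minimal sketch of that reverse direction; the helper name decode_named_entities and the ENT_QUOTES | ENT_HTML401 and UTF-8 arguments are assumptions chosen for the example.

```php
<?php
// Hypothetical helper (the name is illustrative): decode named entities via the flipped table.
function decode_named_entities(string $str): string
{
    // The table maps characters to entities, e.g. "&" => "&amp;".
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    // Flip it so entities become the keys, then substitute in a single pass.
    return strtr($str, array_flip($table));
}

echo decode_named_entities("hyphenated&ndash;words"), PHP_EOL;
```

This approach only covers entities that appear in the table. Numeric references such as &#8211; are generally not in it, so html_entity_decode remains the simpler choice for those.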