
How should I deal with character encodings when storing crawled web content for a search engine into a MySQL database?

I have a crawler that downloads webpages, scrapes specific content, and stores that content in a MySQL database. Later, that content is displayed on a webpage when it's searched for (a standard search-engine setup).

The content generally arrives in one of two encodings, UTF-8 or ISO-8859-1, or the encoding is not specified at all. My database tables use cp1252 West European (latin1) encoding. Up until now, I've simply filtered out all characters that are not alphanumeric, spaces, or punctuation with a regular expression before storing the content in MySQL. For the most part this has eliminated character-encoding problems, and content displays properly when recalled and output as HTML. Here is the code I use:

function clean_string( $string, $mysqli )
{
    // $mysqli must be passed in (or otherwise be in scope); the original
    // version referenced it inside the function without declaring it.
    $string = trim( $string );

    // Strip everything except letters, digits, whitespace and punctuation
    $string = preg_replace( '/[^a-zA-Z0-9\s\p{P}]/', '', $string );

    // Escape for use in an SQL statement
    $string = $mysqli->real_escape_string( $string );

    return $string;
}

I now need to start capturing "special" characters like the trademark (™), copyright (©), and registered (®) symbols, and am having trouble. No matter what I try, I end up with weird characters when I redisplay the content in HTML.

From what I've read, it sounds like I should use UTF-8 for my database encoding. How do I ensure all my data is converted properly before storing it in the database? Remember that my original content comes from all over the web in various encodings. Are there other steps I'm overlooking that may be causing problems?

You should convert your database encoding to UTF-8.

About the content: for every page you crawl, grab the page's encoding (from the HTTP Content-Type header or the <meta> charset tag) and use that encoding to convert the content to UTF-8, like this:

$string = iconv("THIS STRING'S ENCODING", "UTF-8", $string);

where THIS STRING'S ENCODING is the one you just grabbed as described above. Note that iconv() takes the source encoding as its first argument and the target encoding as its second.

PHP manual on iconv: http://be2.php.net/manual/en/function.iconv.php
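Putting the above together, here is a minimal sketch of that detect-then-convert step. The to_utf8 helper name and the fallback order are our own assumptions, not part of the answer:

```php
<?php
// Sketch: pick the encoding from the HTTP header, fall back to the
// <meta> tag in the markup, then let mbstring guess.
function to_utf8($html, $contentTypeHeader = '')
{
    $encoding = '';

    // 1. HTTP header, e.g. "text/html; charset=ISO-8859-1"
    if (preg_match('/charset=([\w-]+)/i', (string) $contentTypeHeader, $m)) {
        $encoding = $m[1];
    }

    // 2. <meta charset="..."> or the http-equiv variant in the markup
    if ($encoding === '' &&
        preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
        $encoding = $m[1];
    }

    // 3. Last resort: guess, defaulting to ISO-8859-1
    if ($encoding === '') {
        $detected = mb_detect_encoding($html, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
        $encoding = ($detected !== false) ? $detected : 'ISO-8859-1';
    }

    // //IGNORE drops any bytes that cannot be represented in UTF-8
    return iconv($encoding, 'UTF-8//IGNORE', $html);
}
```

Run everything through a helper like this before it reaches clean_string() or the database, so that only UTF-8 is ever stored.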

The following worked for me when scraping data and presenting it on an HTML page:

  1. While scraping the data from the external website, run it through utf8_encode: utf8_encode(trim(str_replace(array("\t","\n\r","\n","\r"),"",trim($th->plaintext))));
  2. Before writing to the HTML page, set the charset to UTF-8: <meta charset="UTF-8">
  3. While echoing out on the HTML page, run utf8_decode: echo "Menu Item: " . utf8_decode($value['item']);

This solved my HTML scraping issues. I hope someone else finds it useful.

UTF-8 encompasses just about everything. It would definitely be my choice.

As far as storing the data goes, just ensure the connection to your database uses the proper charset. See the manual.
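A minimal sketch of that connection setup; the host, credentials, and database name are placeholders:

```php
<?php
// Placeholders throughout; adjust to your own environment.
$mysqli = new mysqli('localhost', 'user', 'password', 'crawler_db');

// Make the client/server connection speak UTF-8 so data is not mangled
// on the way in or on the way out. utf8mb4 is MySQL's full-range UTF-8.
$mysqli->set_charset('utf8mb4');
```

Setting the charset on the connection (rather than per query with SET NAMES) also keeps real_escape_string() aware of the encoding it is escaping.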

To deal with the ISO encoding, simply use utf8_encode when you store the data and utf8_decode when you retrieve it.

Try doing the encoding/decoding even when it's supposedly UTF-8 and see if that works for you. I've often seen people say something is UTF-8 when it isn't.
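For what it's worth, utf8_encode() specifically assumes its input is ISO-8859-1, and it is deprecated as of PHP 8.2; mb_convert_encoding() is the equivalent, explicit form. The example string here is ours, not from the answer:

```php
<?php
// "\xA9" is the copyright sign in ISO-8859-1
$latin1 = "\xA9 2024";

// Same effect as utf8_encode($latin1), with the source encoding spelled out
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
// $utf8 now holds the copyright sign followed by " 2024" as valid UTF-8
```

The reverse direction (utf8_decode) corresponds to mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8').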

You'll also need to change your database to UTF-8.
