
Java web crawler cannot recognize non-English characters

I crawled a list of movies and stored them in my database. Everything works fine for titles that contain only English characters, but titles that contain non-English characters are not displayed correctly. For example, the Italian movie "Il più crudele dei giorni" is stored as "Il pi&ugrave; crudele dei giorni".

Could someone let me know if there is a solution? (I know that I can set the language for the crawler, and I have already crawled the movie titles in Italian as well, but when I crawl the English titles, some movies on IMDb still have non-English characters in their names.)

EDIT: Here is my code:

String baseUrl = "http://www.imdb.com/search/title?at=0&count=250&sort=num_votes,desc&start="+start+"&title_type=feature&view=simple";

label1:  try {

     org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21").header("Accept-Language", "en");
     con.timeout(30000).ignoreHttpErrors(true).followRedirects(true);
     Response resp = con.execute();
     Document doc = null;

     if (resp.statusCode() == 200) {

         doc = con.get();                                       

         Elements myElements = doc.getElementsByClass("results").first().getElementsByTag("table");
         Elements trs = myElements.select(":not(thead) tr");

         for (int i = 0; i < trs.size(); i++) {

             Element tr = trs.get(i);
             Elements tds = tr.select("td");

             for (int j = 3; j < tds.size(); j++) {

                 Elements links = tds.select("a[href]");
                 String titleId = links.attr("href");
                 String movietitle = links.html();    

                 // I ADDED YOUR CODE HERE
                 Charset c = Charset.forName("UTF-16BE");

                 ByteBuffer b = c.encode(movietitle);
                 for (int m = 0; b.hasRemaining(); m++) {
                     int charValue = (b.get()) & 0xff;
                     System.out.print((char) charValue);
                 }

                 // try {
                 //     String query = "INSERT into test (movieName,ImdbId)" + "VALUES (?,?)";
                 //     PreparedStatement preparedStmt = conn.prepareStatement(query);
                 //     preparedStmt.setString (1, movietitle);
                 //     preparedStmt.setString (2, titleId );
                 // } catch (Exception e) {
                 //     e.printStackTrace();
                 // }

Thanks,

Here, I copied the string from the question and tried the following:

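import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Test {
    public static void main (String...a) throws Exception {
        String s = "Il più crudele dei giorni";
        Charset c = Charset.forName("UTF-16BE");

        // Encode to UTF-16BE (two bytes per char here) and print each byte as a char.
        // For characters up to U+00FF the high byte is 0, which prints as NUL.
        ByteBuffer b = c.encode(s);
        while (b.hasRemaining()) {
            int charValue = (b.get()) & 0xff;
            System.out.print((char) charValue);
        }
    }
}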

This prints s unchanged on the console. I assume that you already have the part of the code that writes to a file. You can try integrating the code above if it works for you.
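
A note on the stored value: the "&ugrave;" in it is an HTML character entity, which suggests the title was saved as escaped markup (the question's code reads it with links.html()) rather than being damaged by a character set. In jsoup, text() returns the text with entities decoded, so reading the title with links.text() may be all that is needed. For strings that are already stored in escaped form, the sketch below decodes them using only the standard library. The class name EntityDecoder and its small entity table are assumptions made for illustration and are not part of the original code; a real decoder would need the full HTML entity list.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDecoder {
    // Hypothetical, deliberately tiny table (assumption): extend it as needed.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"",
            "ugrave", "\u00f9", "egrave", "\u00e8",
            "agrave", "\u00e0", "eacute", "\u00e9");

    // Matches &#xHEX; or &#DEC; or &name;
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: leave the entity untouched
            if (body.startsWith("#")) {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else {
                replacement = NAMED.getOrDefault(body, replacement);
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped title from the question
        System.out.println(decode("Il pi&ugrave; crudele dei giorni"));
    }
}
```

Here decode returns the title with "&ugrave;" replaced by the single character ù, and entities that are not in the table are left as they are. Whether the console then displays ù correctly still depends on the console's own encoding.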

