
Java web crawler cannot recognize non-English characters

I crawled a list of movies and stored them in my database. Everything works fine for movies whose titles contain only English characters, but some titles that contain non-English characters are not displayed correctly. For example, the Italian movie "Il più crudele dei giorni" is stored as "Il pi&ugrave; crudele dei giorni".

Could someone kindly let me know if there is a solution? (I know that I can set the language for the crawler, and I have already crawled movie titles in Italian as well, but when I crawl English titles, there are still some movies on IMDb whose titles contain non-English characters.)

EDIT: Here is my code:

String baseUrl = "http://www.imdb.com/search/title?at=0&count=250&sort=num_votes,desc&start="+start+"&title_type=feature&view=simple";

label1:  try {

     org.jsoup.Connection con = Jsoup.connect(baseUrl).userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21").header("Accept-Language", "en");
     con.timeout(30000).ignoreHttpErrors(true).followRedirects(true);
     Response resp = con.execute();
     Document doc = null;

     if (resp.statusCode() == 200) {

         doc = con.get();                                       

         Elements myElements = doc.getElementsByClass("results").first().getElementsByTag("table");
         Elements trs = myElements.select(":not(thead) tr");

         for (int i = 0; i < trs.size(); i++) {

             Element tr = trs.get(i);
             Elements tds = tr.select("td");

             for (int j = 3; j < tds.size(); j++) {

                 Elements links = tds.select("a[href]");
                 String titleId = links.attr("href");
                 String movietitle = links.html();    

                 // I ADDED YOUR CODE HERE
                 Charset c = Charset.forName("UTF-16BE");

                 ByteBuffer b = c.encode(movietitle);
                 for (int m = 0; b.hasRemaining(); m++) {
                     int charValue = (b.get()) & 0xff;
                     System.out.print((char) charValue);
                 }

                 // try {
                 //     String query = "INSERT into test (movieName,ImdbId)" + "VALUES (?,?)";
                 //     PreparedStatement preparedStmt = conn.prepareStatement(query);
                 //     preparedStmt.setString(1, movietitle);
                 //     preparedStmt.setString(2, titleId);
                 // } catch (Exception e) {
                 //     e.printStackTrace();
                 // }

Thanks,

Here, I copy-pasted the string shared in the question and tried:

import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class Test {
    public static void main(String... a) {
        String s = "Il più crudele dei giorni";
        Charset c = Charset.forName("UTF-16BE");

        // Encode the string as UTF-16BE, then print each byte as a character.
        ByteBuffer b = c.encode(s);
        while (b.hasRemaining()) {
            int charValue = b.get() & 0xff;
            System.out.print((char) charValue);
        }
    }
}

This prints s to the console as it is. I assume that you already have the part of the code that writes to a file. You can try integrating the above code if it works for you.
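One caveat worth checking before integrating it: the loop above only re-emits the bytes of whatever string it is given, so it would not turn an HTML entity such as &ugrave; back into ù. The sketch below is a minimal, standard-library-only illustration of that point; the class name EntityDemo and the helper bytesAsChars are hypothetical names introduced here, not part of the original code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class EntityDemo {

    // Mirrors the loop from the answer: encode to UTF-16BE and
    // turn every resulting byte into a char.
    public static String bytesAsChars(String s) {
        ByteBuffer b = StandardCharsets.UTF_16BE.encode(s);
        StringBuilder out = new StringBuilder();
        while (b.hasRemaining()) {
            out.append((char) (b.get() & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The title as it is stored according to the question.
        String stored = "Il pi&ugrave; crudele dei giorni";
        String viaLoop = bytesAsChars(stored);

        // Each UTF-16BE code unit is two bytes, so the output is twice as long:
        // a NUL high byte followed by the original character.
        System.out.println(viaLoop.length() == 2 * stored.length()); // prints true

        // Dropping the NUL bytes gives back the input, entity included.
        System.out.println(viaLoop.replace("\0", "").equals(stored)); // prints true
    }
}
```

If that matches what you see, the entity is probably coming from links.html() in the question's code, since html() returns markup in which characters can be escaped. jsoup's text() method returns the decoded text instead, so replacing links.html() with links.text() may be worth trying; this is an assumption about the cause, not something confirmed in this thread.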
