Parsing HTML with jsoup: differences between Android and Java

I ran into a problem with jsoup. I wrote code in Java that parses some information from a website, and it works perfectly.
But when I copied the code into Android (wrapped in an AsyncTask), the document I get back from Jsoup.connect() is different from the one I get in plain Java.
Why?

The relevant lines are:

Document doc = null;
try {
    doc=Jsoup.connect("myurl").timeout(10000).get();
} catch (IOException e) {
    e.printStackTrace();
}

Element body = doc.body();      
Element figlio = body.child(0);     
Elements span_elements = figlio.getElementsByTag("span");

I have posted the complete Java and Android code below.

Java

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MainClass {

    public static void main(String[] args) {
        String ProductName = "";
        String Description = "";
        String LongDescription = "";
        String Category = "";

        // Fetch and parse the page; stop if the request fails
        Document doc;
        try {
            doc = Jsoup.connect("http://eandata.com/lookup/9788820333584/").timeout(10000).get();
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }

        Element body = doc.body();

        // First child element of <body>
        Element figlio = body.child(0);

        Elements span_elements = figlio.getElementsByTag("span");

        for (Element p : span_elements) {
            // Element.id() returns an empty string when the id attribute is missing
            if (p.id().isEmpty()) {
                continue;
            } else if (p.id().compareTo("upc_prod_product_o") == 0) {
                ProductName = p.text();
            } else if (p.id().compareTo("upc_prod_description_o") == 0) {
                Description = p.text();
            } else if (p.id().compareTo("upc_prod_cat_path_o") == 0) {
                Category = p.text();
            } else if (p.id().compareTo("upc_prod_url_o") == 0) {
                // The product URL is not used
                continue;
            } else if (p.id().compareTo("upc_prod_long_desc_o") == 0) {
                LongDescription = p.text();
            }
        }

        System.out.println(ProductName);
        System.out.println(Description);
        System.out.println(Category);
        System.out.println(LongDescription);
    }
}

This is the Android code instead (I have included the INTERNET permission in AndroidManifest):

public class MainActivity extends Activity {

    // Fields needed by the HTML parser
    String ProductName = "";
    String Description = "";
    String LongDescription = "";
    String Category = "";

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        HttpHTML task3 = new HttpHTML();
        task3.execute();
    }

    public class HttpHTML extends AsyncTask<Void,Void,Void> {

        @Override
        protected void onPreExecute() {
        }

        @Override
        protected Void doInBackground(Void...params) {
            Document doc = null;
            try {
                doc = Jsoup.connect("http://eandata.com/lookup/9788820333584/").timeout(10000).get();
            } catch (IOException e) {
                e.printStackTrace();
                // Without a document there is nothing to parse
                return null;
            }

            // Get the <body> element of the document
            Element body = doc.body();
            System.out.println(body.text());

            // Take the first child element of the body
            Element figlio = body.child(0);
            System.out.println(figlio.text());

            Elements span_elements = figlio.getElementsByTag("span");

            for(Element p : span_elements) {

                // Element.id() returns an empty string when the id attribute is missing
                if (p.id().isEmpty()) {
                    continue;
                }

                else if(p.id().compareTo("upc_prod_product_o") == 0) {
                    ProductName = p.text();
                    continue;
                }

                else if(p.id().compareTo("upc_prod_description_o") == 0) {
                    Description = p.text();
                    continue;
                }

                else if(p.id().compareTo("upc_prod_cat_path_o") == 0) {
                    Category = p.text();
                    continue;
                }

                else if(p.id().compareTo("upc_prod_url_o") == 0) {
                    continue;
                }

                else if(p.id().compareTo("upc_prod_long_desc_o") == 0) {
                    LongDescription = p.text();
                    continue;
                }

            }

            System.out.println(ProductName);
            System.out.println(Description);
            System.out.println(Category);
            System.out.println(LongDescription);

            return null;
        }

        @Override
        protected void onProgressUpdate(Void... values) {
        }

        @Override
        protected void onPostExecute(Void result) {

        }

    }




}

Without knowing the URL you're hitting, this is just a guess, but I would bet $5 I'm right: the server is sending back different HTML based on your user-agent string, and because you're not setting it explicitly, a default is used. That default differs between Android and Java. The server is trying to be helpful and is giving you mobile-optimized HTML on Android.

Make sure you specify a user-agent when building your request. See the Connection.userAgent() docs for details. I normally set it to the user-agent string of my current browser.
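As a minimal sketch of that suggestion, the request can be built with an explicit user-agent so that desktop Java and Android send the same header. The class name, the `prepare` helper, and the browser string below are illustrative assumptions, not part of the original code; any current browser user-agent string could be substituted.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UserAgentExample {

    // Example desktop-browser user-agent string (an assumption; use your own browser's string).
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds the request without sending it, with the user-agent and timeout set explicitly.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent(BROWSER_UA)
                .timeout(10000);
    }

    public static void main(String[] args) throws IOException {
        String url = args.length > 0 ? args[0] : "http://eandata.com/lookup/9788820333584/";
        Connection connection = prepare(url);

        // Print the User-Agent header that will be sent with the request.
        System.out.println(connection.request().header("User-Agent"));

        // Only hit the network when a URL is passed on the command line.
        if (args.length > 0) {
            Document doc = connection.get();
            System.out.println(doc.title());
        }
    }
}
```

On Android, the `connection.get()` call would still need to run off the main thread, for example inside `doInBackground` as in the question.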

Very interesting problem. If you look at the website, the interesting part of the information is loaded dynamically. jsoup does not execute JavaScript, so it is not supposed to see this part. I don't understand why it works differently on Android, but that is not important. I found the URL that the interesting information is loaded from.

Try parsing this one. An added benefit is that it returns a smaller dataset, so it uses less memory and could be faster on Android.

http://eandata.com/lookup.php?extra=x&code=9788820333584&mode=prod&show=&force_amazon=&ajax=1
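A rough sketch of how such a response could be read, assuming it contains the same span ids used in the question (for example `upc_prod_product_o`). The sample markup, the class name, and the `textById` helper are hypothetical; the real response may be structured differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ProductFragmentExample {

    // Returns the text of the element with the given id, or "" if no such element exists.
    static String textById(Document doc, String id) {
        Element element = doc.getElementById(id);
        return element == null ? "" : element.text();
    }

    public static void main(String[] args) {
        // Hypothetical sample markup standing in for the server response.
        String html = "<div>"
                + "<span id=\"upc_prod_product_o\">Sample Book</span>"
                + "<span id=\"upc_prod_cat_path_o\">Books</span>"
                + "</div>";

        // For a live request, Jsoup.connect(url).get() would replace Jsoup.parse(html).
        Document doc = Jsoup.parse(html);

        System.out.println(textById(doc, "upc_prod_product_o"));
        System.out.println(textById(doc, "upc_prod_cat_path_o"));
        System.out.println(textById(doc, "upc_prod_long_desc_o").isEmpty());
    }
}
```

Looking elements up by id avoids depending on `body.child(0)`, which breaks as soon as the server returns a different page layout.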
