Parsing HTML with jsoup: differences between Android and Java

I ran into a problem with jsoup. I wrote code in Java that parses some information from a web site, and it works perfectly.
But when I copied the code into Android (wrapped in an AsyncTask), the document returned by Jsoup.connect() is different from the one I get in Java.
Why?

Some code lines are:

Document doc = null;
try {
    doc=Jsoup.connect("myurl").timeout(10000).get();
} catch (IOException e) {
    e.printStackTrace();
}

Element body = doc.body();      
Element figlio = body.child(0);     
Elements span_elements = figlio.getElementsByTag("span");

Here is the complete code for both Java and Android.

JAVA

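import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MainClass {

    public static void main(String[] args){
        String ProductName = "";
        String Description = "";
        String LongDescription = "";
        String Category = "";

        Document doc = null;
        try {
            doc = Jsoup.connect("http://eandata.com/lookup/9788820333584/").timeout(10000).get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // doc is still null here, so stop instead of hitting a NullPointerException below
        }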

        Element body = doc.body();

        Element figlio = body.child(0);

        Elements span_elements = figlio.getElementsByTag("span");

        for(Element p : span_elements) {

            // Element.id() returns an empty string (never null) when the id attribute is absent
            if (p.id().isEmpty()) {
                continue;
            }

            else if(p.id().compareTo("upc_prod_product_o") == 0) {
                ProductName = p.text();
                continue;
            }

            else if(p.id().compareTo("upc_prod_description_o") == 0) {
                Description = p.text();
                continue;
            }

            else if(p.id().compareTo("upc_prod_cat_path_o") == 0) {
                Category = p.text();
                continue;
            }

            else if(p.id().compareTo("upc_prod_url_o") == 0) {
                continue;
            }

            else if(p.id().compareTo("upc_prod_long_desc_o") == 0) {
                LongDescription = p.text();
                continue;
            }

        }

        System.out.println(ProductName);
        System.out.println(Description);
        System.out.println(Category);
        System.out.println(LongDescription);
    }
}

This is the Android code instead (I have included the INTERNET permission in AndroidManifest).

ANDROID

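import java.io.IOException;

import android.app.Activity;
import android.os.AsyncTask;
import android.os.Bundle;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class MainActivity extends Activity {

    // Fields needed by the HTML parser
    String ProductName = "";
    String Description = "";
    String LongDescription = "";
    String Category = "";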

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        HttpHTML task3 = new HttpHTML();
        task3.execute();
    }

    public class HttpHTML extends AsyncTask<Void,Void,Void> {

        @Override
        protected void onPreExecute() {
        }

        @Override
        protected Void doInBackground(Void...params) {
            Document doc = null;
            try {
                doc = Jsoup.connect("http://eandata.com/lookup/9788820333584/").timeout(10000).get();
            } catch (IOException e) {
                e.printStackTrace();
                return null; // doc is still null here, so stop instead of hitting a NullPointerException below
            }

            // Access the document's <body> element
            Element body = doc.body();
            System.out.println(body.text());

            // Take the first child element of the body
            Element figlio = body.child(0);
            System.out.println(figlio.text());

            Elements span_elements = figlio.getElementsByTag("span");

            for(Element p : span_elements) {

                // Element.id() returns an empty string (never null) when the id attribute is absent
                if (p.id().isEmpty()) {
                    continue;
                }

                else if(p.id().compareTo("upc_prod_product_o") == 0) {
                    ProductName = p.text();
                    continue;
                }

                else if(p.id().compareTo("upc_prod_description_o") == 0) {
                    Description = p.text();
                    continue;
                }

                else if(p.id().compareTo("upc_prod_cat_path_o") == 0) {
                    Category = p.text();
                    continue;
                }

                else if(p.id().compareTo("upc_prod_url_o") == 0) {
                    continue;
                }

                else if(p.id().compareTo("upc_prod_long_desc_o") == 0) {
                    LongDescription = p.text();
                    continue;
                }

            }

            System.out.println(ProductName);
            System.out.println(Description);
            System.out.println(Category);
            System.out.println(LongDescription);

            return null;
        }

        @Override
        protected void onProgressUpdate(Void... values) {
        }

        @Override
        protected void onPostExecute(Void result) {

        }

    }




}

Without knowing the URL you're hitting, this is just a guess, but I would bet $5 I'm right: the server is sending back different HTML based on your User-Agent string. Because you're not setting it explicitly, each platform falls back to its own default, and the default on Android differs from the one on desktop Java. The server is trying to be helpful and is serving mobile-optimized HTML to Android.

Make sure you specify a user-agent when building your request. See the Connection.userAgent() docs for details. I normally set it to my current browser.
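To make the idea concrete, here is a minimal sketch of sending an explicit User-Agent header. It deliberately uses only java.net so that it runs without extra libraries; with jsoup, the same header is set by calling userAgent(...) on the Connection returned by Jsoup.connect(). The class name UserAgentDemo, the method prepare, and the browser string in BROWSER_UA are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;

class UserAgentDemo {
    // Hypothetical browser-like value; substitute the string your own browser sends.
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Prepares (but does not send) a GET request that carries an explicit User-Agent,
    // so the server sees the same value on desktop Java and on Android.
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(10000); // same 10 s budget as the question's timeout(10000)
            conn.setReadTimeout(10000);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn =
                prepare("http://eandata.com/lookup/9788820333584/", BROWSER_UA);
        // Prints the header that will be sent once the connection is opened.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

If the two platforms still return different documents with an identical User-Agent, the header is probably not the cause and the difference lies elsewhere.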

Very interesting problem. If you look at the website, the interesting part of the information is loaded dynamically. jsoup does not execute JavaScript, so it is not expected to see that part. I don't understand why it works differently on Android, but that is not important: I found the URL the interesting information is loaded from.

Try parsing this one. As an added benefit, it returns a smaller payload, so it uses less memory and could be quicker on Android.

http://eandata.com/lookup.php?extra=x&code=9788820333584&mode=prod&show=&force_amazon=&ajax=1
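If other barcodes need to be looked up the same way, the URL above can be built from the code. The following is only a sketch: the query parameters are copied verbatim from the URL above without knowing what each one means, and the class name EanLookupUrl and method forCode are hypothetical. The resulting string can be passed to Jsoup.connect() in place of the original page URL.

```java
class EanLookupUrl {
    // Builds the AJAX lookup URL for a numeric barcode.
    // Assumption: only the "code" parameter changes between products.
    static String forCode(String code) {
        if (code == null || !code.matches("\\d+")) {
            throw new IllegalArgumentException("Barcode must contain digits only: " + code);
        }
        return "http://eandata.com/lookup.php?extra=x&code=" + code
                + "&mode=prod&show=&force_amazon=&ajax=1";
    }

    public static void main(String[] args) {
        System.out.println(forCode("9788820333584"));
    }
}
```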

