
Jsoup for an unstructured HTML page with tables

I'm trying to get the main image from this URL. Here is what I have tried so far:

Document doc = null;
try {
    doc = Jsoup.connect(url).get();
} catch (IOException e) {
    e.printStackTrace();
}

Element table = doc.select("center").get(1);
Elements rows = table.select("table[width=970]");
for (int i = 0; i < rows.size(); i++) {
    Element row = rows.get(1);
    Elements cols = row.select("table[width=634]");
    for (int j = 0; j < cols.size(); j++) {
        Element row1 = rows.get(1);
        Elements cols1 = row1.select("table[width=600]");
        for (int k = 0; k < cols1.size(); k++){
            Element row0 = rows.first();
            Elements cols0 = row0.select("td");
            for (Element image : cols0) {
                String image2 = image.absUrl("src").toString();
                Log.i("tanja7 ", "pic  " + image2);
            }
        }
    }
}

This is the unstructured HTML page (I don't know how to copy the HTML code): [image]. What am I doing wrong?

It seems that you expect a Jsoup select call to return the inner elements of whatever you call it on. That is not how it works: select returns the elements that match the selector within the search scope, and the search scope is the Element, Elements, or Document instance on which you call select. So doc.select("table") gives you all table elements in the document, not their rows. Maybe you already understood this, but your variable naming suggests otherwise.
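To make the search-scope point concrete, here is a minimal sketch that parses a small made-up HTML string. The markup and the class name SelectScopeDemo are assumptions for illustration, not the asker's page. It shows that select returns the matching elements themselves, and that calling select on one element only searches inside that element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectScopeDemo {
    // Made-up markup for illustration: an outer table containing an inner table.
    static final String HTML =
        "<table width=\"970\"><tr><td>"
      + "<table width=\"600\"><tr><td>a</td><td>b</td></tr></table>"
      + "</td></tr></table>";

    // Counts the <table> elements in the whole document (scope: the document).
    static int countTables() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("table").size();
    }

    // Counts the <td> cells found when searching only inside the inner table.
    static int countInnerCells() {
        Document doc = Jsoup.parse(HTML);
        Element inner = doc.select("table[width=600]").first();
        return inner.select("td").size();
    }

    public static void main(String[] args) {
        System.out.println("tables in document: " + countTables());
        System.out.println("cells in inner table: " + countInnerCells());
    }
}
```

The outer table's own cell is not counted by countInnerCells, because it lies outside the element the search starts from.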

Anyway, here is a selector that works. It gets all img elements that are descendants (not necessarily direct children) of a table that has the attribute width=600 and is itself nested inside another table of the document.

Elements imgEls = doc.select("table table[width=600] img");
System.out.println(imgEls.first().absUrl("src"));

You say the HTML is not structured, so you may want to check that the relevant images really are always nested inside two tables, as the selector requires.
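One way to guard against pages where that nesting is missing is to check for a match before reading the attribute. The following is only a sketch: the class name MainImageFinder, the sample markup, and the base URI are made up for illustration, and it assumes a jsoup version that provides selectFirst (1.11.1 or later, as far as I know).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MainImageFinder {
    // Returns the absolute URL of the first matching image,
    // or null if no image sits inside a nested table with width=600.
    static String findMainImage(Document doc) {
        Element img = doc.selectFirst("table table[width=600] img");
        return img == null ? null : img.absUrl("src");
    }

    public static void main(String[] args) {
        // Made-up markup and base URI, for illustration only.
        String html = "<table><tr><td><table width=\"600\"><tr><td>"
                    + "<img src=\"/pics/main.jpg\"></td></tr></table></td></tr></table>";
        Document doc = Jsoup.parse(html, "https://example.com/");
        System.out.println(findMainImage(doc));
    }
}
```

Returning null instead of calling first().absUrl(...) directly avoids a NullPointerException when the selector matches nothing; the caller can then fall back to a looser selector or log the page for inspection.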

Update: if you are running this on a mobile device, make sure to set a user agent:

doc = Jsoup.connect(url).userAgent("Mozilla").get();
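Putting the user agent together with the fetch, a hedged sketch could look like this. The class name PageFetcher and the 10-second timeout are my own assumptions, not something the question specifies. Note that the code in the question keeps using doc after a failed request, which would throw a NullPointerException, so this version makes the failure explicit.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageFetcher {
    // Builds the request without sending it.
    // The timeout value is an assumption chosen for illustration.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla")
                .timeout(10_000); // milliseconds
    }

    // Sends the request and returns the parsed page, or null if it failed.
    static Document fetch(String url) {
        try {
            return prepare(url).get();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
}
```

A caller would check the result of fetch for null before running any select on it. On Android, the request also has to run off the main thread.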


 