jsoup for unstructured html page with table

I'm trying to get the main img from this URL. Here is what I have tried so far:

Document doc = null;
try {
    doc = Jsoup.connect(url).get();
} catch (IOException e) {
    e.printStackTrace();
}

Element table = doc.select("center").get(1);
Elements rows = table.select("table[width=970]");
for (int i = 0; i < rows.size(); i++) {
    Element row = rows.get(1);
    Elements cols = row.select("table[width=634]");
    for (int j = 0; j < cols.size(); j++) {
        Element row1 = rows.get(1);
        Elements cols1 = row1.select("table[width=600]");
        for (int k = 0; k < cols1.size(); k++){
            Element row0 = rows.first();
            Elements cols0 = row0.select("td");
            for (Element image : cols0) {
                String image2 = image.absUrl("src").toString();
                Log.i("tanja7 ", "pic  " + image2);
            }
        }
    }
}

This is the unstructured HTML page (I don't know how to copy the HTML code). What am I doing wrong?

It seems that you expect a jsoup select call to return the inner elements of whatever it matches. That is not right: select returns the elements that themselves match the selector within the "search scope", which is the Element, Elements, or Document instance on which you call select. So, if you want all table elements of the document, you call doc.select("table"). This gives you the tables, not the rows. Maybe you understood this already, but your variable naming suggests otherwise.
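To make the scope rule concrete, here is a minimal sketch. The markup is a made-up stand-in (two nested tables), not the page from the question, and the class and method names are only illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectScopeDemo {
    // Hypothetical markup used only for illustration.
    static final String HTML =
        "<table id='outer'><tr><td>"
      + "<table id='inner'><tr><td>cell</td></tr></table>"
      + "</td></tr></table>";

    // Counts the elements matching css inside the given scope.
    static int count(Element scope, String css) {
        return scope.select(css).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // Scope is the whole document: both tables match, and the
        // results are the table elements themselves, not their rows.
        Elements tables = doc.select("table");
        System.out.println(tables.size());            // 2
        System.out.println(tables.first().tagName()); // table

        // Scope is the inner table: only its own cell matches.
        Element inner = doc.getElementById("inner");
        System.out.println(count(inner, "td"));       // 1
    }
}
```

Selecting "td" on the whole document would return both cells, which is why the element you call select on matters.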

Anyway, here is a selector that works. It gets all img elements that are descendants (not necessarily direct children) of a table that has the attribute width=600 and is itself inside another table of the document.

Elements imgEls = doc.select("table table[width=600] img");
System.out.println(imgEls.first().absUrl("src"));

You say the HTML is not structured, so you might want to check whether the relevant images really are always inside two nested tables as specified.
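One way to guard against pages where that nesting is missing is to check the result before using it. The following is only a sketch: the helper name findMainImage, the sample markup, and the example.com base URL are assumptions made for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MainImageFinder {
    // Returns the absolute URL of the first image inside a nested
    // table[width=600], or null when the page has no such image.
    static String findMainImage(Document doc) {
        // Elements.first() returns null when nothing matched.
        Element img = doc.select("table table[width=600] img[src]").first();
        return img == null ? null : img.absUrl("src");
    }

    public static void main(String[] args) {
        // Hypothetical page that has the expected nesting.
        String html = "<table><tr><td><table width='600'><tr><td>"
            + "<img src='pics/main.jpg'></td></tr></table></td></tr></table>";
        Document doc = Jsoup.parse(html, "http://example.com/");
        System.out.println(findMainImage(doc)); // http://example.com/pics/main.jpg

        // Hypothetical page without any tables: no match, so null.
        Document other = Jsoup.parse("<p>no tables here</p>", "http://example.com/");
        System.out.println(findMainImage(other)); // null
    }
}
```

A null result then signals that the selector needs adjusting for that page, instead of failing with a NullPointerException.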

Update: if you are using a mobile device, make sure to set a user agent:

doc = Jsoup.connect(url).userAgent("Mozilla").get();
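As a rough sketch, the user agent can be set in one place and the IOException left to the caller, so a failed fetch never leaves doc as null. The class name PageFetcher and the 10-second timeout are assumptions, not part of the original answer.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PageFetcher {
    // Builds the connection without sending anything yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla") // same value as in the line above
                .timeout(10_000);     // assumed timeout, in milliseconds
    }

    // Performs the request; the caller decides how to handle a failure.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }
}
```

On Android, fetch would still have to run off the main thread, as with any network call.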
