Getting data in order with Jsoup

I'm trying to get data, in order, from the HTML of a web page. The HTML looks like this:

    <div class="text">
        First Text
        <br>
        <br>
        <div style="margin:20px; margin-top:5px; ">
            <table cellpadding="5">
                <tbody><tr>
                    <td class="alt2">
                        <div>
                            Written by <b>excedent</b>
                        </div>
                        <div style="font-style:italic">quote message</div>
                    </td>
                </tr></tbody>
            </table>
        </div>Second Text<br>
        <br>
        <img class="img" src="https://developer.android.com/_static/images/android/touchicon-180.png"><br>
        <br>
        Third Text
    </div>

What I want to do is build an Android layout by scraping the HTML, but I need to preserve the order of the elements. In this case:

  1. TextView => First Text
  2. TextView => Quote Message
  3. TextView => Second Text
  4. ImageView => img
  5. TextView => Third Text

The problem comes when I try to read the HTML values in order. Using Jsoup, Element.ownText gives me a single String, "First Text Second Text Third Text", and the img only comes at the end, which results in:

  1. TextView => First Text Second Text Third Text
  2. TextView => Quote Message
  3. ImageView => img

What can I do to get that data in order?

Thanks in advance

You can parse the HTML into a list of nodes. The list preserves the DOM order, which gives you what you want.

Check the parseFragment method:

This method will give you a list of nodes.
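The link in the answer above did not survive, so the following is only a minimal sketch of what it likely points at, assuming the `Parser.parseFragment(String, Element, String)` overload in jsoup. The `div` context element, the empty base URI, the class name `FragmentOrder`, and the shortened fragment are all assumptions made for illustration.

```java
import java.util.List;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

public class FragmentOrder {
    /** Parses an HTML fragment and returns its top-level nodes in document order. */
    static List<Node> topLevelNodes(String fragment) {
        // Assumption: parse as if the fragment sat inside a <div>, with no base URI.
        Element context = new Element("div");
        return Parser.parseFragment(fragment, context, "");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical fragment modelled on the question's HTML.
        String fragment = "First Text<br><img class=\"img\" src=\"a.png\">Third Text";
        for (Node node : topLevelNodes(fragment)) {
            if (node instanceof TextNode) {
                System.out.println("text: " + ((TextNode) node).text());
            } else {
                System.out.println("node: " + node.nodeName());
            }
        }
    }
}
```

Because text nodes and elements come back in one list, the caller can map each entry to a view as it is encountered instead of collecting all text first.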

Try this.

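    // Assumes these imports: java.util.List, java.util.stream.Collectors,
    // org.jsoup.Jsoup, org.jsoup.nodes.Document
    String html = ""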
        + "<div class=\"text\">"
        + "    First Text"
        + "    <br>"
        + "    <br>"
        + "    <div style=\"margin:20px; margin-top:5px; \">"
        + "    <table cellpadding=\"5\">"
        + "        <tbody><tr>"
        + "            <td class=\"alt2\">"
        + "                <div>"
        + "                    Written by <b>excedent</b>"
        + "                </div>"
        + "                <div style=\"font-style:italic\">quote message</div>"
        + "            </td>"
        + "            </tr></tbody>"
        + "    </table>"
        + "    </div>Second Text<br>"
        + "        <br>"
        + "        <img class=\"img\" src=\"https://developer.android.com/_static/images/android/touchicon-180.png\"><br>"
        + "        <br>"
        + "        Third Text"
        + "    </div>";
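    Document doc = Jsoup.parse(html);
    // textNodes() returns only the text nodes that are direct children of
    // div.text, in document order; the nested quote and the <img> are skipped.
    List<String> rootTexts = doc.select("div.text").first().textNodes().stream()
        .map(node -> node.text().trim())
        .filter(s -> !s.isEmpty())   // drop whitespace-only nodes between tags
        .collect(Collectors.toList());
    System.out.println(rootTexts);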

OUTPUT:

[First Text, Second Text, Third Text]

This answer is a little late, but the correct way to do this is as follows. For your outermost <div>, instead of getting the child elements with Element.children(), use Element.childNodes().

Element.children() only returns child Elements, which do not include text.

Element.childNodes() returns all child nodes, which includes both TextNodes and Elements.

This solution works for me.
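The Element.childNodes() approach can be sketched roughly as below. This is an illustration rather than code from the thread: the class name `OrderedViews`, the `describe` helper, the string labels standing in for real Android views, and the `td.alt2 div` selector used to find the quote body are all assumptions, and the HTML is a condensed version of the question's markup.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class OrderedViews {
    /** Walks the direct children of root in DOM order and names the view each maps to. */
    static List<String> describe(Element root) {
        List<String> views = new ArrayList<>();
        for (Node node : root.childNodes()) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {          // skip whitespace between tags
                    views.add("TextView => " + text);
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("img")) {
                    views.add("ImageView => " + el.attr("src"));
                } else if (el.tagName().equals("div")) {
                    // Assumption: the quote body is the last div inside td.alt2.
                    Element quote = el.select("td.alt2 div").last();
                    if (quote != null) {
                        views.add("TextView => " + quote.text());
                    }
                }
                // <br> and any other elements are ignored in this sketch.
            }
        }
        return views;
    }

    public static void main(String[] args) {
        String html = "<div class=\"text\">First Text<br><div><table><tbody><tr><td class=\"alt2\">"
            + "<div>Written by <b>excedent</b></div>"
            + "<div style=\"font-style:italic\">quote message</div>"
            + "</td></tr></tbody></table></div>Second Text<br>"
            + "<img class=\"img\" src=\"touchicon-180.png\"><br>Third Text</div>";
        Element root = Jsoup.parse(html).select("div.text").first();
        describe(root).forEach(System.out::println);
    }
}
```

In an Android app, each branch would create the corresponding TextView or ImageView and add it to the parent layout at that point, so the layout order follows the DOM order.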

