JSoup: extract text from table td elements that contain no HTML nodes

Question

I have an HTML string like this:

String html = "<table><tbody>"
        + "<tr>"
        + "<td><p>ABC</p></td>"
        + "<td>DEF</td>"
        + "</tr>"
        + "<tr>"
        + "<td><p>GHI</p></td>"
        + "<td>MNO</td>"
        + "</tr>"
        + "</tbody>"
        + "</table>";

I only need to extract the text of td tags that have no child elements inside them. My current code returns both text and HTML nodes.

Elements elements = doc.select("tbody > tr");
for (Element e : elements) {
    System.out.println(e.select("td").html());
}

But the output I need is:

DEF
MNO

Thanks in advance.

Answer 1

It is not clear to me whether you want just the text of each td that is not part of a child of that td, or whether you want to further exclude all tds that have children. Therefore you may have to adapt my solution a bit.

String html="<table><tbody>"
        +"<tr>"
        +"<td><p>ABC</p></td>"
        +"<td>DEF</td>"
        +"<td>DEF2<p>ABC</p></td>"
        +"</tr>"
        +"<tr>"
        +"<td><p>GHI</p></td>"
        +"<td>MNO</td>"
        +"<td>MNO2<p>GHI2</p></td>"
        +"</tr>"
        +"</tbody>"
        +"</table>";

Document doc = Jsoup.parse(html);
Elements elements = doc.select("tbody > tr > td:matchesOwn(.+)");
for (Element e : elements) {
    System.out.println(e.text());
}

The above solution looks for td elements that have any own text, i.e. that match the regular expression .+ (at least one character).
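To see what the pattern .+ accepts, here is a small sketch in plain Java with no jsoup dependency. It assumes the selector applies the pattern to the element's own text with "find" semantics; the class name OwnTextPattern and the helper hasOwnText are made up for illustration and are not part of jsoup.

```java
import java.util.regex.Pattern;

public class OwnTextPattern {
    // Same pattern as in td:matchesOwn(.+): one or more characters.
    private static final Pattern AT_LEAST_ONE_CHAR = Pattern.compile(".+");

    // Hypothetical helper: true when the given own text would satisfy the pattern.
    static boolean hasOwnText(String ownText) {
        return AT_LEAST_ONE_CHAR.matcher(ownText).find();
    }

    public static void main(String[] args) {
        System.out.println(hasOwnText("DEF"));  // own text of <td>DEF</td>
        System.out.println(hasOwnText("DEF2")); // own text of <td>DEF2<p>ABC</p></td>
        System.out.println(hasOwnText(""));     // own text of <td><p>ABC</p></td>
    }
}
```

The first two calls print true and the last prints false, which is why a td whose only content is a nested p is skipped while a td with mixed content is still matched.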

If you want to further weed out the tds that contain children, you can do this:

Document doc = Jsoup.parse(html);
Elements elements = doc.select("tbody > tr > td:matchesOwn(.+):not(:has(*))");
for (Element e : elements) {
    System.out.println(e.text());
}

This uses both the :has() and the :not() pseudo selectors, as explained in the jsoup docs.
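To compare the two selectors side by side, a sketch like the following could be used. It assumes jsoup is on the classpath; the class name SelectorComparison and the helper ownTexts are illustrative only. It collects ownText() rather than text(), so that for a td with mixed content only the text directly inside the td is reported.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorComparison {
    // Same sample markup as above, condensed to one row per line.
    static final String HTML = "<table><tbody>"
            + "<tr><td><p>ABC</p></td><td>DEF</td><td>DEF2<p>ABC</p></td></tr>"
            + "<tr><td><p>GHI</p></td><td>MNO</td><td>MNO2<p>GHI2</p></td></tr>"
            + "</tbody></table>";

    // Returns the own text (text directly inside the td, excluding child
    // elements) of every td matched by the given selector.
    static List<String> ownTexts(String selector) {
        Document doc = Jsoup.parse(HTML);
        List<String> result = new ArrayList<>();
        for (Element td : doc.select(selector)) {
            result.add(td.ownText());
        }
        return result;
    }

    public static void main(String[] args) {
        // tds that have some text of their own, with or without children
        System.out.println(ownTexts("tbody > tr > td:matchesOwn(.+)"));
        // tds that have text of their own and no child elements at all
        System.out.println(ownTexts("tbody > tr > td:matchesOwn(.+):not(:has(*))"));
    }
}
```

With the sample markup, the first call should list DEF, DEF2, MNO and MNO2, and the second only DEF and MNO.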

Answer 2

Try this CSS selector:

tbody > tr > td:not(:has(*))

DEMO

http://try.jsoup.org/~K4qiK0SxQDeuhE9FvvmUDa3vKKI

DESCRIPTION

tbody  /* Select any tbody */
> tr   /* Select any tr directly under it */
> td   /* Select any td directly under it ... */
:not(:has(*)) /* ... not having any element */

The * selector matches only elements. A text node is not an element; it is just a kind of Node.
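The distinction between nodes and elements can be checked directly. The following is a minimal sketch, assuming jsoup (1.11.1 or later, for selectFirst) is on the classpath; the class name NodeVsElement and the helper firstCell are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class NodeVsElement {
    // Wraps the given cell markup in a one-row table and returns its td element.
    static Element firstCell(String cellHtml) {
        return Jsoup.parse("<table><tbody><tr>" + cellHtml + "</tr></tbody></table>")
                .selectFirst("td");
    }

    public static void main(String[] args) {
        Element plain = firstCell("<td>DEF</td>");
        Element nested = firstCell("<td><p>ABC</p></td>");

        // childNodeSize() counts every child node, including text nodes;
        // children() returns child elements only.
        System.out.println(plain.childNodeSize() + " node(s), "
                + plain.children().size() + " element(s)");
        System.out.println(nested.childNodeSize() + " node(s), "
                + nested.children().size() + " element(s)");
    }
}
```

The plain cell has one child node (the text) but no child elements, so :has(*) finds nothing in it and :not(:has(*)) keeps it.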

SAMPLE CODE

Elements elements = doc.select("tbody > tr > td:not(:has(*))");
for (Element e : elements) {
    // text() prints only the cell's text, without the surrounding td tags
    System.out.println(e.text());
}

OUTPUT

DEF
MNO

Answer 3

Try Element.child(int index) with index = 0.

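Elements elements = doc.select("tbody > tr");
for (Element e : elements) {
    for (Element el : e.select("td")) {
        // el.child(0) returns the first child element of the td.
        // Check el.children().isEmpty() first: child(0) throws if the td has no child elements.
    }
}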
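A fuller version of this loop might look like the sketch below. It assumes jsoup is on the classpath and swaps the child(0) call for a children().isEmpty() check, since the goal is to keep the cells without child elements; the class name LeafCells and the method leafCellTexts are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LeafCells {
    // Returns the text of every td that has no child elements.
    static List<String> leafCellTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element row : doc.select("tbody > tr")) {
            for (Element td : row.select("td")) {
                // children() holds child elements only, so a td that
                // contains nothing but text has an empty children() list.
                if (td.children().isEmpty()) {
                    texts.add(td.text());
                }
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<table><tbody>"
                + "<tr><td><p>ABC</p></td><td>DEF</td></tr>"
                + "<tr><td><p>GHI</p></td><td>MNO</td></tr>"
                + "</tbody></table>";
        for (String text : leafCellTexts(html)) {
            System.out.println(text);
        }
    }
}
```

For the markup in the question this should print DEF and MNO on separate lines.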

3 answers

Answer 1: score 2, 2016-01-20 13:07:37
Answer 2: score 2, accepted, 2016-01-21 09:35:26
Answer 3: score 0, 2016-01-20 09:53:50
