
JSoup: extract the text from table td elements that contain no HTML nodes

I have an HTML string like this:

String html = "<table><tbody>"
        + "<tr>"
        + "<td><p>ABC</p></td>"
        + "<td>DEF</td>"
        + "</tr>"
        + "<tr>"
        + "<td><p>GHI</p></td>"
        + "<td>MNO</td>"
        + "</tr>"
        + "</tbody>"
        + "</table>";

I only need to extract the text of td tags that have no child elements inside them. My current code returns both the text and the HTML nodes.

Elements elements = doc.select("tbody > tr");
for (Element e : elements) {
    System.out.println(e.select("td").html());
}

But the output I need is:

DEF
MNO

Thanks in advance.

It is not clear to me whether you want just the text of each td that is not part of a child of that td, or whether you also want to exclude all tds that have children. You may therefore have to adapt my solution a bit.

String html="<table><tbody>"
        +"<tr>"
        +"<td><p>ABC</p></td>"
        +"<td>DEF</td>"
        +"<td>DEF2<p>ABC</p></td>"
        +"</tr>"
        +"<tr>"
        +"<td><p>GHI</p></td>"
        +"<td>MNO</td>"
        +"<td>MNO2<p>GHI2</p></td>"
        +"</tr>"
        +"</tbody>"
        +"</table>";

Document doc = Jsoup.parse(html);
Elements elements = doc.select("tbody > tr > td:matchesOwn(.+)");
for (Element e : elements) {
    System.out.println(e.text());
}

The above solution looks for td elements that have any text of their own, i.e. whose own text matches the regular expression .+ (at least one character).
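Note that text() in the loop above returns the combined text of the td and its descendants, so for a cell such as <td>DEF2<p>ABC</p></td> the paragraph text is included as well. If only the cell's own text is wanted, Element.ownText() may be a closer fit. A minimal sketch, assuming jsoup is on the classpath (the class name OwnTextExample and the helper ownTexts are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class OwnTextExample {

    // Hypothetical helper: collects the own text of every td that has
    // some text of its own, skipping text that belongs to child elements.
    static List<String> ownTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element td : doc.select("tbody > tr > td:matchesOwn(.+)")) {
            result.add(td.ownText());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<table><tbody>"
                + "<tr>"
                + "<td><p>ABC</p></td>"
                + "<td>DEF</td>"
                + "<td>DEF2<p>ABC</p></td>"
                + "</tr>"
                + "</tbody></table>";
        // Prints DEF, then DEF2; the <p> text is left out.
        for (String s : ownTexts(html)) {
            System.out.println(s);
        }
    }
}
```

The selector is unchanged; only the call that reads the text differs, so cells with mixed content are still reported, just without their children's text.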

If you want to further weed out the tds that contain children, you can do this:

Document doc = Jsoup.parse(html);
Elements elements = doc.select("tbody > tr > td:matchesOwn(.+):not(:has(*))");
for (Element e : elements) {
    System.out.println(e.text());
}

This uses both the :has() and the :not() pseudo-selectors, as explained in the jsoup docs.

Try this CSS selector:

tbody > tr > td:not(:has(*))

DEMO

http://try.jsoup.org/~K4qiK0SxQDeuhE9FvvmUDa3vKKI

DESCRIPTION

tbody  /* Select any tbody */
> tr   /* Select any tr directly under it */
> td   /* Select any td directly under it ... */
:not(:has(*)) /* ... not having any element */

The * selector matches only elements. A text node is not an element; it is just another kind of Node.
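The distinction can be checked directly by comparing a td's child nodes with its child elements. A minimal sketch, assuming a jsoup version that provides selectFirst and input that contains at least one td (the class and method names are made up for illustration):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class NodeVsElement {

    // Counts every child node of the first td, text nodes included.
    static int childNodeCount(String html) {
        Element td = Jsoup.parse(html).selectFirst("td");
        return td.childNodeSize();
    }

    // Counts only the element children of the first td.
    static int childElementCount(String html) {
        Element td = Jsoup.parse(html).selectFirst("td");
        return td.children().size();
    }

    public static void main(String[] args) {
        String textOnly = "<table><tbody><tr><td>DEF</td></tr></tbody></table>";
        String withChild = "<table><tbody><tr><td><p>ABC</p></td></tr></tbody></table>";

        // The text-only cell has one child node (the text) but no child elements,
        // which is why td:not(:has(*)) keeps it.
        System.out.println(childNodeCount(textOnly) + " " + childElementCount(textOnly));

        // The cell wrapping a <p> has one child element, so :has(*) matches it.
        System.out.println(childNodeCount(withChild) + " " + childElementCount(withChild));
    }
}
```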

SAMPLE CODE

Elements elements = doc.select("tbody > tr > td:not(:has(*))");
for (Element e : elements) {
    // e is already a td, so read its text directly
    System.out.println(e.text());
}

OUTPUT

DEF
MNO

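Try Element.child(int index) with index = 0. Note that child(0) throws an IndexOutOfBoundsException when the td has no child elements, so check for children first:

Elements elements = doc.select("tbody > tr");
for (Element e : elements) {
    for (Element el : e.select("td")) {
        // Only a td with child elements can safely be passed to el.child(0);
        // a td without any is one of the cells we want.
        if (el.children().isEmpty()) {
            System.out.println(el.text());
        }
    }
}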
