Retrieving the text of an element in jsoup

While using jsoup to parse some HTML pages such as "google.com", I ran into a problem retrieving the text of an element.

For example, when I call the text function on this div element, the words "Programs" and "Business" come out joined together, which I don't think is right:

<div id="fll" style="margin:19px auto;text-align:center">
   <a href="/intl/en/ads/">Advertising&nbsp;Programs</a>
   <a href="/services/">Business Solutions</a>
   <a href="https://plus.google.com/" rel="publisher">+Google</a>
   <a href="/intl/en/about.html">About Google</a>
</div>

You can reproduce this with the following code:

URL url = new URL("http://www.google.com");
Document document = Jsoup.parse(url, 10000);
Element element = document.select("div[id=fll]").first();
System.out.println(element.text());

The output will be:

Advertising ProgramsBusiness Solutions+GoogleAbout Google

Is there anything that can be done about this?

By the way, I traced the code and found that the problem can be corrected by adding this line:

textNode.text(textNode.text() + " ");

between lines 755 and 756 of the Element class in the nodes package of the jsoup source code.

The same problem also exists in the Elements class of the select package, and probably in other text functions!

The text() method in jsoup returns only the text in an element. In your example, the element is a div. When you call text() on it, all of the tags are essentially removed and only the text remains. Since there is no space after "Programs", it runs straight into "Business", which in this case is the correct behavior.

If you want the text of each link separately, you can do something like this (untested code):

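// "div" is the Element for <div id="fll">, e.g. document.select("div[id=fll]").first()
for (Element a : div.select("a")) {
    // Print the text of each link on its own line
    System.out.println(a.text());
}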
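
If you would rather have a single line with the pieces separated by spaces, one option is to collect each link's text and join the pieces yourself. The sketch below uses only the JDK so that it runs on its own. LinkTextJoiner and joinTexts are hypothetical names, not part of jsoup, and the hard-coded strings stand in for the values that a.text() would return inside the loop above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkTextJoiner {

    // Hypothetical helper (not part of jsoup): joins already-extracted link
    // texts with a separator, skipping entries that are empty after trimming.
    static String joinTexts(List<String> texts, String separator) {
        List<String> kept = new ArrayList<>();
        for (String text : texts) {
            String trimmed = text.trim();
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return String.join(separator, kept);
    }

    public static void main(String[] args) {
        // Assumption: stand-ins for what a.text() returns for each <a> in the div.
        // With jsoup, you would fill this list inside the loop: texts.add(a.text()).
        List<String> linkTexts = Arrays.asList(
                "Advertising Programs", "Business Solutions", "+Google", "About Google");

        // Prints: Advertising Programs Business Solutions +Google About Google
        System.out.println(joinTexts(linkTexts, " "));
    }
}
```

The separator is a parameter, so the same helper could produce one text per line ("\n") or a comma-separated list instead.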
