How to extract text from all the elements in a webpage individually, using JSoup?

The problem is that if I do this:

Document doc = Jsoup.connect(url)
                        .timeout(30000)
                        .userAgent("Mozilla")
                        .followRedirects(true)
                        .get();
System.out.println(doc.select("body").text());

I get all the text in one chunk, which is not what I want.

Suppose I write code like this:

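// Nested divs; built by concatenation because a plain Java string literal cannot span lines
String part = "<div>"
            + " Primary div "
            + "<div>"
            + " Secondary div "
            + "</div>"
            + "</div>";
Document doc = Jsoup.parse(part);
Elements divs = doc.select("div");
for (Element e : divs) {
    // text() returns the element's text combined with the text of all its descendants
    System.out.println(e.text());
}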

The output is:

Primary div Secondary div
Secondary div

The inner div's text gets scraped twice.

I want the scraping output to look like this:

Primary div
Secondary div

I want each element's text to appear only once, excluding the text of its child elements.

How can this be achieved? The number of nested children can be more than one.

You aren't getting two copies of Secondary div; you're outputting it twice: once as part of the output of Primary div, then again on its own.

If you want just an element's own text and not the text of its child elements, use Element#ownText.
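As a rough sketch of how that might look for the nested divs in the question, assuming jsoup is on the classpath: the class name OwnTextDemo and the helper ownTexts are made up for illustration, and skipping elements whose own text is empty is a choice made here, not something the question requires.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {

    // Hypothetical helper: collects the own text of every element in the body,
    // in document order, leaving out elements that have no text of their own.
    static List<String> ownTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element e : doc.body().select("*")) {
            // ownText() covers only the text nodes directly under e, not its descendants
            String own = e.ownText();
            if (!own.isEmpty()) {
                result.add(own);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String part = "<div> Primary div <div> Secondary div </div> </div>";
        for (String s : ownTexts(part)) {
            System.out.println(s);
        }
        // prints:
        // Primary div
        // Secondary div
    }
}
```

Because each element contributes only its direct text, this works the same way at any nesting depth. For a fetched page, the Document returned by Jsoup.connect(url).get() could be walked in the same loop instead of a parsed string.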
