
JSoup core web text extraction

I am new to JSoup; sorry if my question is too trivial. I am trying to extract article text from http://www.nytimes.com/, but when I print the parsed document I cannot see any of the articles in the output:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class App
{
    public static void main( String[] args )
    {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            document = Jsoup.connect(url).get();

            System.out.println(document.html()); // Articles not getting printed
            //System.out.println(document.toString()); // Same here
            String title = document.title();
            System.out.println("title : " + title); // Title is fine

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

OK, I have also tried to parse http://en.wikipedia.org/wiki/Big_data to retrieve the wiki content, and I hit the same issue: the article text does not appear in the output. Any help or hint will be much appreciated.

Thanks.
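One likely explanation, worth checking in your case: Jsoup is a plain HTML parser and does not execute JavaScript, so any content a page injects at load time with a script is simply not there in the document Jsoup sees. A minimal illustration, using an inline HTML string (the markup here is invented for the demo, not taken from nytimes.com):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticHtmlDemo {
    public static void main(String[] args) {
        // A page whose visible article is inserted by JavaScript:
        String html = "<html><body>"
                + "<div id=\"articles\"></div>"
                + "<script>document.getElementById('articles')"
                + ".innerHTML = '<p>Breaking news</p>';</script>"
                + "</body></html>";

        Document doc = Jsoup.parse(html);

        // Jsoup parses the markup but never runs the script,
        // so the div is still empty:
        System.out.println("div text: '" + doc.select("#articles").text() + "'");
    }
}
```

A browser would show "Breaking news" here; Jsoup reports an empty div. If the article text you are after is built this way, you would need either the site's static/API endpoints or a tool that renders JavaScript.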

Here's how to get all <p class="summary"> text:

final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();

for( Element element : doc.select("p.summary") )
{
    if( element.hasText() ) // Skip those tags without text
    {
        System.out.println(element.text());
    }
}

If you need all <p> tags without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only the elements you need (see the Jsoup Selector documentation).
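To make the difference between the selectors concrete, here is a small self-contained sketch run against an inline HTML snippet (the markup is made up for illustration, not taken from nytimes.com):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorDemo {
    public static void main(String[] args) {
        String html = "<div>"
                + "<p class=\"summary\">First story</p>"
                + "<p>Plain paragraph</p>"
                + "<a href=\"/story\">Read more</a>"
                + "</div>";
        Document doc = Jsoup.parse(html);

        // "p" matches every paragraph; "p.summary" only the classed one:
        System.out.println(doc.select("p").size());             // 2
        System.out.println(doc.select("p.summary").text());     // First story

        // Attribute selectors work too, e.g. links that have an href:
        System.out.println(doc.select("a[href]").attr("href")); // /story
    }
}
```

The same selector strings work identically on a document fetched with Jsoup.connect(url).get().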

