JSoup core web text extraction

Question

I am new to JSoup, so sorry if my question is too trivial. I am trying to extract article text from http://www.nytimes.com/, but when I print the parsed document I cannot see any articles in the output.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class App
{
    public static void main( String[] args )
    {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            document = Jsoup.connect(url).get();

            System.out.println(document.html()); // Articles not getting printed
            //System.out.println(document.toString()); // Same here
            String title = document.title();
            System.out.println("title : " + title); // Title is fine
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

I have also tried to parse http://en.wikipedia.org/wiki/Big_data to retrieve the wiki data, and the same issue occurs: the wiki data does not appear in the output. Any help or hint will be much appreciated.

Thanks.

Answer 1

Here's how to get the text of all <p class="summary"> elements:

final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();

for( Element element : doc.select("p.summary") )
{
    if( element.hasText() ) // Skip those tags without text
    {
        System.out.println(element.text());
    }
}

If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only the elements you need (see the Jsoup Selector documentation).
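To see the difference between the two selectors without depending on a live page, here is a minimal sketch that runs them against an inline HTML string. The markup, the class name SelectorDemo and the helper textOf are made up for illustration; they are not taken from nytimes.com or from the answer above.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorDemo {

    // Hypothetical helper: collects the text of every non-empty element
    // that matches the given CSS query.
    public static List<String> textOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            if (element.hasText()) { // Skip tags without text
                result.add(element.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup, assumed only for this example.
        String html = "<div><p class=\"summary\">First summary</p>"
                + "<p>Plain paragraph</p>"
                + "<p class=\"summary\"></p></div>";

        System.out.println(textOf(html, "p.summary")); // [First summary]
        System.out.println(textOf(html, "p"));         // [First summary, Plain paragraph]
    }
}
```

The narrower query p.summary returns only the paragraphs carrying that class, while p returns every paragraph; in both cases the empty element is dropped by the hasText() check.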

solution1
0 ACCEPTED 2013-06-21 13:34:15
