JSoup core web text extraction

I am new to JSoup, so sorry if my question is too trivial. I am trying to extract article text from http://www.nytimes.com/, but when I print the parsed document I cannot see any articles in the output.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class App
{

    public static void main( String[] args )
    {
        String url = "http://www.nytimes.com/";
        Document document;
        try {
            // Fetch the page and parse it into a DOM
            document = Jsoup.connect(url).get();

            System.out.println(document.html()); // Articles not getting printed
            //System.out.println(document.toString()); // Same here
            String title = document.title();
            System.out.println("title : " + title); // Title is fine

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

OK, I have also tried parsing "http://en.wikipedia.org/wiki/Big_data" to retrieve the wiki data, and I see the same issue there: the wiki data is missing from the output. Any help or hint would be much appreciated.

Thanks.

Here's how to get the text of all <p class="summary"> elements:

final String url = "http://www.nytimes.com/";
Document doc = Jsoup.connect(url).get();

for( Element element : doc.select("p.summary") )
{
    if( element.hasText() ) // Skip those tags without text
    {
        System.out.println(element.text());
    }
}

If you need all <p> tags, without any filtering, you can use doc.select("p") instead. But in most cases it's better to select only the elements you need (see the Jsoup Selector documentation).
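To see the difference between the two selectors without depending on the live site, here is a minimal sketch that parses an inline HTML string with Jsoup.parse. The class name SelectorDemo, the helper textOf, and the sample markup are made up for illustration; the real nytimes.com markup may look different and can change at any time.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {

    // Hypothetical sample markup, NOT the real page structure
    static final String SAMPLE_HTML =
          "<html><body>"
        + "<p class=\"summary\">First summary.</p>"
        + "<p>Plain paragraph.</p>"
        + "<p class=\"summary\"></p>"
        + "<p class=\"summary\">Second summary.</p>"
        + "</body></html>";

    // Collect the text of every element matching cssQuery, skipping empty ones
    static List<String> textOf(Document doc, String cssQuery) {
        List<String> result = new ArrayList<>();
        for (Element element : doc.select(cssQuery)) {
            if (element.hasText()) { // Skip those tags without text
                result.add(element.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);

        // Only <p class="summary"> elements that contain text
        System.out.println(textOf(doc, "p.summary"));

        // Every <p> element that contains text
        System.out.println(textOf(doc, "p"));
    }
}
```

With this sample, "p.summary" yields the two summaries, while "p" also includes the plain paragraph. The empty summary tag is skipped in both cases by the hasText() check.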
