
How to get the last 5 articles from a website with Jsoup

I'm currently working on a Java desktop app for a company, and they have asked me to extract the five latest articles from a web page and display them in the app. To do this I need an HTML parser, and I immediately thought of Jsoup. My problem is how exactly to do it. I found one simple example in this question: How to "scan" a website (or page) for info, and bring it into my program?

It uses this code:

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

This code was written by BalusC, and I understand it, but how do I do it when the links are not fixed, which is the case for most newspapers, for example? For the sake of simplicity, how would I extract the five latest articles from this news page: News? I can't use an RSS feed, because my boss wants the complete articles to be displayed.

First you need to download the main page:

    Document doc = Jsoup.connect("https://globalnews.ca/world/").get();

Then select the links you are interested in, for example with CSS selectors. The selector below matches every a tag whose href contains the text globalnews and that is a direct child of an h3 tag with class story-h. The URLs are in the href attribute of each a tag.

    for(Element e: doc.select("h3.story-h > a[href*=globalnews]")) {
        System.out.println(e.attr("href"));
    }

You can then process the resulting URLs as you wish. For example, you can download the content of the first five of them with the same Jsoup.connect(url).get() call used in the first snippet.
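Putting the steps together, a minimal sketch might look like the following. The class name LatestArticles, the helper latestLinks, and the "article p" selector for the article body are assumptions made for illustration. The link selector is the one from the snippet above and depends on the site's markup at the time of writing, so both selectors should be checked against the current pages.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class LatestArticles {

    // Selector from the answer above; the site's markup may differ today.
    static final String LINK_SELECTOR = "h3.story-h > a[href*=globalnews]";

    /** Collects up to 'limit' distinct article URLs, in page order. */
    static List<String> latestLinks(Document page, int limit) {
        Set<String> urls = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (Element link : page.select(LINK_SELECTOR)) {
            if (urls.size() >= limit) {
                break;
            }
            // absUrl resolves the href against the page's base URI
            // and returns "" if it cannot be resolved.
            String url = link.absUrl("href");
            if (!url.isEmpty()) {
                urls.add(url);
            }
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) throws IOException {
        Document page = Jsoup.connect("https://globalnews.ca/world/").get();

        for (String url : latestLinks(page, 5)) {
            Document article = Jsoup.connect(url).get();
            System.out.println(article.title());
            // Assumption: the body text sits in <p> tags inside an <article>
            // element. Inspect the real page and adjust this selector.
            System.out.println(article.select("article p").text());
            System.out.println();
        }
    }
}
```

Keeping the link extraction in its own method means it can be tried on a saved copy of the page with Jsoup.parse(html, baseUri), without hitting the site on every run.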

