How to get the last 5 articles from a website using Jsoup

Question

I'm currently working on a Java desktop app for a company, and they have asked me to extract the last 5 articles from a web page and display them in the app. To do this I need an HTML parser, of course, and I immediately thought of Jsoup. But my problem is: how exactly do I do it? I found one easy example in this question: How to "scan" a website (or page) for info, and bring it into my program?

It uses this code:

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

This code was written by BalusC and I understand it, but how do I do it when the links are not fixed, which is the case for most newspapers, for example? For the sake of simplicity, how would I go about extracting, say, the last 5 articles from this news page: News? I can't use an RSS feed, as my boss wants the complete articles to be displayed.

Answer 1

First you need to download the main page:

    Document doc = Jsoup.connect("https://globalnews.ca/world/").get();

Then you select the links you are interested in, for example with CSS selectors. Here you select all a tags whose href contains the text globalnews and that are direct children of an h3 tag with class story-h. The URLs are in the href attribute of each a tag.

    // Print the href of every matching link
    for (Element e : doc.select("h3.story-h > a[href*=globalnews]")) {
        System.out.println(e.attr("href"));
    }
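To make the selector concrete, here is a minimal sketch that runs it against a small inline HTML snippet instead of the live site. The markup, the class name LinkSelector, and the firstLinks helper are assumptions made up for illustration; the real page's structure may differ or may have changed since this answer was written.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkSelector {

    // Returns up to `limit` distinct absolute URLs matched by the answer's selector.
    static List<String> firstLinks(Document doc, int limit) {
        List<String> urls = new ArrayList<>();
        for (Element a : doc.select("h3.story-h > a[href*=globalnews]")) {
            if (urls.size() >= limit) {
                break;
            }
            // absUrl resolves the href against the document's base URI
            String url = a.absUrl("href");
            if (!url.isEmpty() && !urls.contains(url)) {
                urls.add(url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup, invented for this example only.
        String html = "<h3 class=\"story-h\"><a href=\"https://globalnews.ca/news/1/a/\">A</a></h3>"
                + "<h3 class=\"story-h\"><a href=\"https://example.com/ad\">Ad</a></h3>"
                + "<h2><a href=\"https://globalnews.ca/news/2/b/\">B</a></h2>";
        Document doc = Jsoup.parse(html, "https://globalnews.ca/world/");
        System.out.println(firstLinks(doc, 5));
    }
}
```

Only the first link should match here: the second one does not contain globalnews in its href, and the third is not inside an h3 with class story-h.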

You can then process the resulting URLs as you wish. For example, you can download the content of the first five of them using the same syntax as in the first line.
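That last step could look roughly like the sketch below, which downloads the first five linked pages and prints their title and body text. The "article p" selector for the body, the class name ArticleFetcher, and the 10-second timeout are assumptions, not something taken from the site; inspect the actual article pages to find the selector that matches their markup.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ArticleFetcher {

    // Assumed selector for the article body; adjust after inspecting the real pages.
    static final String BODY_SELECTOR = "article p";

    // Joins the text of every non-empty paragraph matched by BODY_SELECTOR, one per line.
    static String extractText(Document article) {
        StringBuilder text = new StringBuilder();
        for (Element p : article.select(BODY_SELECTOR)) {
            String paragraph = p.text();
            if (paragraph.isEmpty()) {
                continue;
            }
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(paragraph);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Download the index page and select the article links, as in the answer.
        Document index = Jsoup.connect("https://globalnews.ca/world/").get();
        Elements links = index.select("h3.story-h > a[href*=globalnews]");

        // Fetch at most five articles, in page order.
        int count = Math.min(5, links.size());
        for (int i = 0; i < count; i++) {
            String url = links.get(i).absUrl("href");
            Document article = Jsoup.connect(url).timeout(10_000).get();
            System.out.println(article.title());
            System.out.println(extractText(article));
            System.out.println();
        }
    }
}
```

Keeping extractText separate from the network calls means it can be tried on a saved or hand-written HTML string via Jsoup.parse before pointing it at the live site.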

Answer 1: 0, accepted, 2018-03-29 08:07:59
