
How to get the last 5 articles from a website with Jsoup

I'm currently working on a Java desktop app for a company, and they asked me to extract the 5 last articles from a web page and display them in the app. To do this I need an HTML parser, of course, and I immediately thought of Jsoup. But my problem is: how exactly do I do it? I found one simple example in this question: Example: How to “scan” a website (or page) for info, and bring it into my program?

with this code:

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

This code was written by BalusC and I understand it, but how do I do it when the links are not fixed, which is the case on most newspaper sites? For the sake of simplicity, how would I extract, for example, the 5 last articles from this news page: News? I can't use an RSS feed because my boss wants the complete articles to be displayed.

First you need to download the main page:

    Document doc = Jsoup.connect("https://globalnews.ca/world/").get();

Then you select the links you are interested in, for example with CSS selectors. Here you select all a tags whose href contains the text globalnews and that are nested in an h3 tag with class story-h. The URLs are in the href attribute of each a tag.

    for(Element e: doc.select("h3.story-h > a[href*=globalnews]")) {
        System.out.println(e.attr("href"));
    }

You can then process the resulting URLs as you wish, for example downloading the content of the first five of them using the same Jsoup.connect(url).get() syntax as in the first line.
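Putting the two steps together, a minimal sketch might look like the following. Note that the article-body selector `.l-article__text` is only a guess at the page's markup (it is not given in the answer above), so you should inspect the actual article pages and adjust it:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class LastFiveArticles {

    public static void main(String[] args) throws Exception {
        // Step 1: download the listing page.
        Document doc = Jsoup.connect("https://globalnews.ca/world/").get();

        // Step 2: select the article links and keep only the first five.
        // Elements extends ArrayList<Element>, so subList() works directly.
        Elements links = doc.select("h3.story-h > a[href*=globalnews]");
        for (Element link : links.subList(0, Math.min(5, links.size()))) {
            // "abs:href" resolves the URL against the page's base URL,
            // in case the site ever uses relative links.
            String articleUrl = link.attr("abs:href");

            // Step 3: download each article and extract its text.
            // ".l-article__text" is an assumed selector for the article
            // body; inspect the page and adjust as needed.
            Document article = Jsoup.connect(articleUrl).get();
            System.out.println(link.text());
            System.out.println(article.select(".l-article__text").text());
        }
    }
}
```

This fetches six pages in total (the listing plus five articles), so in a desktop app you would want to run it off the UI thread.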
