
Getting sub links of a URL using jsoup

Consider a URL like www.example.com. It may have plenty of links; some may be internal and others external. I want to get a list of all the direct sub-links only, not the sub-sub-links. E.g. if there are four links as follows:

1) www.example.com/images/main
2) www.example.com/data
3) www.example.com/users
4) www.example.com/admin/data

Then out of the four, only 2 and 3 are of use, because they are direct sub-links and not sub-sub-links and so on. Is there a way to achieve this through jsoup? If this cannot be achieved through jsoup, can someone introduce me to some other Java API? Also note that the result should only contain links under the parent URL that was initially given (i.e. www.example.com).
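For concreteness, here is a minimal sketch of the desired filtering, assuming jsoup for fetching the page and java.net.URI for inspecting paths (the class name DirectSubLinks and all specifics beyond the question are illustrative): keep only same-host links whose path has exactly one segment.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DirectSubLinks {

    public static void main(String[] args) throws Exception {
        String parent = "http://www.example.com";   // the parent URL from the question
        String parentHost = new URI(parent).getHost();

        Document doc = Jsoup.connect(parent).get();
        List<String> subLinks = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            String href = a.attr("abs:href");       // resolve relative hrefs to absolute URLs
            try {
                URI uri = new URI(href);
                String path = uri.getPath() == null ? "" : uri.getPath().replaceAll("^/|/$", "");
                // same host and exactly one path segment -> a direct sub-link
                if (parentHost.equals(uri.getHost()) && !path.isEmpty() && !path.contains("/")) {
                    subLinks.add(href);
                }
            } catch (URISyntaxException ignored) {
                // skip hrefs that are not valid URIs
            }
        }
        System.out.println(subLinks);
    }
}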

If I understand correctly, a direct sub-link contains exactly one slash, so you can attempt this by counting the number of slashes, for example:

import java.util.ArrayList;
import java.util.List;

List<String> list = new ArrayList<>();
list.add("www.example.com/images/main");
list.add("www.example.com/data");
list.add("www.example.com/users");
list.add("www.example.com/admin/data");

for (String link : list) {
    // a direct sub-link contains exactly one slash
    if (link.length() - link.replaceAll("[/]", "").length() == 1) {
        System.out.println(link);
    }
}

link.length(): the total number of characters in the link.
link.replaceAll("[/]", "").length(): the length of the link with all slashes removed, so the difference between the two is the number of slashes.

If the difference equals one, it is a direct sub-link; otherwise it is not.
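Note that counting raw slashes assumes the links carry no scheme: "http://www.example.com/data" already contains three '/' characters, and a trailing slash also skews the count. A sketch of a more robust check (SubLinkCheck and isDirectSubLink are hypothetical names), parsing the path with java.net.URI:

import java.net.URI;

public class SubLinkCheck {

    // Hypothetical helper: true when the link's path has exactly one segment
    static boolean isDirectSubLink(String link) throws Exception {
        // prepend a scheme when missing, so URI puts the host where it belongs
        String normalized = link.contains("://") ? link : "http://" + link;
        String path = new URI(normalized).getPath();         // e.g. "/data" or "/images/main"
        String trimmed = path == null ? "" : path.replaceAll("^/|/$", "");
        return !trimmed.isEmpty() && !trimmed.contains("/"); // exactly one path segment
    }

    public static void main(String[] args) throws Exception {
        String[] links = {
            "www.example.com/images/main",
            "www.example.com/data",
            "www.example.com/users",
            "www.example.com/admin/data"
        };
        for (String link : links) {
            System.out.println(link + " -> " + isDirectSubLink(link));
        }
    }
}

With the four example links this prints true only for /data and /users.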


EDIT

How will I scan the whole website for sub-links?

The answer to this is the robots.txt file, i.e. the Robots Exclusion Standard. This file lists many of a site's top-level paths (the ones it asks crawlers to avoid); see for example https://stackoverflow.com/robots.txt. So the idea is to read this file and extract the sub-links of the website from it. Here is a piece of code that can help you:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) throws Exception {

    // Your web site
    String website = "http://stackoverflow.com";
    // We will read the URL https://stackoverflow.com/robots.txt
    URL url = new URL(website + "/robots.txt");

    // List of your sub-links
    List<String> list = new ArrayList<>();

    // Read the file with BufferedReader
    try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String line;

        // Loop through the file line by line
        while ((line = in.readLine()) != null) {

            // If the line is a one-segment "Disallow" rule, turn it into a link
            if (line.matches("Disallow: \\/\\w+\\/")) {
                list.add(website + "/" + line.replace("Disallow: /", ""));
            }
        }
    }

    // Print your result
    System.out.println(list);
}

This will show you:

[https://stackoverflow.com/posts/, https://stackoverflow.com/posts?, https://stackoverflow.com/search/, https://stackoverflow.com/search?, https://stackoverflow.com/feeds/, https://stackoverflow.com/feeds?, https://stackoverflow.com/unanswered/, https://stackoverflow.com/unanswered?, https://stackoverflow.com/u/, https://stackoverflow.com/messages/, https://stackoverflow.com/ajax/, https://stackoverflow.com/plugins/]

Here is a demo of the regex that I use.
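As a stand-in for that demo, here is a minimal sketch (the sample lines are my own) of which lines the pattern accepts; note that String.matches() only returns true when the regex matches the entire line:

public class RegexDemo {

    public static void main(String[] args) {
        String[] lines = {
            "Disallow: /users/",      // matches: one \w+ segment with a trailing slash
            "Disallow: /posts",       // no trailing slash -> no match
            "Disallow: /admin/data/", // two segments -> no match
            "Allow: /data/"           // wrong prefix -> no match
        };
        for (String line : lines) {
            System.out.println(line + " -> " + line.matches("Disallow: \\/\\w+\\/"));
        }
    }
}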

Hope this can help you.

To scan the links on a web page you can use the jsoup library.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ReadData {

    public static void main(String[] args) {
        try {
            // Fetch the page and collect the absolute URL of every anchor tag
            Document doc = Jsoup.connect("**your_url**").get();
            Elements links = doc.select("a");
            List<String> list = new ArrayList<>();
            for (Element link : links) {
                list.add(link.attr("abs:href"));
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}

The list can then be used as suggested in the previous answer.
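One caveat when doing so (my observation, not part of either answer): abs:href returns absolute URLs, so the scheme's "//" contributes two extra slashes, and a direct sub-link such as http://www.example.com/data then contains three slashes rather than one. A small sketch with hypothetical sample values:

import java.util.ArrayList;
import java.util.List;

public class FilterAbsoluteLinks {

    public static void main(String[] args) {
        // Stand-ins for the absolute URLs collected by the jsoup snippet above
        List<String> list = new ArrayList<>();
        list.add("http://www.example.com/images/main");
        list.add("http://www.example.com/data");
        list.add("http://www.example.com/users");

        for (String link : list) {
            // "http://" itself contains two slashes, so a direct
            // sub-link has three slashes in total, not one
            if (link.length() - link.replace("/", "").length() == 3) {
                System.out.println(link); // prints the /data and /users links
            }
        }
    }
}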


The code for reading all the links on a website is given below. I have used http://stackoverflow.com/ for illustration. I would recommend you to go through the company's terms of use before scraping its website.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ReadAllLinks {

    public static Set<String> uniqueURL = new HashSet<String>();
    public static String mySite;

    public static void main(String[] args) {
        ReadAllLinks obj = new ReadAllLinks();
        mySite = "stackoverflow.com";
        obj.getLinks("http://stackoverflow.com/");
    }

    private void getLinks(String url) {
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            // Visit every absolute link; recurse into links that belong to
            // the same site and have not been seen before
            links.stream().map((link) -> link.attr("abs:href")).forEachOrdered((thisUrl) -> {
                boolean add = uniqueURL.add(thisUrl);
                if (add && thisUrl.contains(mySite)) {
                    System.out.println(thisUrl);
                    getLinks(thisUrl);
                }
            });
        } catch (IOException ex) {
            // ignore pages that fail to load and continue crawling
        }
    }
}

You will get the list of all the links in the uniqueURL field.
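Note that this crawl visits every reachable page on the site once (the uniqueURL set guards against revisits), so on a large site such as stackoverflow.com it can run for a very long time, and very deep recursion can even overflow the stack; in practice it is worth adding a depth or page-count limit before the recursive call.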
