
How to get href using jsoup

I have a URL. I want to collect every href from the HTML page it points to and then, recursively, every href from the pages those links lead to. The point is that I want to limit the depth of that recursion. For example, if depth = 1, I only need the hrefs from the initial HTML page. If depth = 2, I need the hrefs from that page (call them list1) plus the hrefs from each page linked in list1, and so on.

Here is what I have using jsoup:

import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class Parser {
    private final static String FILE_PATH = "src/main/resources/href.txt";
    private List<String> result;

    private int currentDepth;
    private int maxDepth;

    public Parser(int maxDepth) {
        result = new ArrayList<String>();
        this.maxDepth = maxDepth;
    }

    public void parseURL(String url) throws IOException {
        url = url.toLowerCase();
        if (!result.contains(url)) {
            Connection connection = Jsoup.connect(url);
            Document document = connection.get();
            Elements links = document.select("a[href]");
            for (Element link : links) {
                String href = link.attr("href");
                result.add(href);
                parseURL(link.absUrl("href"));
                currentDepth++;
                if (currentDepth == maxDepth)
                    return;
            }
        }
    }
}

How should I fix the recursion condition to make this work?

I think you should check the depth before calling the recursive function:

if (currentDepth >= maxDepth){
    // do nothing
}else{
    parseURL(...)
}
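
To make the depth check above concrete, here is a minimal sketch that passes the depth as an argument instead of incrementing a shared counter. The class name DepthLimitedCrawler and the linkExtractor hook are hypothetical and not part of the original post. The hook stands in for the jsoup call so the recursion can be run without network access. A page is expanded again only if it is reached at a shallower depth than before, so a link first seen near the limit does not hide pages that are still within range.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class DepthLimitedCrawler {
    // Hypothetical hook: returns the absolute link targets found on a page.
    // With jsoup it could wrap Jsoup.connect(url).get().select("a[href]")
    // and collect link.absUrl("href") for every matched element.
    private final Function<String, List<String>> linkExtractor;
    private final int maxDepth;

    // Shallowest depth at which each page has been expanded so far.
    private final Map<String, Integer> shallowest = new HashMap<>();
    // Every href collected, in discovery order, without duplicates.
    private final Set<String> result = new LinkedHashSet<>();

    public DepthLimitedCrawler(Function<String, List<String>> linkExtractor, int maxDepth) {
        this.linkExtractor = linkExtractor;
        this.maxDepth = maxDepth;
    }

    public Set<String> crawl(String startUrl) {
        parse(startUrl, 1); // the start page is depth 1
        return result;
    }

    private void parse(String url, int depth) {
        // Check the depth first, before fetching anything.
        if (depth > maxDepth) {
            return;
        }
        // Skip pages already expanded at this depth or a shallower one.
        Integer seen = shallowest.get(url);
        if (seen != null && seen <= depth) {
            return;
        }
        shallowest.put(url, depth);

        for (String href : linkExtractor.apply(url)) {
            result.add(href);
            parse(href, depth + 1); // depth belongs to the call, not to the object
        }
    }
}
```

Because the depth travels with each call, returning from a nested page restores the caller's depth automatically, which the shared currentDepth field in the question cannot do. A page reached again at a shallower depth is fetched a second time, so a real extractor may want to cache its results.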

public void parseURL(String url) throws IOException {
    url = url.toLowerCase();
    if (!result.contains(url)) {
        Connection connection = Jsoup.connect(url);
        Document document = connection.get();
        Elements links = document.getElementsByAttribute("href");
        // Elements links = document.select("a[href]");
        for (Element link : links) {
            String href = link.attr("href");
            result.add(href);
            parseURL(link.absUrl("href"));
            currentDepth++;
            if (currentDepth == maxDepth)
                return;
        }
    }
}

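You can try this in your code: the method getElementsByAttribute(String attribute) returns all elements that have the specified attribute. Note that, unlike select("a[href]"), it also matches non-anchor elements that carry an href, such as <link> tags.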


 