Extracting https URLs with jsoup

Question

I have the following code, which uses jsoup to extract URLs from a given page.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {

        String url = "http://shopping.yahoo.com";
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.getElementsByTag("a");


        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.absUrl("href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}

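What I am trying to do is build a crawler that extracts only https sites. I give the crawler a seed link to start from; it should extract all https links on that page, then fetch each extracted link and do the same for it, until a certain number of URLs has been collected.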

My question: the code above extracts all links on a given page. I need to extract only the links that start with https://. What do I need to do to achieve this?

Answer 1

You can use jsoup selectors. They are very powerful.

doc.select("a[href*=https]");  // matches if the href value contains "https" anywhere (this is the one you are looking for)
doc.select("a[href^=www]");    // matches if the href value starts with "www"
doc.select("a[href$=.com]");   // matches if the href value ends with ".com"

And so on. Experiment with them and you will find the right one.
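For the crawl loop described in the question (start from a seed, keep only https links, repeat until enough URLs are collected), here is a minimal sketch. It is an illustration under stated assumptions, not a definitive implementation: the class name HttpsCrawlSketch, the method crawl, and the fetchLinks parameter are all hypothetical and not part of jsoup. Fetching is passed in as a function so the loop itself has no network dependency. Because the question asks for links that start with https://, the sketch checks the prefix rather than testing whether the URL merely contains "https".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical helper (not part of jsoup): breadth-first collection of https links. */
class HttpsCrawlSketch {

    /**
     * Starts at seed, keeps only links beginning with "https://", and stops once
     * limit distinct URLs have been collected or nothing is left to visit.
     * fetchLinks is supplied by the caller and returns the absolute hrefs found on a page.
     */
    static List<String> crawl(String seed, int limit,
                              Function<String, List<String>> fetchLinks) {
        Set<String> collected = new LinkedHashSet<>(); // de-duplicates, keeps discovery order
        Set<String> visited = new HashSet<>();         // pages already fetched
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);

        while (!queue.isEmpty() && collected.size() < limit) {
            String page = queue.poll();
            if (!visited.add(page)) {
                continue;                              // never fetch the same page twice
            }
            for (String href : fetchLinks.apply(page)) {
                if (collected.size() >= limit) {
                    break;
                }
                // URL schemes are case-insensitive, so compare the prefix ignoring case.
                boolean isHttps = href != null
                        && href.regionMatches(true, 0, "https://", 0, 8);
                if (isHttps && collected.add(href)) {
                    queue.add(href);                   // crawl each newly found https link later
                }
            }
        }
        return new ArrayList<>(collected);
    }
}
```

With jsoup, fetchLinks could plausibly be built from Jsoup.connect(url).get(), a prefix selector such as doc.select("a[href^=https://]"), and link.absUrl("href") on each result. Note that the selector matches the raw attribute value, so relative links are skipped before absUrl is applied.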

Solution 1 (2, accepted) 2012-07-05 05:30:24
