Extracting anchor tags from HTML using Java

Question

I have several anchor tags in a text.

Input: <a href="http://stackoverflow.com" >Take me to StackOverflow</a>

Output: http://stackoverflow.com

How can I find all such input strings and convert them to the output string in Java, without using a third-party API?

Answer 1

There are classes in the core API that you can use to get all href attributes from anchor tags (where present):

import java.io.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class HtmlParseDemo {
   public static void main(String [] args) throws Exception {

       String html =
           "<a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> " +
           "<!--                                                               " +
           "<a href=\"http://ignoreme.com\" >...</a>                           " +
           "-->                                                                " +
           "<a href=\"http://www.google.com\" >Take me to Google</a>           " +
           "<a>NOOOoooo!</a>                                                   ";

       Reader reader = new StringReader(html);
       HTMLEditorKit.Parser parser = new ParserDelegator();
       final List<String> links = new ArrayList<String>();

       parser.parse(reader, new HTMLEditorKit.ParserCallback(){
           public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               if(t == HTML.Tag.A) {
                   Object link = a.getAttribute(HTML.Attribute.HREF);
                   if(link != null) {
                       links.add(String.valueOf(link));
                   }
               }
           }
       }, true);

       reader.close();
       System.out.println(links);
   }
}

This will print:

[http://stackoverflow.com, http://www.google.com]

Answer 2

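import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReplaceDemo {
    public static void main(String[] args) {
        String test = "qazwsx<a href=\"http://stackoverflow.com\">Take me to StackOverflow</a>fdgfdhgfd"
                + "<a href=\"http://stackoverflow2.com\">Take me to StackOverflow2</a>dcgdf";

        // Group 1 captures the quoted href value; this only matches when href is the sole attribute.
        String regex = "<a href=(\"[^\"]*\")[^<]*</a>";

        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(test);

        // Replace each whole anchor element with its quoted href value.
        System.out.println(m.replaceAll("$1"));
    }
}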

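Note: All of Andrzej Doyle's points are valid. If your input contains anything more than simple <a href="X">Y</a> elements, and you are sure it is parsable HTML, you are better off using an HTML parser.

To summarize:

The regex I posted will not work if you have an <a> inside a comment. (You can treat that as a special case.)
It will not work if the <a> tag has other attributes. (Again, you can treat that as a special case.)
There are many other cases where the regex will not work, and you cannot cover them all with a regex, because HTML is not a regular language.

However, if your requirement is always to replace <a href="X">Y</a> with "X" regardless of context, the code I posted will work.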
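To illustrate the second limitation above, here is a minimal sketch of a slightly more tolerant pattern that collects the href values into a list instead of rewriting the text. The class name HrefRegexSketch, the method extractHrefs and the pattern itself are illustrative assumptions, not part of the answer. It still matches anchors inside comments and can be fooled by attributes such as data-href, so the caveats above continue to apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexSketch {
    // Hypothetical pattern: allows other attributes before href and
    // accepts single or double quotes around the value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    public static List<String> extractHrefs(String text) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = HREF.matcher(text);
        // find() walks through every match; group(2) is the unquoted href value.
        while (m.find()) {
            hrefs.add(m.group(2));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String text = "qazwsx<a class=\"x\" href=\"http://stackoverflow.com\">Take me</a>"
                + "<a href='http://stackoverflow2.com' target=\"_blank\">Two</a>dcgdf";
        // prints [http://stackoverflow.com, http://stackoverflow2.com]
        System.out.println(extractHrefs(text));
    }
}
```

Using find() with a capture group keeps the surrounding text out of the result, which is closer to the output the question asks for than replaceAll.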

Answer 3

You can use JSoup:

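import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p>An <a href=\"http://stackoverflow.com\" >Take me to StackOverflow</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first(); // first <a> element in the document

String linkHref = link.attr("href"); // "http://stackoverflow.com"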


Answer 4

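The example above is perfect; if you want to parse an HTML document rather than concatenated strings, write something like the following to complement the code above.

Building on the existing code above: HtmlParser.java (the HtmlParseDemo.java above) is supplemented by HtmlPage.java below. The contents of the HtmlPage.properties file are shown below.

The main.url property in the HtmlPage.properties file is: main.url=http://www.whatever.com/

That way you can parse whichever URL you are after. :-) Happy coding :-D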

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlParser
{
    public static void main(String[] args) throws Exception
    {
        String html = HtmlPage.getPage();

        Reader reader = new StringReader(html);
        HTMLEditorKit.Parser parser = new ParserDelegator();
        final List<String> links = new ArrayList<String>();

        parser.parse(reader, new HTMLEditorKit.ParserCallback()
        {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
            {
                if (t == HTML.Tag.A)
                {
                    Object link = a.getAttribute(HTML.Attribute.HREF);
                    if (link != null)
                    {
                        links.add(String.valueOf(link));
                    }
                }
            }
        }, true);

        reader.close();

        // create the header
        System.out.println("<html>\n<head>\n   <title>Link City</title>\n</head>\n<body>");

        // spit out the links and create href
        for (String l : links)
        {
            System.out.print("   <a href=\"" + l + "\">" + l + "</a>\n");
        }

        // create footer
        System.out.println("</body>\n</html>");
    }
}

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ResourceBundle;

public class HtmlPage
{
    public static String getPage()
    {
        StringWriter sw = new StringWriter();
        ResourceBundle bundle = ResourceBundle.getBundle(HtmlPage.class.getName().toString());

        try
        {
            URL url = new URL(bundle.getString("main.url"));

            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            // A plain GET sends no request body, so setDoOutput(true) is not needed.
            connection.setRequestMethod("GET");

            InputStream content = (InputStream) connection.getInputStream();
            BufferedReader in = new BufferedReader(new InputStreamReader(content));

            String line;

            while ((line = in.readLine()) != null)
            {
                sw.append(line).append("\n");
            }

        } catch (Exception e)
        {
            e.printStackTrace();
        }

        return sw.getBuffer().toString();
    }
}

For example, this would be the link output for http://ebay.com.au/, which can be viewed in a browser. It is only a subset, since there are a lot of links:

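    <html>
    <head>
       <title>Link City</title>
    </head>
    <body>
       <a href="#mainContent">#mainContent</a>
       <a href="http://realestate.ebay.com.au/">http://realestate.ebay.com.au/</a>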
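Note that the extracted links can be relative, such as #mainContent above. If absolute URLs are wanted, one possible approach is to resolve each href against the page URL with java.net.URI. This is only a sketch: the class name ResolveLinksSketch and the method resolveAll are hypothetical, and it assumes every href is a syntactically valid URI (URI.resolve throws IllegalArgumentException otherwise).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ResolveLinksSketch {
    // Resolve each extracted href against the URL of the page it came from.
    public static List<String> resolveAll(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> resolved = new ArrayList<>();
        for (String href : hrefs) {
            // Absolute hrefs are returned unchanged; relative ones are
            // combined with the base URL.
            resolved.add(base.resolve(href).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("#mainContent", "http://realestate.ebay.com.au/");
        // prints [http://ebay.com.au/#mainContent, http://realestate.ebay.com.au/]
        System.out.println(resolveAll("http://ebay.com.au/", links));
    }
}
```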

Answer 5

If you need to build this without using third-party libraries, the most reliable way (as has already been suggested) is to use regular expressions (java.util.regex).

The alternative is to parse the HTML as XML, either using a SAX parser to capture and handle each instance of an "a" element, or as a DOM Document that you then search using XPath (see http://download.oracle.com/javase/6/docs/api/javax/xml/parsers/package-summary.html). This is problematic, though, since it requires the HTML page to be fully XML-compliant in its markup. That is a very dangerous assumption and not an approach I would recommend, since most "real" HTML pages are not XML-compliant.
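For completeness, here is a minimal sketch of the DOM plus XPath variant, using only JDK classes. The class name XPathLinksSketch and the method extractHrefs are illustrative assumptions. It assumes the input is well-formed XML with no default namespace; typical real-world HTML will fail to parse, which is exactly the risk described above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathLinksSketch {
    public static List<String> extractHrefs(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // Select the href attribute of every <a> element that has one.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//a/@href", doc, XPathConstants.NODESET);

            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                hrefs.add(nodes.item(i).getNodeValue());
            }
            return hrefs;
        } catch (Exception e) {
            // Markup that is not well-formed XML ends up here.
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<p><a href=\"http://stackoverflow.com\">SO</a>"
                + "<!-- <a href=\"http://ignoreme.com\">x</a> -->"
                + "<a>no href</a>"
                + "<a href=\"http://www.google.com\">Google</a></p>";
        // prints [http://stackoverflow.com, http://www.google.com]
        System.out.println(extractHrefs(xhtml));
    }
}
```

Anchors inside comments are skipped because the XML parser treats them as comment text, not elements. An unclosed tag such as <br> is enough to make the parse fail.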

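That said, I would still suggest looking at existing frameworks built for this purpose (such as JSoup, also mentioned above). There is no need to reinvent the wheel.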

5 answers

Answer 1: 6 votes, 2011-07-11 10:00:16
Answer 2: 4 votes, accepted, 2011-07-11 09:00:15
Answer 3: 3 votes, 2011-07-11 08:38:59
Answer 4: 2 votes, 2012-08-27 10:52:20
Answer 5: 0 votes, 2011-07-11 09:16:57
