Retrieving keywords from the META tags of an HTML web page using Java

Question

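Using Java, I want to retrieve all the content words from an HTML web page, as well as all the keywords contained in the META tag of that same page.
For example, consider this HTML source: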

<html>
<head>
<meta name = "keywords" content = "deception, intricacy, treachery">
</head>
<body>
My very short html document. 
<br>
It has just 2 'lines'.
</body>
</html>

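The content words here are: my, very, short, html, document, it, has, just, lines

Note: punctuation and the number "2" are excluded.

The keywords here are: deception, intricacy, treachery

I have created a class named WebDoc for this purpose, and this is as far as I have been able to get.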

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.Set;
import java.util.TreeSet;

public class WebDoc {

    protected URL _url;
    protected Set<String> _contentWords;
    protected Set<String> _keyWords;

    public WebDoc(URL paramURL) {
        _url = paramURL;
    }

    public Set<String> getContents() throws IOException {
        //URL url = new URL(url);
        Set<String> contentWords = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream()));
        String inputLine;

        while ((inputLine = in.readLine()) != null) {
            // Process each line.
            contentWords.add(RemoveTag(inputLine));
            //System.out.println(RemoveTag(inputLine));
        }
        in.close();
        System.out.println(contentWords);
        _contentWords = contentWords;
        return contentWords;
    }    

    public String RemoveTag(String html) {
        html = html.replaceAll("\\<.*?>","");
        html = html.replaceAll("&","");
        return html;
    }



    public Set<String> getKeywords() {
        //NO IDEA !
        return null;
    }

    public URL getURL() {
        return _url;
    }

    @Override
    public String toString() {
        return null;
    }
}

Answer 1

Process each line and use:

// Requires: import java.util.HashSet; import java.util.Set;
public Set<String> getKeywords(String str) {
    Set<String> s = new HashSet<String>();
    str = str.trim();
    if (str.toLowerCase().startsWith("<meta ")) {
        if (str.toLowerCase().matches("<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\".*\"/?>")) {
            // Keep only what is in the content attribute (case-insensitive)
            str = str.replaceAll("(?i)<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\"(.*)\"/?>","$1");
            // Split the comma-separated list and trim each keyword
            for (String st : str.split(",")) s.add(st.trim());
            return s;
        }
    }
    // The line is not a keywords META tag
    return null;
}

Let me know if you need an explanation.
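As a variation on the method above, here is a minimal sketch that uses java.util.regex.Pattern and Matcher directly instead of matches() followed by replaceAll(). The class name MetaKeywords and the method name extractKeywords are hypothetical, chosen only for illustration. The sketch assumes the name attribute comes before the content attribute and that both values are double-quoted; tags written in another order or with single quotes would not be matched.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original WebDoc code.
public class MetaKeywords {

    // Matches <meta name="keywords" content="..."> with optional whitespace
    // around '=' and an optional self-closing slash.
    // Assumption: name precedes content and both values are double-quoted.
    private static final Pattern KEYWORDS_TAG = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns the trimmed, non-empty keywords found in the given HTML text,
    // or an empty set if it contains no keywords META tag.
    public static Set<String> extractKeywords(String html) {
        Set<String> keywords = new LinkedHashSet<String>();
        Matcher m = KEYWORDS_TAG.matcher(html);
        while (m.find()) {
            // group(1) is the raw value of the content attribute
            for (String part : m.group(1).split(",")) {
                String keyword = part.trim();
                if (!keyword.isEmpty()) {
                    keywords.add(keyword);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String line = "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">";
        System.out.println(extractKeywords(line)); // prints [deception, intricacy, treachery]
    }
}
```

Returning an empty set rather than null lets the caller merge the result with addAll(...) for every line without a null check. For real-world pages a proper HTML parser is likely to be more robust than any regular expression.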

Answer 2

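So, following RedSoxFan's answer about the meta keywords, all that remains is to split your content lines into words. You can use a similar approach there:

Instead of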

contentWords.add(RemoveTag(inputLine));

use this (it needs an additional import of java.util.Arrays):

contentWords.addAll(Arrays.asList(RemoveTag(inputLine).split("[^\\p{L}]+")));

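.split(...) splits your line at every run of non-letters (I hope this works, please try it and report back). It returns an array of substrings, each of which should contain only letters, possibly along with an empty string (for example when the line is empty or starts with a non-letter).
Arrays.asList(...) wraps this array in a list.
addAll(...) adds all elements of this list to the set (duplicates are not added).

Finally, you should remove the empty string "" from the contentWords set.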
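Put together, the steps above can be sketched as follows. The class name ContentWords and the method name addWords are hypothetical. The sketch lower-cases the text so that the result matches the word list in the question, and it assumes that no tag spans more than one line, since tags are stripped line by line as in the original RemoveTag method.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, not part of the original WebDoc code.
public class ContentWords {

    // Strips tags and character entities from one line, then adds its
    // lower-cased, letter-only words to the given set.
    // Assumption: a tag never spans more than one line.
    public static void addWords(String line, Set<String> words) {
        String text = line.replaceAll("<[^>]*>", " ")    // drop tags
                          .replaceAll("&[#\\w]+;", " ")  // drop entities such as &nbsp;
                          .toLowerCase(Locale.ROOT);
        // Split at every run of non-letters; digits and punctuation vanish
        words.addAll(Arrays.asList(text.split("[^\\p{L}]+")));
        // split() can yield an empty string, so drop it from the set
        words.remove("");
    }

    public static void main(String[] args) {
        String[] lines = {
            "<html>",
            "<head>",
            "<meta name = \"keywords\" content = \"deception, intricacy, treachery\">",
            "</head>",
            "<body>",
            "My very short html document. ",
            "<br>",
            "It has just 2 'lines'.",
            "</body>",
            "</html>"
        };
        Set<String> words = new TreeSet<String>();
        for (String line : lines) {
            addWords(line, words);
        }
        // prints [document, has, html, it, just, lines, my, short, very]
        System.out.println(words);
    }
}
```

Because the set is a TreeSet, the words come out in alphabetical order rather than in document order; a LinkedHashSet would preserve the order of first appearance instead.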


Answer 1
1 2011-02-23 23:24:11

Answer 2
1 Accepted 2011-02-23 23:57:11
