Checking for the presence of HTML tags with Jsoup

Question

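With Jsoup it is easy to count how many times a particular tag occurs in a piece of text. For example, I am trying to see how many anchor tags appear in a given text.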

    String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(content);
    Elements links = doc.select("a[href]"); // a with href
    System.out.println(links.size());

This gives me a count of 4. If I have a sentence and want to know whether it contains any HTML tags at all, can Jsoup do that? Thanks.

Answer 1

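You would probably be better off with a regular expression, but if you really want to use Jsoup, you can try matching all elements and then subtracting 4, because Jsoup automatically adds four elements: first the root element, and then the <html>, <head> and <body> elements.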

Loosely, this could look like:

    // attempt to count html elements in string - incorrect code, see below
    public static int countHtmlElements(String content) {
        Document doc = Jsoup.parse(content);
        Elements elements = doc.select("*");
        return elements.size()-4;
    }

However, this gives a wrong result if the text itself contains <html>, <head> or <body>; compare the results of the following:

    // gives a correct count of 2 html elements
    System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
    // incorrectly counts 0 elements, as the body is subtracted
    System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));

So to make this work, you would have to check for the "magic" tags separately; that is why I feel a regular expression may be simpler.

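More attempts that failed: using parseBodyFragment instead of parse does not help, because Jsoup normalizes the input in the same way. Likewise, doc.select("body *"); saves you the trouble of subtracting 4, but it still produces a wrong count if a <body> is included. It only works under the restriction that you are sure, in your application, that the strings to be checked contain no <html>, <head> or <body> elements.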
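For the regular-expression route suggested above, a minimal sketch using only java.util.regex might look like the following. The class name HtmlTagCheck, its two methods and the pattern itself are assumptions made for illustration; they are not part of Jsoup or of the original answer. The pattern is a heuristic: it expects a letter right after "<" (or "</"), so a sentence such as "1 < 2" is not reported, but it does not understand comments, CDATA or attribute values that contain ">".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: a heuristic tag detector.
final class HtmlTagCheck {

    // Matches something shaped like <b>, </b>, <a href='x'>, <br/> or <br />.
    // A letter must follow "<" or "</", so "1 < 2" is not treated as a tag.
    private static final Pattern TAG =
            Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    // True if the text contains at least one thing that looks like a tag.
    static boolean containsHtmlTag(String text) {
        return TAG.matcher(text).find();
    }

    // Counts opening (or self-closing) tags only, skipping closing tags,
    // so <body> is counted like any other element.
    static int countOpeningTags(String text) {
        Matcher m = TAG.matcher(text);
        int count = 0;
        while (m.find()) {
            if (!m.group().startsWith("</")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(containsHtmlTag("some <b>text</b> with <i>markup</i>"));     // true
        System.out.println(containsHtmlTag("plain text, 1 < 2"));                       // false
        System.out.println(countOpeningTags("some <b>text</b> with <i>markup</i>"));    // 2
        System.out.println(countOpeningTags("<body>this gives a wrong result</body>")); // 1
    }
}
```

Because nothing is parsed and normalized here, no <html>, <head> or <body> elements are added behind the scenes, so there is nothing to subtract. The trade-off is that malformed or unusual markup may be miscounted.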

Answer 1: 1 accepted, 2013-02-15 22:30:53
