Java / JSoup plain text extraction and storage

Question

I am trying to solve the following problem.

Suppose I have an HTML file with the following content:

<div class = nameCouldBeAnything1><br>
    <p>some text here</p><br>
</div>

<div class = nameCouldBeAnything2><br>
    <p>some more text here</p><br>
</div>

<div class = nameCouldBeAnything3><br>
    <p>even more text here</p><br>
    <p>and here</p><br>
    <p>and here</p><br>
    <p>and here</p><br>
    <p>and here</p><br>
</div>

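What I want to achieve is to store the content between each pair of div tags in a separate string, or in its own element of a string array.

A Jsoup solution would be ideal; failing that, a regular expression that matches everything from <p> to </p> would also be fine.

The challenges to consider are:

1) You cannot rely on a specific div class name to locate the p tags and extract their plain text with Jsoup.

2) doc.select("body p") or doc.select("div p") does find the p tags, but when you store them in string variables, each p ends up in its own variable instead of being grouped by div.

Here is what I have so far: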

htmlFile = Jsoup.parse(input, "UTF-8");
Elements body = htmlFile.select("body p");
Element bodyStart = body.first();
Element bodyEnd = body.last(); 
Element p = bodyStart;
int divCount = 0; 

while(p != bodyEnd)
{
    p = body.get(divCount);
    System.out.println(p.text());        
    divCount++;
}

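This retrieves each individual p tag, but I want the p tags to stay grouped within their respective divs, with each div stored in its own string or string array element.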

Answer 1

You need to traverse the document as body->div->p rather than body->p.

Elements divs = htmlFile.select("body div");
//initialize div map here
for(Element div : divs) {
    Elements paras = div.getElementsByTag("p");
    for(Element para : paras) {
       String text = para.text();
    }
}

While iterating, you can store the text in whatever data structure you need. Hope this helps.
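As a minimal sketch of the body->div->p traversal with the storage step filled in: the class name DivTextCollector, the method name collect, and the choice to join each div's paragraphs with newlines are assumptions made for illustration, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivTextCollector {

    // Hypothetical helper: returns one string per div, built from the text
    // of that div's p tags joined by newlines. Divs without p text are skipped.
    static List<String> collect(Document doc) {
        List<String> result = new ArrayList<>();
        for (Element div : doc.select("body div")) {
            StringBuilder sb = new StringBuilder();
            for (Element para : div.getElementsByTag("p")) {
                String text = para.text();
                if (text.isEmpty()) {
                    continue; // ignore empty paragraphs
                }
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(text);
            }
            if (sb.length() > 0) {
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=a><p>some text here</p></div>"
                + "<div class=b><p>even more text here</p><p>and here</p></div>";
        Document doc = Jsoup.parse(html);
        for (String divText : collect(doc)) {
            System.out.println(divText);
            System.out.println("---");
        }
    }
}
```

Note that getElementsByTag("p") returns all descendant p tags, so with nested divs a paragraph would be collected once for each enclosing div; the sketch assumes the flat layout shown in the question.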

Answer 2

This puts every div tag that contains a p tag into a list of strings.

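import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
  public static void main(String[] args) throws IOException {
    File html = new File("src/main/resources/markup.html");
    Document doc = Jsoup.parse(html, "UTF-8");
    //all div tags wrapping a p tag
    Elements divs = doc.select("div:has(p)");
    //put the divs into a list
    List<String> list = new ArrayList<String>();
    for (Element div : divs) {
      list.add(div.toString());
      System.out.println(div + "\n");
    }
  }
}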

markup.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <title>whatever</title>
</head>

<body>
  <div class=nameCouldBeAnything0>
    <p>some text here</p>
  </div>

  <div class=nameCouldBeAnything1></div>

  <div class=nameCouldBeAnything2>
    <p>some more text here</p>
  </div>

  <div class=nameCouldBeAnything3>
    <p>even more text here</p>
    <p>and here</p>
    <p>and here</p>
    <p>and here</p>
    <p>and here</p>
  </div>

  <div class=nameCouldBeAnything4>
    <span>even more text here</span>
  </div>
</body>
</html>

Output

<div class="nameCouldBeAnything0"> 
  <p>some text here</p> 
</div>

<div class="nameCouldBeAnything2"> 
  <p>some more text here</p> 
</div>

<div class="nameCouldBeAnything3"> 
  <p>even more text here</p> 
  <p>and here</p> 
  <p>and here</p> 
  <p>and here</p> 
  <p>and here</p> 
</div>
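The list above holds each div's HTML markup. If plain text is wanted instead, a possible variation is sketched below: it keeps the same div:has(p) selector but stores the paragraph texts per div in a map. The class name DivParagraphMap, the method name paragraphsByDiv, and keying the map by div.className() are assumptions for illustration; the sketch presumes each div has a distinct class attribute and a jsoup release recent enough to provide Elements.eachText().

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivParagraphMap {

    // Hypothetical helper: maps each div's class attribute to the plain text
    // of the p tags inside it. Insertion order follows document order.
    static Map<String, List<String>> paragraphsByDiv(Document doc) {
        Map<String, List<String>> byDiv = new LinkedHashMap<>();
        for (Element div : doc.select("div:has(p)")) {
            // eachText() returns the text of every matched element that has text
            byDiv.put(div.className(), div.select("p").eachText());
        }
        return byDiv;
    }

    public static void main(String[] args) {
        String html = "<div class=nameCouldBeAnything0><p>some text here</p></div>"
                + "<div class=nameCouldBeAnything1></div>"
                + "<div class=nameCouldBeAnything3><p>even more text here</p><p>and here</p></div>";
        Map<String, List<String>> byDiv = paragraphsByDiv(Jsoup.parse(html));
        byDiv.forEach((name, paras) -> System.out.println(name + " -> " + paras));
    }
}
```

The class name is only read back from the document, never used to locate the divs, so the constraint from the question still holds. If two divs share a class name, the later one overwrites the earlier entry; a list of lists would avoid that.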

Answer 3

I was able to solve my own problem.

Here is the code I used; I hope it helps anyone who needs it.

Thanks to everyone who posted.

public static ArrayList<String> proc(Document htmlFile)
{
    Elements body = htmlFile.select("body");
    ArrayList<String> HTMLPlainText = new ArrayList<String>();

    // The first entry is the document title
    HTMLPlainText.add(htmlFile.title());

    for(Iterator<Element> it = body.iterator(); it.hasNext();)
    {
        Element pBody = it.next();
        Elements pTag = pBody.getElementsByTag("p");

        for(int pTagCount = 0; pTagCount < pTag.size(); pTagCount++)
        {
            Element p = pTag.get(pTagCount);
            String pt = p.text();

            // Skip paragraphs that have no text
            if(pt.length() != 0)
            {
                HTMLPlainText.add(pt);
            }
        }
    }

    return HTMLPlainText;
}

Note that I typed this in by hand, so there may be some syntax errors.
