Java/JSoup Plaintext Extraction and Storage

I am trying to solve the following problem.

Assume I have an HTML file that reads:


<div class = nameCouldBeAnything1><br>
    <p>some text here</p><br>
</div>

<div class = nameCouldBeAnything2><br>
    <p>some more text here</p><br>
</div>

<div class = nameCouldBeAnything3><br>
    <p>even more text here</p><br>
    <p>and here</p><br>
    <p>and here</p><br>
    <p>and here</p><br>
    <p>and here</p><br>
</div>

What I want is to store the contents between each pair of div tags in a separate String or String array variable, one per div.

A Jsoup solution would be ideal. Failing that, a regex that matches from an opening p tag to its closing /p tag would also work.

The challenges to take into consideration are:

1) You cannot use specific div class names to locate the p tags, because the class names can be anything.

2) doc.select("body p") or doc.select("div p") does return the p tags, but as one flat list. When you store them, each p tag ends up in its own variable instead of being grouped by its parent div.

This is what I have so far:

Document htmlFile = Jsoup.parse(input, "UTF-8");
Elements body = htmlFile.select("body p");
Element bodyStart = body.first();
Element bodyEnd = body.last(); 
Element p = bodyStart;
int divCount = 0; 

while(p != bodyEnd)
{
    p = body.get(divCount);
    System.out.println(p.text());        
    divCount++;
}

This prints each individual p tag. However, I want the p tags to stay grouped within their respective divs, with each div stored in its own String or String array variable.

You need to traverse the document as body -> div -> p instead of body -> p.

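Elements divs = htmlFile.select("body div");
// initialize the per-div data structure (e.g. a map or a list of lists) here
for (Element div : divs) {
    // only the p tags inside the current div
    Elements paras = div.getElementsByTag("p");
    for (Element para : paras) {
        String text = para.text();
        // store text under the current div here
    }
}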

While traversing, you can store the text in whatever data structure suits your requirements. Hope this helps.
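The traversal described above could be completed along these lines. This is only a minimal sketch: the class name DivParagraphs, the method name groupByDiv, and the choice of a list of lists are assumptions made for illustration, and it assumes the divs are not nested (with the "body div" selector, a nested div would be reported once on its own and once as part of its outer div).

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivParagraphs {

    // Returns one inner list per div; each inner list holds the text of that div's p tags.
    // Divs without any p tag are skipped.
    static List<List<String>> groupByDiv(String html) {
        Document doc = Jsoup.parse(html);
        List<List<String>> result = new ArrayList<>();
        for (Element div : doc.select("body div")) {
            List<String> paragraphs = new ArrayList<>();
            for (Element para : div.getElementsByTag("p")) {
                paragraphs.add(para.text());
            }
            if (!paragraphs.isEmpty()) {
                result.add(paragraphs);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the class names are never used for selection.
        String html = "<div class=a><p>some text here</p></div>"
                + "<div class=b><p>even more text here</p><p>and here</p></div>";
        for (List<String> paragraphs : groupByDiv(html)) {
            System.out.println(paragraphs);
        }
    }
}
```

Each inner list can be turned into a String array with paragraphs.toArray(new String[0]) if arrays are preferred.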

This stores the outer HTML of every div that contains a p tag in a list of strings.

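import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {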
  public static void main(String[] args) throws IOException {
    File html = new File("src/main/resources/markup.html");
    Document doc = Jsoup.parse(html, "UTF-8");
    //all div tags wrapping a p tag
    Elements divs = doc.select("div:has(p)");
    //put the divs into a list
    List<String> list = new ArrayList<String>();
    for (Element div : divs) {
      list.add(div.toString());
      System.out.println(div + "\n");
    }
  }
}

markup.html

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8" />
  <title>whatever</title>
</head>

<body>
  <div class=nameCouldBeAnything0>
    <p>some text here</p>
  </div>

  <div class=nameCouldBeAnything1></div>

  <div class=nameCouldBeAnything2>
    <p>some more text here</p>
  </div>

  <div class=nameCouldBeAnything3>
    <p>even more text here</p>
    <p>and here</p>
    <p>and here</p>
    <p>and here</p>
    <p>and here</p>
  </div>

  <div class=nameCouldBeAnything4>
    <span>even more text here</span>
  </div>
</body>
</html>

output

<div class="nameCouldBeAnything0"> 
  <p>some text here</p> 
</div>

<div class="nameCouldBeAnything2"> 
  <p>some more text here</p> 
</div>

<div class="nameCouldBeAnything3"> 
  <p>even more text here</p> 
  <p>and here</p> 
  <p>and here</p> 
  <p>and here</p> 
  <p>and here</p> 
</div>
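The list above holds markup. If plain text per div is wanted instead, one possible variation keeps the same div:has(p) selector and stores div.text() rather than div.toString(). This is a sketch, not part of the answer's code: the class name DivText and the method name textPerDiv are made up for illustration, and note that text() returns all text inside the div, so any text outside the p tags would be included too.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class DivText {

    // Returns one string per div that contains a p tag:
    // the div's combined, whitespace-normalized text.
    static List<String> textPerDiv(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element div : doc.select("div:has(p)")) {
            result.add(div.text());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample input: only the first div contains a p tag.
        String html = "<div class=x><p>some text here</p></div>"
                + "<div class=y></div>"
                + "<div class=z><span>no paragraph</span></div>";
        System.out.println(textPerDiv(html)); // prints [some text here]
    }
}
```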

I was able to solve my dilemma.

This is the code I used; hopefully it helps someone in need.

Thanks to everyone that posted.

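// Requires: java.util.ArrayList, java.util.Iterator,
// org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
public static ArrayList<String> proc(Document htmlFile)
{
    Elements body = htmlFile.select("body");
    ArrayList<String> HTMLPlainText = new ArrayList<String>();

    // the document title goes first
    HTMLPlainText.add(htmlFile.title());

    for(Iterator<Element> it = body.iterator(); it.hasNext();)
    {
        Element pBody = it.next();
        // all ancestors of the p tags (their divs, plus body and html)
        Elements pTag = pBody.getElementsByTag("p").parents();

        for(int pTagCount = 0; pTagCount < pTag.size(); pTagCount++)
        {
            Element p = pTag.get(pTagCount);
            String pt = p.text();

            if(pt.length() != 0)
            {
                HTMLPlainText.add(pt);
                // extra increment: the element after each non-empty one is skipped
                pTagCount++;
            }

            // clears the child nodes of every ancestor of the elements in pTag
            pTag.parents().empty();
        }
    }

    return HTMLPlainText;
}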

Note: there may be some syntax errors, as I typed this in manually.
