Java/JSoup Plaintext Extraction and Storage
I am trying to solve the following problem.
Assume I have an HTML file that reads:
<div class = nameCouldBeAnything1><br>
<p>some text here</p><br>
</div>
<div class = nameCouldBeAnything2><br>
<p>some more text here</p><br>
</div>
<div class = nameCouldBeAnything3><br>
<p>even more text here</p><br>
<p>and here</p><br>
<p>and here</p><br>
<p>and here</p><br>
<p>and here</p><br>
</div>
What I am trying to achieve is to store the contents between the div tags in separate string or string array variables.
A Jsoup solution would be ideal; failing that, a regex that matches from the opening p tag to the closing /p tag would also work.
The challenges to take into consideration are:
1) You cannot use specific div class names to pinpoint the location of the p tags in order to obtain the plaintext using Jsoup.
2) Using doc.select("body p") or doc.select("div p") from Jsoup kind of works; however, when you store the p tags in string variables, they are written to variables individually instead of grouped by div.
This is what I have so far:
htmlFile = Jsoup.parse(input, "UTF-8");
Elements body = htmlFile.select("body p");
Element bodyStart = body.first();
Element bodyEnd = body.last();
Element p = bodyStart;
int divCount = 0;
while(p != bodyEnd)
{
    p = body.get(divCount);
    System.out.println(p.text());
    divCount++;
}
This gets each individual p tag; however, I want the p tags to stay within their respective divs, with each div stored in its own string/string array variable.
You need to traverse the document body->div->p instead of body->p.
Elements divs = htmlFile.select("body div");
//initialize div map here
for(Element div : divs) {
    Elements paras = div.getElementsByTag("p");
    for(Element para : paras) {
        String text = para.text();
        //store text under the current div here
    }
}
You can store it in any data structure that fits your requirements while traversing. Hope this helps.
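One way to fill in the storage step, as a minimal sketch rather than a definitive implementation: collect the paragraph text of each div into its own list, so the outer list has one entry per div. The class name DivParagraphGroups, the method name groupByDiv, and the inline HTML string are assumptions made for illustration. Note that with nested divs, getElementsByTag("p") returns all descendant paragraphs, so a paragraph would appear under every div that encloses it.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the original answer
class DivParagraphGroups {

    // Returns one list of paragraph texts per div, in document order
    static List<List<String>> groupByDiv(Document doc) {
        List<List<String>> groups = new ArrayList<>();
        for (Element div : doc.select("body div")) {
            List<String> texts = new ArrayList<>();
            for (Element para : div.getElementsByTag("p")) {
                texts.add(para.text());
            }
            groups.add(texts);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Sample markup modelled on the question; class names are arbitrary
        String html = "<div class=a><p>some text here</p></div>"
                + "<div class=b><p>even more text here</p><p>and here</p></div>";
        Document doc = Jsoup.parse(html);
        for (List<String> group : groupByDiv(doc)) {
            System.out.println(group);
        }
    }
}
```

A div without any p tags ends up as an empty list here; skip those with an isEmpty() check if they are not wanted.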
This will put the div-tags that contain a p-tag into a list of strings.
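import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {
    public static void main(String[] args) throws IOException {
        File html = new File("src/main/resources/markup.html");
        Document doc = Jsoup.parse(html, "UTF-8");
        //all div tags wrapping a p tag
        Elements divs = doc.select("div:has(p)");
        //put the divs into a list
        List<String> list = new ArrayList<String>();
        for (Element div : divs) {
            list.add(div.toString());
            System.out.println(div + "\n");
        }
    }
}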
markup.html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<title>whatever</title>
</head>
<body>
<div class=nameCouldBeAnything0>
<p>some text here</p>
</div>
<div class=nameCouldBeAnything1></div>
<div class=nameCouldBeAnything2>
<p>some more text here</p>
</div>
<div class=nameCouldBeAnything3>
<p>even more text here</p>
<p>and here</p>
<p>and here</p>
<p>and here</p>
<p>and here</p>
</div>
<div class=nameCouldBeAnything4>
<span>even more text here</span>
</div>
</body>
</html>
output
<div class="nameCouldBeAnything0">
<p>some text here</p>
</div>
<div class="nameCouldBeAnything2">
<p>some more text here</p>
</div>
<div class="nameCouldBeAnything3">
<p>even more text here</p>
<p>and here</p>
<p>and here</p>
<p>and here</p>
<p>and here</p>
</div>
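Since the question asks for plaintext, a variant of the same idea can store the text of each div instead of its markup. The following is a sketch under assumptions: the class name DivPlainText and the method name plainTextPerDiv are made up for illustration, paragraphs are joined with a newline by choice, and Elements.eachText() requires a reasonably recent Jsoup release. If divs are nested, div:has(p) also matches the outer div, so its text would include the inner paragraphs.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of the original answer
class DivPlainText {

    // Returns one plain-text string per div that contains at least one p tag
    static List<String> plainTextPerDiv(Document doc) {
        List<String> result = new ArrayList<>();
        for (Element div : doc.select("div:has(p)")) {
            // eachText() yields the text of each matched p, skipping those with no text
            List<String> paragraphs = div.select("p").eachText();
            result.add(String.join("\n", paragraphs));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample markup modelled on markup.html; class names are arbitrary
        String html = "<div class=a><p>some text here</p></div>"
                + "<div class=b></div>"
                + "<div class=c><p>even more text here</p><p>and here</p></div>"
                + "<div class=d><span>no paragraph here</span></div>";
        for (String text : plainTextPerDiv(Jsoup.parse(html))) {
            System.out.println(text);
            System.out.println("---");
        }
    }
}
```

To get a string array per div instead of one joined string, keep the list returned by eachText() and call toArray(new String[0]) on it.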
I was able to solve my dilemma.
This is the code I used; hopefully it helps someone in need.
Thanks to everyone who posted.
public static ArrayList<String> proc(Document htmlFile)
{
    Elements body = htmlFile.select("body");
    ArrayList<String> HTMLPlainText = new ArrayList<String>();
    // The first entry is the page title
    HTMLPlainText.add(htmlFile.title());
    for(Iterator<Element> it = body.iterator(); it.hasNext();)
    {
        Element pBody = it.next();
        Elements pTag = pBody.getElementsByTag("p");
        for(int pTagCount = 0; pTagCount < pTag.size(); pTagCount++)
        {
            Element p = pTag.get(pTagCount);
            String pt = p.text();
            // Skip paragraphs that contain no text
            if(pt.length() != 0)
            {
                HTMLPlainText.add(pt);
            }
        }
    }
    return HTMLPlainText;
}
Note: there may be some syntax errors, as I typed this in manually.