Jsoup - How to extract every element
I am trying to get font information using Jsoup. For example:
Here is my code:
result = rtfToHtml(new StringReader(streamToString((InputStream)contents.getTransferData(dfRTF))));
// Example of text extraction from html
// Parse html
// String test = result.toString();
Document doc = Jsoup.parse(result);
String strdoc = doc.toString();
String[] words = strdoc.split("font-family");
// Select first bold text
Element firstBoldElt = doc.select("b").first();
Elements ele = doc.select("body");
String test = ele.toString();
Elements all = doc.select("b");
String boldtext = all.text();
With this code, my output looks like this:
"<body>
<p class="default">
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
<b>Hello World</b>
</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">, Testing</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
<i><b>Font </b></i>
</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;"> Style</span>
<span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
<i>Check</i>
</span>
<span style="color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;"></span>
</p>
</body>"
I can extract the first bold element, or all bold elements, but how do I extract every element in document order, like this?
<b>Hello World</b>
, Testing
<i><b>Font </b></i>
Style
<i>Check</i>
Any suggestions or references are highly appreciated.
EDITED
<body lang="en-MY" dir="LTR">
<p style="margin-bottom: 0in">
<font color="#000000"> <font face="ArialMT, serif"> <font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<b>BOLD </b>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<span style="font-weight: normal">
REGULAR
</span>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<u>
<span style="font-weight: normal">
UNDERLINED
</span>
</u>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<span style="font-weight: normal">
</span>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<i>
<span style="text-decoration: none">
<span style="font-weight: normal">
ITALIC
</span>
</span>
</i>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<span style="font-style: normal">
<span style="text-decoration: none">
<span style="font-weight: normal">
</span>
</span>
</span>
</font></font></font>
<font color="#000000"><font face="ArialMT, serif"><font size="2">
<i>
<span style="text-decoration: none">
<b>BOLDITALIC</b>
</span>
</i></font>
</font></font></p>
</body>
If you only need to extract the text from the document, plus any <b> or <i> tags (as in your example), consider using the Whitelist class (see the docs):
String html = "<body><p class='default'> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <b>Hello World</b> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> , Testing </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i><b>Font </b></i> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> Style </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i>Check</i> </span> <span style='color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;'> </span> </p></body>";
Whitelist wl = Whitelist.simpleText();
wl.addTags("b", "i"); // add additional tags here as necessary
String clean = Jsoup.clean(html, wl);
System.out.println(clean);
This will output (based on your example):
11-07 19:04:45.738: I/System.out(318): <b>Hello World</b> , Testing
11-07 19:04:45.738: I/System.out(318): <i><b>Font </b></i> Style
11-07 19:04:45.738: I/System.out(318): <i>Check</i>
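On newer jsoup releases (1.14.2 and later, as far as I know) the `Whitelist` class was renamed to `Safelist`, so the snippet above may not compile there as written. Below is a minimal sketch of the same idea under that assumption. The class name `SafelistDemo` and the helper `keepSimpleFormatting` are made up for illustration. Note that `simpleText()` already permits `b` and `i`, so an extra `addTags` call is only needed for other tags.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class SafelistDemo {
    // Strips everything except simple inline formatting tags.
    public static String keepSimpleFormatting(String html) {
        // simpleText() permits b, em, i, strong and u; other tags and all
        // attributes (such as style) are dropped, while their text is kept.
        return Jsoup.clean(html, Safelist.simpleText());
    }

    public static void main(String[] args) {
        String html = "<p class='default'>"
                + "<span style='font-family: MyriadPro-Bold;'><b>Hello World</b></span>"
                + "<span style='font-family: MyriadPro-Bold;'>, Testing</span>"
                + "<span style='font-family: MyriadPro-Bold;'><i><b>Font </b></i></span>"
                + "</p>";
        System.out.println(keepSimpleFormatting(html));
    }
}
```

The exact whitespace and line breaks in the result depend on jsoup's output settings, so it is safer not to rely on them.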
Update:
ArrayList<String> elements = new ArrayList<String>();
Elements e = doc.select("span");
for (int i = 0; i < e.size(); i++) {
elements.add(e.get(i).html());
}
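The loop above collects each span's inner HTML. If the font itself is what you are after, one option is to read each span's `style` attribute (for example with `e.get(i).attr("style")`) and split it into properties, rather than splitting the whole document on "font-family". This is a rough sketch using only the JDK. `StyleParser` and `parseStyle` are hypothetical names, and the parsing is deliberately naive: it assumes no `;` appears inside a value.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class StyleParser {
    // Splits an inline style string into property/value pairs, in declaration order.
    public static Map<String, String> parseStyle(String style) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed declarations
            }
            String name = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                props.put(name, value);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        String style = "color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;";
        Map<String, String> props = parseStyle(style);
        System.out.println(props.get("font-family")); // MyriadPro-Bold
        System.out.println(props.get("font-size"));   // 21pt
    }
}
```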
You need to change the selector to the <p> tag, like this:
Element all = doc.select("p").first();
Then you need to get all the children of that element.
String myString = "";
for(Element item : all.children()) {
myString += item.text();
}
I assume you want the text inside the tags, not the tags themselves.
Alternatively, you can do this.
Elements all = doc.select("b");
all.addAll(doc.select("i"));
all.addAll(doc.select("span"));
String myString = all.text();
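One caveat with combining separate selections is that the result is grouped by tag, not by document order, and text nested in several matching tags can show up more than once. For the deeply nested markup in the EDITED part of the question, a possible alternative is to visit the elements in document order and look at the ancestors of each one that directly holds text. The sketch below is only an illustration: `OrderedRuns`, `formattingTags` and `describeRuns` are made-up names, it only recognises the `b`, `i` and `u` tags (CSS such as `font-weight` is ignored), and it assumes each run of text sits in its own innermost element, as it does in the question's markup.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OrderedRuns {
    // Collects b/i/u tags found on the element itself or any of its ancestors,
    // innermost first.
    public static Set<String> formattingTags(Element el) {
        Set<String> tags = new LinkedHashSet<>();
        for (Element cur = el; cur != null; cur = cur.parent()) {
            String name = cur.tagName();
            if (name.equals("b") || name.equals("i") || name.equals("u")) {
                tags.add(name);
            }
        }
        return tags;
    }

    // Returns one "text [tags]" entry per element that directly owns
    // non-blank text, in document order.
    public static List<String> describeRuns(String html) {
        Document doc = Jsoup.parse(html);
        List<String> runs = new ArrayList<>();
        for (Element el : doc.body().select("*")) {
            String own = el.ownText().trim();
            if (!own.isEmpty()) {
                runs.add(own + " " + formattingTags(el));
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        String html = "<p><i><span><b>BOLDITALIC</b></span></i>"
                + "<u><span>UNDERLINED</span></u>"
                + "<span>REGULAR</span></p>";
        for (String run : describeRuns(html)) {
            System.out.println(run);
        }
    }
}
```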