Jsoup: How to extract every element

Question

I'm trying to get font information using Jsoup. For example: (sample font)

Below is my code:

result = rtfToHtml(new StringReader(streamToString((InputStream) contents.getTransferData(dfRTF))));
// Parse the HTML produced from the RTF clipboard contents
Document doc = Jsoup.parse(result);
// Naive attempt to find font information by splitting the serialized HTML
String strdoc = doc.toString();
String[] words = strdoc.split("font-family");
// Select the first bold element
Element firstBoldElt = doc.select("b").first();
// Serialize the <body> element
Elements ele = doc.select("body");
String test = ele.toString();
// Select all bold elements and get their combined text
Elements all = doc.select("b");
String boldtext = all.text();

With this code, my output looks like this:

"<body> 
 <p class="default">
     <span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
         <b>Hello World</b>
     </span>
     <span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">, Testing</span> 
     <span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
         <i><b>Font </b></i>
     </span>
     <span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;"> Style</span>
     <span style="color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;">
         <i>Check</i>
     </span>
     <span style="color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;"></span>
</p>   
</body>"

I can extract the first bold element or all bold elements, but how can I extract every element, like this?

<b>Hello World</b>
, Testing
<i><b>Font </b></i>
 Style 
<i>Check</i>

Any advice or references are highly appreciated.

EDITED:

<body lang="en-MY" dir="LTR"> 
 <p style="margin-bottom: 0in">
 <font color="#000000"> <font face="ArialMT, serif"> <font size="2">
 <span style="font-style: normal">
 <span style="text-decoration: none">
 <b>BOLD </b>
 </span>
 </span>
 </font></font></font>
 <font color="#000000"><font face="ArialMT, serif"><font size="2">
 <span style="font-style: normal">
 <span style="text-decoration: none">
 <span style="font-weight: normal">
 REGULAR 
 </span>
 </span>
 </span>
 </font></font></font>
 <font color="#000000"><font face="ArialMT, serif"><font size="2">
 <span style="font-style: normal">
 <u>
 <span style="font-weight: normal">
 UNDERLINED
 </span>
 </u>
 </span>
 </font></font></font>
 <font color="#000000"><font face="ArialMT, serif"><font size="2">
 <span style="font-style: normal">
 <span style="text-decoration: none">
 <span style="font-weight: normal"> 
 </span>
 </span>
 </span>
 </font></font></font>
 <font color="#000000"><font face="ArialMT, serif"><font size="2">
 <i>
 <span style="text-decoration: none">
 <span style="font-weight: normal">
 ITALIC
 </span>
 </span>
 </i>
 </font></font></font>
 <font color="#000000"><font face="ArialMT, serif"><font size="2">
 <span style="font-style: normal">
 <span style="text-decoration: none">
 <span style="font-weight: normal"> 
 </span>
 </span>
 </span>
 </font></font></font>
 <font color="#000000"><font face="ArialMT, serif"><font size="2">
 <i>
 <span style="text-decoration: none">
 <b>BOLDITALIC</b>
 </span>
 </i></font>
 </font></font></p>  
</body>

Answer 1

If you only need to extract the text from a document, plus any <b> or <i> tags (as in your example), consider using the Whitelist class (see the docs):

String html = "<body><p class='default'> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <b>Hello World</b> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> , Testing </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i><b>Font </b></i> </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> Style </span> <span style='color: #000000; font-size: 21pt; font-family: MyriadPro-Bold;'> <i>Check</i> </span> <span style='color: #000000; font-size: 10pt; font-family: MyriadPro-Bold;'> </span> </p></body>";

Whitelist wl = Whitelist.simpleText();
wl.addTags("b", "i"); // add additional tags here as necessary
String clean = Jsoup.clean(html, wl);
System.out.println(clean);

This outputs (matching your example):

11-07 19:04:45.738: I/System.out(318): <b>Hello World</b>   , Testing   
11-07 19:04:45.738: I/System.out(318): <i><b>Font </b></i>   Style   
11-07 19:04:45.738: I/System.out(318): <i>Check</i>
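
On newer jsoup releases the same approach looks slightly different: as far as I know, Whitelist was renamed Safelist in jsoup 1.14.2 and later removed. The sketch below assumes such a version is on the classpath. The class and method names (KeepInlineTags, keepInlineTags) are made up for illustration. Note that simpleText() already permits b, em, i, strong and u, so no addTags call is needed for this case.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, not part of jsoup.
class KeepInlineTags {
    // Removes every tag that is not on the safelist while keeping the text content.
    static String keepInlineTags(String html) {
        // simpleText() allows only b, em, i, strong and u.
        Safelist safelist = Safelist.simpleText();
        // Turn off pretty-printing so jsoup does not add its own line breaks.
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(html, "", safelist, settings);
    }

    public static void main(String[] args) {
        String html = "<p class='default'>"
                + "<span style='font-family: MyriadPro-Bold;'><b>Hello World</b></span>"
                + "<span style='font-family: MyriadPro-Bold;'>, Testing</span>"
                + "</p>";
        System.out.println(keepInlineTags(html));
    }
}
```

The <p> and <span> wrappers are dropped along with their style attributes, so this route loses the font information; it only helps if the inline markup is all you need.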

Update:

ArrayList<String> elements = new ArrayList<String>();

Elements e = doc.select("span");

for (int i = 0; i < e.size(); i++) {
    elements.add(e.get(i).html());
}
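
A self-contained version of the update might look like the sketch below. It assumes the structure of your first example, where every <span> is a direct child of the <p>; in the EDITED markup the spans sit inside nested <font> tags, so the selector would need adjusting. SpanFragments and spanFragments are illustrative names, not jsoup API.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class SpanFragments {
    // Collects the inner HTML of each <span> directly under a <p>, skipping empty spans.
    static List<String> spanFragments(String html) {
        Document doc = Jsoup.parse(html);
        // Without this, html() may re-indent the fragment and add line breaks.
        doc.outputSettings().prettyPrint(false);
        List<String> fragments = new ArrayList<>();
        for (Element span : doc.select("p > span")) {
            String inner = span.html();
            if (!inner.trim().isEmpty()) {
                fragments.add(inner);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String html = "<p><span><b>Hello World</b></span><span>, Testing</span><span></span></p>";
        for (String fragment : spanFragments(html)) {
            System.out.println(fragment);
        }
    }
}
```

Each list entry is one fragment with its <b> and <i> tags intact, which is the shape asked for in the question.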

Answer 2

You need to change your selector to the <p> tag, like so:

Element all = doc.select("p").first();

Then you need to get all the children of that element.

String myString = "";
for(Element item : all.children()) {
    myString += item.text();
}

I am assuming you want the text inside the tags, and not the tags themselves.

Alternatively, you could do this:
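
Put together as something runnable, the loop above could be sketched as follows. I've swapped the string concatenation for a StringBuilder and added a null check in case the document has no <p>. ParagraphText and childText are names I made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class ParagraphText {
    // Concatenates the text of every child element of the first <p>, without tags.
    static String childText(String html) {
        Document doc = Jsoup.parse(html);
        Element paragraph = doc.select("p").first();
        if (paragraph == null) {
            return ""; // no <p> in the document
        }
        StringBuilder sb = new StringBuilder();
        for (Element child : paragraph.children()) {
            sb.append(child.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><span><b>Hello World</b></span><span>, Testing</span></p>";
        System.out.println(childText(html));
    }
}
```

Keep in mind that children() returns only element children, so any text sitting directly inside the <p> (outside a <span>) would be skipped.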

Elements all = doc.select("b");
all.addAll(doc.select("i"));
all.addAll(doc.select("span"));
String myString = all.text();
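
One caveat with the snippet above: when a <b> is nested inside a <span>, both elements match, so the combined text will contain the same words more than once. If the real goal is the font information from the question, a rough sketch is to walk the styled spans and read font-family out of each style attribute. This assumes simple inline styles like those in your first example (no semicolons inside values); SpanFonts, styleValue and describeSpans are hypothetical names, not jsoup API.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, not part of jsoup.
class SpanFonts {
    // Returns the value of one CSS property from an inline style string, or "" if absent.
    static String styleValue(String style, String property) {
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // not a "name: value" pair
            }
            String name = declaration.substring(0, colon).trim();
            if (name.equalsIgnoreCase(property)) {
                return declaration.substring(colon + 1).trim();
            }
        }
        return "";
    }

    // Pairs the text of each non-empty styled <span> with its font-family.
    static List<String> describeSpans(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element span : doc.select("span[style]")) {
            String text = span.text();
            if (text.isEmpty()) {
                continue; // skip spans with no visible text
            }
            result.add(text + " -> " + styleValue(span.attr("style"), "font-family"));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p><span style='color: #000000; font-size: 21pt; "
                + "font-family: MyriadPro-Bold;'><b>Hello World</b></span></p>";
        for (String line : describeSpans(html)) {
            System.out.println(line);
        }
    }
}
```

This avoids splitting the serialized document on "font-family", which loses track of which text each font belongs to.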

2 answers

Answer 1
2, accepted, 2013-11-07 09:10:45

Answer 2
1, 2013-11-07 09:02:47
