Parsing text within blockquotes with Jsoup

Question

I'm trying to parse the Javadocs with Jsoup but I'm having trouble extracting the text wrapped in blockquote tags.

Here is a sample of the HTML I'm attempting to parse:

<P>
The <code>String</code> class represents character strings. All
 string literals in Java programs, such as <code>"abc"</code>, are
 implemented as instances of this class.
 <p>
 Strings are constant; their values cannot be changed after they
 are created. String buffers support mutable strings.
 Because String objects are immutable they can be shared. For example:
 <p><blockquote><pre>
     String str = "abc";
 </pre></blockquote><p>
 is equivalent to:
 <p><blockquote><pre>
     char data[] = {'a', 'b', 'c'};
     String str = new String(data);
 </pre></blockquote><p>
 Here are some more examples of how strings can be used:
 <p><blockquote><pre>
     System.out.println("abc");
     String cde = "cde";
     System.out.println("abc" + cde);
     String c = "abc".substring(2,3);
     String d = cde.substring(1, 2);
 </pre></blockquote>
 <p>

I'm trying to use this code to parse out the text contained in the p tags:

        Document doc = Jsoup.parse(new File("/home/facetoe/ebooks/Java/docs/api/java/lang/String.html"), "UTF-8");
        Elements para = doc.getElementsByTag("P");

        for ( Element element : para ) {
            System.out.println(element);
        }

However, no matter what I try, the text contained in the blockquote tags simply disappears.

Here is a sample of the output I get:

<p> The <code>String</code> class represents character strings. All string literals in Java programs, such as <code>&quot;abc&quot;</code>, are implemented as instances of this class. </p>
<p> Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable they can be shared. For example: </p>
<p></p>
<p> is equivalent to: </p>
<p></p>
<p> Here are some more examples of how strings can be used: </p>
<p></p>
<p> The class <code>String</code> includes methods for examining individual characters of the sequence, for comparing strings, for searching strings, for extracting substrings, and for creating a copy of a string with all characters translated to uppercase or to lowercase. Case mapping is based on the Unicode Standard version specified by the <a href="../../java/lang/Character.html" title="class in java.lang"><code>Character</code></a> class. </p>
<p> The Java language provides special support for the string concatenation operator (&nbsp;+&nbsp;), and for conversion of other objects to strings. String concatenation is implemented through the <code>StringBuilder</code>(or <code>StringBuffer</code>) class and its <code>append</code> method. String conversions are implemented through the method <code>toString</code>, defined by <code>Object</code> and inherited by all classes in Java. For additional information on string concatenation and conversion, see Gosling, Joy, and Steele, <i>The Java Language Specification</i>. </p>
<p> Unless otherwise noted, passing a <tt>null</tt> argument to a constructor or method in this class will cause a <a href="../../java/lang/NullPointerException.html" title="class in java.lang"><code>NullPointerException</code></a> to be thrown. </p>
<p>A <code>String</code> represents a string in the UTF-16 format in which <em>supplementary characters</em> are represented by <em>surrogate pairs</em> (see the section <a href="Character.html#unicode">Unicode Character Representations</a> in the <code>Character</code> class for more information). Index values refer to <code>char</code> code units, so a supplementary character uses two positions in a <code>String</code>. </p>
<p>The <code>String</code> class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., <code>char</code> values). </p>
<p> </p>

It's as if Jsoup drops anything wrapped in blockquote tags. Does anyone have any idea how I can retain those tags and extract the text from within them?

Answer 1

The reason is that Jsoup builds the DOM so that the blockquote elements sit outside the paragraphs. You can see this by printing the doc object. A blockquote start tag automatically terminates the preceding open p element (which does not require a closing p tag), so each "<p><blockquote>" in the source becomes an empty p followed by a sibling blockquote. You can observe the same thing if you load the HTML in a modern browser and inspect the elements.

Also see the HTML 4.01 specification - "The P element represents a paragraph. It cannot contain block-level elements (including P itself)." I'm sure there's something similar in HTML5.

So by iterating only through the paragraphs, you are missing the blockquotes which are not contained in them.
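
A minimal sketch of one way to act on this, assuming Jsoup is on the classpath: select both p and blockquote elements in a single query so the code samples come back in document order alongside the prose. The class name BlockquoteExample and the helper extractBlocks are made up for illustration, and the HTML string is a shortened version of the sample from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class BlockquoteExample {

    // Hypothetical helper: returns the text of every non-empty <p> and
    // <blockquote>, in the order they appear in the parsed document.
    static List<String> extractBlocks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> blocks = new ArrayList<>();
        // "p, blockquote" matches either tag, so the blockquotes that the
        // parser moved out of the paragraphs are no longer skipped.
        for (Element element : doc.select("p, blockquote")) {
            String text = element.text();
            // "<p><blockquote>" leaves an empty <p> behind; skip those.
            if (!text.isEmpty()) {
                blocks.add(text);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Shortened version of the Javadoc fragment from the question.
        String html = "<p>For example:\n"
                + "<p><blockquote><pre>\n"
                + "     String str = \"abc\";\n"
                + " </pre></blockquote><p>\n"
                + " is equivalent to:";
        for (String block : extractBlocks(html)) {
            System.out.println(block);
        }
    }
}
```

If only the code samples are needed, a narrower selector such as doc.select("blockquote pre") should work in the same loop.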

Answer 2

Looking at the Jsoup documentation for the parse method, it appears that Jsoup uses a whitelist mechanism to decide what is safe and what is not. Perhaps you need to set up a whitelist before parsing? That said, the whitelist only seems to apply to the clean method, so the cause might be something else.

Answer 3

You haven't closed your <p> tags, which may be the problem.

3 answers

solution1
1 ACCEPTED 2013-09-26 20:01:34

solution2
0 2013-09-26 13:14:23

solution3
0 2013-09-26 14:18:00