
Parsing text within blockquotes with Jsoup

I'm trying to parse the Javadocs with Jsoup but I'm having trouble extracting the text wrapped in blockquote tags.

Here is a sample of the HTML I'm attempting to parse:

<P>
The <code>String</code> class represents character strings. All
 string literals in Java programs, such as <code>"abc"</code>, are
 implemented as instances of this class.
 <p>
 Strings are constant; their values cannot be changed after they
 are created. String buffers support mutable strings.
 Because String objects are immutable they can be shared. For example:
 <p><blockquote><pre>
     String str = "abc";
 </pre></blockquote><p>
 is equivalent to:
 <p><blockquote><pre>
     char data[] = {'a', 'b', 'c'};
     String str = new String(data);
 </pre></blockquote><p>
 Here are some more examples of how strings can be used:
 <p><blockquote><pre>
     System.out.println("abc");
     String cde = "cde";
     System.out.println("abc" + cde);
     String c = "abc".substring(2,3);
     String d = cde.substring(1, 2);
 </pre></blockquote>
 <p>

I'm trying to use this code to parse out the text contained in the p tags:

        Document doc = Jsoup.parse(new File("/home/facetoe/ebooks/Java/docs/api/java/lang/String.html"), "UTF-8");
        Elements para = doc.getElementsByTag("P");

        for ( Element element : para ) {
            System.out.println(element);
        }

However no matter what I try the text contained in the blockquote tags just plain disappears.

Here is a sample of the output I get:

<p> The <code>String</code> class represents character strings. All string literals in Java programs, such as <code>&quot;abc&quot;</code>, are implemented as instances of this class. </p>
<p> Strings are constant; their values cannot be changed after they are created. String buffers support mutable strings. Because String objects are immutable they can be shared. For example: </p>
<p></p>
<p> is equivalent to: </p>
<p></p>
<p> Here are some more examples of how strings can be used: </p>
<p></p>
<p> The class <code>String</code> includes methods for examining individual characters of the sequence, for comparing strings, for searching strings, for extracting substrings, and for creating a copy of a string with all characters translated to uppercase or to lowercase. Case mapping is based on the Unicode Standard version specified by the <a href="../../java/lang/Character.html" title="class in java.lang"><code>Character</code></a> class. </p>
<p> The Java language provides special support for the string concatenation operator (&nbsp;+&nbsp;), and for conversion of other objects to strings. String concatenation is implemented through the <code>StringBuilder</code>(or <code>StringBuffer</code>) class and its <code>append</code> method. String conversions are implemented through the method <code>toString</code>, defined by <code>Object</code> and inherited by all classes in Java. For additional information on string concatenation and conversion, see Gosling, Joy, and Steele, <i>The Java Language Specification</i>. </p>
<p> Unless otherwise noted, passing a <tt>null</tt> argument to a constructor or method in this class will cause a <a href="../../java/lang/NullPointerException.html" title="class in java.lang"><code>NullPointerException</code></a> to be thrown. </p>
<p>A <code>String</code> represents a string in the UTF-16 format in which <em>supplementary characters</em> are represented by <em>surrogate pairs</em> (see the section <a href="Character.html#unicode">Unicode Character Representations</a> in the <code>Character</code> class for more information). Index values refer to <code>char</code> code units, so a supplementary character uses two positions in a <code>String</code>. </p>
<p>The <code>String</code> class provides methods for dealing with Unicode code points (i.e., characters), in addition to those for dealing with Unicode code units (i.e., <code>char</code> values). </p>
<p> </p>

It's like Jsoup just drops anything wrapped in blockquote tags. Does anyone have any idea how I can retain those tags and extract the text from within them?

The reason is that Jsoup builds the DOM so that the blockquote elements sit outside the paragraphs, as siblings rather than children. You can see this by printing the doc object. A blockquote start tag implicitly closes any open p element (the closing p tag is optional), so each <p><blockquote> in the source becomes an empty p followed by a separate blockquote. You can observe the same thing if you load the HTML in a modern browser and inspect the elements.
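One way to confirm this is to ask Jsoup for the parent of a blockquote. The sketch below is only an illustration: the class name, the helper method, and the shortened HTML string are made up for this example, and it assumes the fragment is parsed with the default HTML parser.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BlockquoteParentCheck {

    // Hypothetical helper: returns the tag name of the parent of the first
    // <blockquote> in the parsed HTML, or null if there is no blockquote.
    static String parentTagOfFirstBlockquote(String html) {
        Document doc = Jsoup.parse(html);
        Element blockquote = doc.select("blockquote").first();
        if (blockquote == null || blockquote.parent() == null) {
            return null;
        }
        return blockquote.parent().tagName();
    }

    public static void main(String[] args) {
        // Shortened version of the Javadoc markup from the question.
        String html = "<p>For example:"
                + "<p><blockquote><pre>String str = \"abc\";</pre></blockquote>"
                + "<p>is equivalent to:";
        // If the blockquote were nested in the paragraph this would print "p".
        System.out.println(parentTagOfFirstBlockquote(html));
    }
}
```

If the parent is reported as body rather than p, the blockquote was moved out of the paragraph during parsing.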

Also see the HTML 4.01 specification: "The P element represents a paragraph. It cannot contain block-level elements (including P itself)." HTML5 has an equivalent rule: a p element may contain only phrasing content, and blockquote is not phrasing content.

So by iterating only over the paragraphs, you miss the blockquotes, which are not contained in them.
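A possible fix, then, is to select the paragraphs and the blockquotes together. The following is a minimal sketch, not a definitive solution: the class and method names are invented for this example, and it parses an inline string instead of the String.html file from the question. The "p, blockquote" selector matches either tag, and the results come back in document order.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class BlockquoteExtractor {

    // Hypothetical helper: collects the text of every non-empty <p> and
    // <blockquote>, in the order they appear in the document.
    static List<String> extractBlocks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> blocks = new ArrayList<>();
        for (Element element : doc.select("p, blockquote")) {
            // Skip the empty <p></p> elements left behind where a
            // blockquote implicitly closed the paragraph.
            if (element.hasText()) {
                blocks.add(element.text());
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Shortened version of the Javadoc markup from the question.
        String html = "<p>For example:"
                + "<p><blockquote><pre>String str = \"abc\";</pre></blockquote>"
                + "<p>is equivalent to:";
        for (String block : extractBlocks(html)) {
            System.out.println(block);
        }
    }
}
```

To keep the markup rather than just the text, printing each element (as the code in the question does) instead of calling text() should work the same way with this selector.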

Looking at the Jsoup documentation for the parse method, it appears that Jsoup uses a whitelist mechanism to decide which tags are safe. Perhaps you need to set up a whitelist before parsing? That only seems to apply to the clean method, though, so the cause might be something else.

You haven't closed the <p> tags, which may be the problem.
