Jsoup fails to parse elements on very rare occasions

I recently migrated the RSS parsing in my application to Jsoup. When parsing the feed from one particular source, Jsoup occasionally fails to parse < and > correctly and leaves &lt; and &gt; in the retrieved Document, which in turn causes problems when using Document::select.

MCVE

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.Collection;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        Jsoup.connect("https://rss.packetstormsecurity.com/files/page18")
             .parser(Parser.xmlParser())
             .get()
             .select("item")
             .stream()
             .map(e -> e.select("pubDate"))
             .flatMap(Collection::stream)
             .map(Element::text)
             .forEach(System.out::println);
    }
}

The code above currently prints the following (the RSS feed is constantly updating, and the problem does not occur with local files):

Wed, 22 Nov 2017 15:29:54 GMT
Wed, 22 Nov 2017 15:29:43 GMT
Wed, 22 Nov 2017 15:29:36 GMT
Wed, 22 Nov 2017 15:29:28 GMT
Wed, 22 Nov 2017 15:29:22 GMT
Wed, 22 Nov 2017 15:27:23 GMT
Tue, 21 Nov 2017 23:23:23 GMT
Tue, 21 Nov 2017 19:21:38 GMT
Tue, 21 Nov 2017 19:20:12 GMT
Tue, 21 Nov 2017 19:18:15 GMT
Tue, 21 Nov 2017 19:16:17 GMT
Tue, 21 Nov 2017 19:14:37 GMT
Tue, 21 Nov 2017 19:13:34 GMT
Tue, 21 Nov 2017 19:11:33 GMT
Tue, 21 Nov 2017 19:07:49 GMT
Tue, 21 Nov 2017 19:06:56 GMT
Tue, 21 Nov 2017 19:04:19 GMT
Tue, 21 Nov 2017 19:03:57 GMT
Tue, 21 Nov 2017 10:11:11 GMT
Tue, 21 Nov 2017 04:54:00 GMT
Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
Mon, 20 Nov 2017 22:22:00 GMT
Mon, 20 Nov 2017 16:16:00 GMT
Mon, 20 Nov 2017 16:15:00 GMT
Mon, 20 Nov 2017 16:14:00 GMT

This is a snippet from the Document returned to me by Jsoup.

<item> 
 <title>Ubuntu Security Notice USN-3483-2</title> 
 <link>
  https://packetstormsecurity.com/files/145055/USN-3483-2.txt
 </link> 
 <guid isPermaLink="true">
  https://packetstormsecurity.com/files/145055/USN-3483-2.txt
 </guid> 
 <comments>
  https://packetstormsecurity.com/files/145055/Ubuntu-Security-Notice-USN-3483-2.html
 </comments> 
 <pubDate>
  Tue, 21 Nov 2017 04:04:00 GMT&lt;/pubDate&gt; <!-- the affected line -->
  <description>
   Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
  </description> 
  <category></category> 
 </pubDate>
</item>

Here, some of the characters have been parsed incorrectly, even though the XML on the website is well-formed.
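
Because the affected page keeps changing, it can help to detect the symptom programmatically rather than by reading the output. The following is only a minimal sketch: the class `MisparseCheck` and the method `suspiciousPubDates` are hypothetical names, and the heuristic assumes that a healthy `<pubDate>` contains plain text only, so any child element or literal `<` / `>` in its own text points to the misparse shown in the snippet above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.List;
import java.util.stream.Collectors;

final class MisparseCheck {
    // Hypothetical helper: returns the <pubDate> elements that look mangled,
    // i.e. that have child elements or a literal '<' / '>' in their own text.
    static List<Element> suspiciousPubDates(final Document document) {
        return document.select("pubDate")
                       .stream()
                       .filter(e -> !e.children().isEmpty()
                                 || e.ownText().contains("<")
                                 || e.ownText().contains(">"))
                       .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        // Inline sample modelled on the mangled snippet above (no network access needed).
        final String mangled =
                "<item><pubDate>Tue, 21 Nov 2017 04:04:00 GMT&lt;/pubDate&gt;"
              + "<description>text</description></pubDate></item>";
        final Document document = Jsoup.parse(mangled, "", Parser.xmlParser());
        suspiciousPubDates(document).forEach(e -> System.out.println(e.ownText()));
    }
}
```

The same method can be applied to the Document returned by `Jsoup.connect(...).parser(Parser.xmlParser()).get()` to check whether a given page currently triggers the problem.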


When using the same URL with a trailing slash (https://rss.packetstormsecurity.com/files/page18/), the problem does not occur on the same page; however, it occurs on different pages instead.

The pages of the feed on which the problem occurs also change, because the feed is actively updated. If the problem stops occurring on page 18, I will update the question with a new page. The problem also does not occur if the file is downloaded separately and then parsed with Jsoup::parse.
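
Since parsing a separately downloaded copy works, one possible workaround is to fetch the response body as a string and hand that string to Jsoup's XML parser, instead of letting the connection parse the stream itself. This is a sketch under that assumption, not a confirmed fix: `FeedWorkaround` and `pubDates` are hypothetical names, and whether the string path avoids the problem on every page has not been verified here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

final class FeedWorkaround {
    // Hypothetical helper: parses already-downloaded XML text with the XML parser
    // and returns the text of every <pubDate> that is a direct child of an <item>.
    static List<String> pubDates(final String xml, final String baseUri) {
        final Document document = Jsoup.parse(xml, baseUri, Parser.xmlParser());
        return document.select("item > pubDate")
                       .stream()
                       .map(Element::text)
                       .collect(Collectors.toList());
    }

    public static void main(final String[] args) throws IOException {
        final String url = "https://rss.packetstormsecurity.com/files/page18";
        // Read the whole body into a String first, then parse it in a separate step.
        final String xml = Jsoup.connect(url).execute().body();
        pubDates(xml, url).forEach(System.out::println);
    }
}
```

Keeping the parsing step separate from the download also makes it possible to test it against a saved copy of the feed.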

The Jsoup version is 1.11.2.

Additional MCVE

This MCVE shows that the problem occurs only when parsing the response with Jsoup; the actual downloaded XML is fine:

import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class MCVE {
    public static void main(final String[] args) throws IOException {
        final Connection.Response response = Jsoup.connect("https://rss.packetstormsecurity.com/files/page18").execute();

        // Well formed XML
        System.out.println(response.body());

        // Malformed XML
        System.out.println(response.parse());
    }
}

This appears to be a bug in org.jsoup.helper.HttpConnection::get and org.jsoup.helper.HttpConnection.Response::parse. Here's my corresponding GitHub issue, and here's a repo replicating the bug.

This will be fixed in Jsoup 1.11.3.
