
RSS feed completely different from how it is displayed in the browser

So I am trying to programmatically parse an RSS feed for a podcast in Java using dom4j.

The code is like this, and runs smoothly for lots and lots of feeds:

BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));

String line;

while((line = reader.readLine()) != null)
{
    this.xmlData += line + "\n";
}
reader.close();

PrintWriter writer = new PrintWriter("rss_feed.txt", "UTF-8");
writer.println(this.xmlData);
writer.close();

this.document = DocumentHelper.parseText(this.xmlData);

Then I run into a problem feed! The url for the feed is: https://marxismtodaypodcast.wordpress.com/category/audio/feed/

Now the weird thing is that if I look at this page in a browser, it looks like a normal RSS feed, replete with the elements that are expected.

Even this feed validator confirms it to be a properly formatted feed:

https://validator.w3.org/feed/

However, if I read in the URL and save it to a file, it looks nothing like the feed I see in the browser. It contains loads of JavaScript and none of the normal <item> elements at all, not even in the JavaScript code.

The dom4j parser hates the feed I download from the url, and throws an array of funky exceptions, due to the page being a .html page and not an xml page.

I suspect the javascript in the page is somehow creating the output that we see in the browser. Is there any way I can download what we see in the browser instead of the raw javascript file? I would like to do this in a way which can be automated, so not too hacky!

Or maybe I am barking up the wrong tree altogether, and something else is going on?

EDIT 1: Attempted to Accept XML in HTTP Header

So I've tried to get the HttpURLConnection to accept xml, as suggested by commenter Julien Genestoux. Here is the code I tried:

HttpURLConnection connection = (HttpURLConnection)feed.openConnection();
connection.setRequestProperty("Accept","application/atom+xml,application/rdf+xml,application/rss+xml,application/xml,text/xml");
connection.connect();
String content_type = connection.getContentType();
System.out.println("content = " + content_type);

However, when I run this, I get the same data back, with the content type reported as:

text/html; charset=UTF-8

Am I coding this correctly? I assume I have something wrong, since this RSS feed does validate correctly, so it must be possible to get XML-formatted data from this URL.

What you're bumping into is a content negotiation problem. Basically, the HTTP client can ask the server for the content in a specific format (using the Accept header), and the server can either comply by sending the content in the requested format or ignore the request and serve the content in whatever format it wants.

So your problem is not so much to "convert" the content you received as to get your HTTP library to ask only for the right format. To do this, add an HTTP Accept header with the following value: application/atom+xml,application/rdf+xml,application/rss+xml,application/xml,text/xml and the content you receive should be in the right format.
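As a rough illustration of the idea, here is a minimal sketch that sends the Accept header and refuses to parse anything the server does not label as XML. The class name FeedFetchSketch and the helpers isXmlContentType and fetchFeed are hypothetical names made up for this example, and readAllBytes assumes Java 9 or later. Since the server is free to ignore the Accept header (the question's edit reports text/html coming back even with it set), the check only makes that failure explicit; it does not by itself change what the server sends.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class for illustration only.
class FeedFetchSketch {

    // The Accept value suggested in the answer above.
    static final String FEED_ACCEPT =
        "application/atom+xml,application/rdf+xml,application/rss+xml,application/xml,text/xml";

    // Returns true when a Content-Type header value names an XML media type,
    // e.g. "application/rss+xml; charset=UTF-8" or "text/xml".
    static boolean isXmlContentType(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalise case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.endsWith("/xml") || mediaType.endsWith("+xml");
    }

    // Requests the feed with the Accept header set and returns the body as text.
    // Throws if the server answers with something other than XML.
    static String fetchFeed(URL feed) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) feed.openConnection();
        connection.setRequestProperty("Accept", FEED_ACCEPT);
        try {
            // getContentType() connects implicitly if needed.
            String contentType = connection.getContentType();
            if (!isXmlContentType(contentType)) {
                throw new IOException("Expected XML but server sent: " + contentType);
            }
            try (InputStream in = connection.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

The string returned by fetchFeed could then be handed to DocumentHelper.parseText as in the question, so dom4j only ever sees a response the server itself declared to be XML.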

Also, if you would rather not deal with all this yourself, think about using an API like Superfeedr, which can do the polling and parsing on your behalf and just send you normalized JSON.
