Can't get all content from webpage with HTMLParser

Question

I am using Jsoup to parse a webpage, this one: https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true

I can see the page's content in the browser, but when I try to parse it with Jsoup:

Document doc = Jsoup.parse("https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true");
System.out.println(doc);

It will return

<html>
<head></head>
<body>
https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&amp;kind=100&amp;nochange=true&amp;allPrio=trueμltiple=true&amp;allEx=true
</body>
</html>

This is not the page's full HTML.

Any suggestions on how I can solve this, or why it is happening?

Answer 1

That looks like they're detecting a crawler, usually via your user agent, and sending different content. Try setting your user agent string to a standard browser's string, and see if that resolves the issue you're having.
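
As a minimal sketch of the suggestion above, assuming the Jsoup library is on the classpath: note that Jsoup.parse(String) treats its argument as HTML markup rather than as an address to download, which matches the output in the question, where the URL text itself ends up inside the body. Jsoup.connect(url) is the call that performs an HTTP request, and userAgent(...) sets the header on it. The class name FetchExample, the helper browserLikeConnection, the User-Agent value, and the 10-second timeout are illustrative assumptions, not part of the original post.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchExample {
    // Assumed browser-like User-Agent string; any current browser's value should work.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Builds (but does not execute) a request that sends a browser-like User-Agent.
    static Connection browserLikeConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) throws IOException {
        String url = "https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta"
                + "?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true";

        // get() performs the HTTP GET and parses the response body into a Document.
        Document doc = browserLikeConnection(url).get();
        System.out.println(doc);
    }
}
```

If the output is still not what the browser shows after this change, the remaining difference is more likely to come from content that the page loads with JavaScript.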

One other potential problem, though I don't think it's the issue here, is that data loaded via AJAX will not be downloaded by Jsoup. It parses the HTML that gets served up, but it doesn't execute the JavaScript, so it can't get any extra content that comes in later. You might be able to resolve that issue using something like PhantomJS, which can process and render HTML, CSS, and JavaScript, and would (in theory) give you the actual HTML you end up seeing in your browser.

solution1
1 ACCEPTED 2012-09-01 01:13:31