
Can't get all content from webpage with HTMLParser

I am using Jsoup to parse a webpage, this one: https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true

I can see content on that page in the browser, but when I try to parse it with Jsoup:

Document doc = Jsoup.parse("https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&kind=100&nochange=true&allPrio=true&multiple=true&allEx=true");
System.out.println(doc);

it returns:

<html>
<head></head>
<body>
https://daisy.dsv.su.se/servlet/schema.MomentInfoRuta?id=261020&amp;kind=100&amp;nochange=true&amp;allPrio=trueμltiple=true&amp;allEx=true
</body>
</html>

which is not the page's full HTML.

Any suggestions on how I can solve this, or why it is happening?

That looks like they're detecting a crawler, usually via your user agent, and sending different content. Try setting your user agent string to a standard browser's string, and see if that resolves the issue you're having.
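As a rough illustration of the user agent suggestion, here is a minimal sketch that uses only the JDK's `java.net.http` client (Java 11+) to request a page with a browser-like User-Agent header. The class name `BrowserLikeFetch`, the helpers `buildRequest` and `fetchHtml`, and the `USER_AGENT` value are all made up for this example. With Jsoup itself the equivalent would be `Jsoup.connect(url).userAgent(...).get()`, which actually downloads the page; note that `Jsoup.parse(String)` treats its argument as HTML markup rather than as a URL to fetch, which is consistent with the output shown in the question.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class BrowserLikeFetch {
    // Assumed placeholder: any current browser's User-Agent string should work here.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0 Safari/537.36";

    // Build a GET request that identifies itself with the given User-Agent.
    public static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    // Download the response body as a string; this performs a real network call.
    public static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url, USER_AGENT), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java BrowserLikeFetch.java <url>");
            return;
        }
        // The returned string is real HTML and could then be handed to an HTML parser.
        System.out.println(fetchHtml(args[0]));
    }
}
```

If the HTML printed this way still lacks the content you see in the browser, the user agent was probably not the cause.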

One other potential problem, though I don't think it's the issue here, is that data loaded via AJAX will not be downloaded by Jsoup. It parses the HTML that gets served up, but it doesn't execute JavaScript, so it can't get any extra content that comes in later. You might be able to resolve that using something like PhantomJS, which can process and render HTML, CSS, and JavaScript, and would (in theory) give you the actual HTML you end up seeing in your browser.

