Dealing with pagination in web pages while using jsoup

I have been using jsoup to crawl the pages of a particular website. Basically, I am trying to extract all the hrefs that link to a PDF. I have succeeded in getting all the links on a single page, but there are 10 such pages. The site uses the JavaScript __doPostBack() function to navigate to the other pages. How do I get this done with jsoup?

This is how I am trying it right now:

// Replay the postback: send the ASP.NET hidden form fields and the session cookie.
Document document = Jsoup.connect(" some website name")
                        .data("__EVENTARGUMENT", __EVENTARGUMENT)
                        .data("__EVENTTARGET", __EVENTTARGET)
                        .data("__EVENTVALIDATION", __EVENTVALIDATION)
                        .data("__VIEWSTATEGENERATOR", __VIEWSTATEGENERATOR)
                        .cookie("ASP.NET_SessionId", sessionId)
                        .followRedirects(true)
                        .timeout(0) // 0 = wait indefinitely
                        .userAgent(
                            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
                        .post();

But I am getting a wrong URL as output. I have defined all the variables before sending the request.

When I hit this kind of problem, here is how I solve it:

  • Load the page in a browser
  • Inspect the HTTP messages exchanged between the browser and the server while going through the pages (Fiddler, Firebug, Dev Console/Toolbar ...)
  • Identify every single byte the browser and the server exchange (headers, cookies, form fields, etc.)
  • Once every byte is identified, try to go through the pages with hurl.it (enter the headers, cookies, user agent, etc.)
  • Once you succeed in going through the pages with hurl.it, instruct jsoup to do the same
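
As a rough illustration of the last step: on ASP.NET WebForms pages, __doPostBack(target, argument) writes its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form, together with the other hidden fields such as __VIEWSTATE (which the snippet in the question does not send). The sketch below only builds that set of form fields from a pager link. The class name PostBackForm, the link javascript:__doPostBack('GridView1','Page$2') and the field values are made up for the example; the real ones have to come from the traffic you captured.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: builds the form fields for an ASP.NET WebForms postback. */
final class PostBackForm {

    // Matches the call inside href="javascript:__doPostBack('target','argument')".
    private static final Pattern DO_POST_BACK =
            Pattern.compile("__doPostBack\\('([^']*)'\\s*,\\s*'([^']*)'\\)");

    /** Returns {target, argument} parsed from a link, or null if it is not a postback link. */
    static String[] parse(String href) {
        Matcher m = DO_POST_BACK.matcher(href);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    /**
     * Copies the hidden fields scraped from the current page and overrides the two
     * fields that __doPostBack() would set in the browser.
     */
    static Map<String, String> build(Map<String, String> hiddenFields, String href) {
        String[] parsed = parse(href);
        if (parsed == null) {
            throw new IllegalArgumentException("Not a __doPostBack link: " + href);
        }
        Map<String, String> form = new LinkedHashMap<>(hiddenFields);
        form.put("__EVENTTARGET", parsed[0]);
        form.put("__EVENTARGUMENT", parsed[1]);
        return form;
    }

    public static void main(String[] args) {
        // Hidden fields as they might be scraped from page 1 (values are made up).
        Map<String, String> hidden = new LinkedHashMap<>();
        hidden.put("__VIEWSTATE", "abc123");
        hidden.put("__EVENTVALIDATION", "def456");

        // Pager link for page 2 (assumed format; check the real href in the page source).
        Map<String, String> form = build(hidden, "javascript:__doPostBack('GridView1','Page$2')");
        System.out.println(form);
    }
}
```

With jsoup, the hidden fields can be collected from the previously fetched page with doc.select("input[type=hidden]"), reading attr("name") and val() of each element, and the finished map can be passed to Connection.data(Map<String, String>) before calling post(). Since __VIEWSTATE and __EVENTVALIDATION normally change with every response, they should be re-read from each page before requesting the next one.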
