I am working on a scraping exercise using Python requests and BeautifulSoup. Basically, I am crawling an Amazon web page. I am able to crawl the first page without any issues.
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something
But when I try to crawl the second page by appending "#2" to the URL:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")
I see that r still has the same value as for the first page, that is, the same as the result of:
r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
I don't know whether "#2" is causing trouble when requesting the second page. I also googled the issue but could not find a fix. What is the right way to make a request to a URL with # values? How can I address this issue? Please advise.
"#2" is a fragment identifier; it is not visible on the server side. The HTML content you get when opening "http://someurl.com/page#123" is the same as the content for "http://someurl.com/page".
In a browser you see the second page because the page's JavaScript reads the fragment identifier, makes an AJAX request, and injects the new content into the page. You should find the AJAX request's URL and use it:
It looks like our URL is:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj
From this it is easy to see that all we need to do is change the "pg" parameter value to get the other pages.
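A minimal sketch of that idea follows, assuming the URL pattern above keeps working when only the page number changes (both in the "ref=zg_bs_books_pg_N" part and in "pg"). The helper names page_url and fetch_page are hypothetical, and the sketch uses only the "ie" and "pg" parameters, leaving out whatever follows them in the URL above.

```python
from urllib.parse import urlencode

# Assumed pattern, taken from the URL shown above.
BASE = "http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_{page}"


def page_url(page):
    """Build the URL for one page of the list (hypothetical helper)."""
    query = urlencode({"ie": "UTF8", "pg": page})
    return BASE.format(page=page) + "?" + query


def fetch_page(page):
    """Download one page; needs the third-party requests package."""
    import requests  # imported lazily so page_url works without it

    response = requests.get(page_url(page), timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


# Print the URLs for the first three pages without touching the network.
for n in (1, 2, 3):
    print(page_url(n))
```

Calling fetch_page(2) would then return the HTML of the second page, which can be passed to BeautifulSoup in the same way as the first page.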
You need to request the URL in the href attribute of the anchor tags that describe the pagination. They are at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find that the first page's URL looks like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1
and the second page's URL looks like this:
http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2
The <a> tag for the second page looks like this:
<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>
So you need to change the request URL to the one in the href attribute.
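As a rough sketch of how those URLs could be collected automatically rather than copied by hand, the following uses the standard library's html.parser (swapped in here instead of BeautifulSoup so the example has no third-party dependencies). It assumes the pagination links are <a> tags carrying both a "page" and an "href" attribute, as in the tag shown above. The class name PaginationLinks is made up for illustration.

```python
from html.parser import HTMLParser


class PaginationLinks(HTMLParser):
    """Collect the href of every <a> tag that has a page attribute."""

    def __init__(self):
        super().__init__()
        self.links = {}  # page number -> URL

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "a" and "page" in attrs and "href" in attrs:
            self.links[int(attrs["page"])] = attrs["href"]


# The anchor tag quoted above, used as sample input.
sample = (
    '<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/'
    'ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" '
    'href="http://www.amazon.in/gp/bestsellers/books/'
    'ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>'
)

parser = PaginationLinks()
parser.feed(sample)
print(parser.links[2])
```

In a real run you would feed the HTML of the first page (r.text) to the parser and then request each collected URL in turn.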