
Python Scrape with requests and beautifulsoup

I am trying to do a scraping exercise using Python requests and BeautifulSoup. Basically, I am crawling an Amazon web page. I am able to crawl the first page without any issues.

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")
# do something with r

But when I try to crawl the second page by appending "#2" to the URL:

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2")

r still holds the same content as the first page, i.e. the same as the result of:

r = requests.get("http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers")

I don't know whether "#2" is causing trouble when requesting the second page. I also googled the issue but could not find a fix. What is the right way to request a URL with a # value, and how do I address this issue? Please advise.

"#2" is a fragment identifier. It is handled by the client and never sent to the server, so it is not visible on the server side. The HTML content you get when opening "http://someurl.com/page#123" is the same as the content for "http://someurl.com/page".

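To make the point above concrete, here is a minimal sketch using only the standard library. It splits the URL from the question into the part that goes out in the HTTP request and the fragment that stays on the client. The variable names are illustrative only.

```python
from urllib.parse import urldefrag

url = "http://www.amazon.in/gp/bestsellers/books/ref=nav_shopall_books_bestsellers#2"

# urldefrag separates the URL proper from the fragment identifier.
# Only the first part is used to build the HTTP request.
sent_to_server, fragment = urldefrag(url)

print(sent_to_server)  # the URL without "#2"
print(fragment)        # the fragment, kept on the client side
```

Since the "#2" variant and the plain URL produce the same request, the server returns the same first page for both.
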
In the browser you see the second page because the page's JavaScript reads the fragment identifier, makes an AJAX request, and injects the new content into the page. You should find the AJAX request's URL and request that instead.


It looks like our URL is:

http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&aj

From this it is easy to see that all we need to do is change the value of the "pg" parameter to get other pages.
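
A rough sketch of that idea follows. It builds one URL per page from the non-AJAX pattern that appears in this thread ("ref=zg_bs_books_pg_N?ie=UTF8&pg=N"). The helper names page_url and fetch_pages are hypothetical, and it is an assumption that the site still serves these URLs in this form.

```python
from urllib.parse import urlencode

# URL pattern taken from the URLs quoted in this thread; the page number
# appears both in the path and in the "pg" query parameter.
BASE = "http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_{page}"


def page_url(page):
    """Build the URL for one page of the bestseller list."""
    query = urlencode({"ie": "UTF8", "pg": page})
    return BASE.format(page=page) + "?" + query


def fetch_pages(pages):
    """Yield (page, html) pairs. This performs real HTTP requests."""
    import requests  # third-party; imported here so page_url works without it

    for page in pages:
        response = requests.get(page_url(page))
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        yield page, response.text


print(page_url(2))
```

Each html string yielded by fetch_pages(range(1, 6)) could then be passed to BeautifulSoup in the same way as the first page.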

You need to request the URL in the href attribute of the anchor tags that describe the pagination, at the bottom of the page. If I inspect the page in the developer console in Google Chrome, I find that the first page's URL looks like this:

http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_1?ie=UTF8&pg=1

and the second page's url is like this:

http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2

The anchor tag for the second page looks like this:

<a page="2" ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>

So you need to change the request URL to the one in the href attribute instead of appending "#2".
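
As an illustration, the href values can be pulled out of such anchor tags programmatically. The sketch below uses the standard library's HTMLParser to stay dependency-free. The class name PaginationLinks is made up for this example, and it assumes the pagination anchors are the ones carrying a "page" attribute, as in the tag shown above.

```python
from html.parser import HTMLParser


class PaginationLinks(HTMLParser):
    """Collect {page: href} from <a page="..." href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        # Assumption: only pagination anchors carry a "page" attribute.
        if tag == "a" and "page" in attrs and "href" in attrs:
            self.links[attrs["page"]] = attrs["href"]


# The anchor tag quoted in the answer above.
snippet = (
    '<a page="2" '
    'ajaxUrl="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2&ajax=1" '
    'href="http://www.amazon.in/gp/bestsellers/books/ref=zg_bs_books_pg_2?ie=UTF8&pg=2">21-40</a>'
)

parser = PaginationLinks()
parser.feed(snippet)
print(parser.links["2"])
```

In a real run, r.text from the first page would be fed to the parser instead of the snippet. With BeautifulSoup, the same lookup could be written as soup.find("a", attrs={"page": "2"})["href"].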

