
Scraping the second page of a website in Python does not work

Let's say I want to scrape the data here.

I can do it nicely using urlopen and BeautifulSoup in Python 2.7.
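
A minimal sketch of that first-page scrape, assuming Python 2.7 with urllib2 and BeautifulSoup 4 (the CSS selectors are the same ones used in the answer below and may break if Amazon changes its markup):

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/'
html = urllib2.urlopen(url).read()  # fetch the first page of the best-seller list
soup = BeautifulSoup(html, 'html.parser')

# each entry in the list sits inside a div.zg_itemImmersion block
for title in soup.select("div.zg_itemImmersion div.zg_title a"):
    print title.get_text(strip=True)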

Now suppose I want to scrape the data from the second page, using this address.

What I get is the data from the first page! I looked at the page source of the second page using Chrome's "View page source", and the content belongs to the first page!

How can I scrape the data from the second page?

The page is quite asynchronous in nature: XHR requests form the search results, so simulate them in your code using requests. Here is sample code as a starting point:

from bs4 import BeautifulSoup
import requests

url = 'http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/#2'
ajax_url = "http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/ref=zg_bs_173508_pg_2"

def get_books(data):
    # parse the returned HTML fragment and print every book title in it
    soup = BeautifulSoup(data, "html.parser")

    for title in soup.select("div.zg_itemImmersion div.zg_title a"):
        print title.get_text(strip=True)


with requests.Session() as session:
    # load the regular page first so the session picks up Amazon's cookies
    session.get(url)

    # present the follow-up requests as the same kind of XHR call the page itself makes
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'X-Requested-With': 'XMLHttpRequest'
    }

    for page in range(1, 10):
        print "Page #%d" % page

        # each results page is fetched in two requests: the first returns the
        # "above the fold" chunk, the second (isAboveTheFold=0) returns the rest
        params = {
            "_encoding": "UTF8",
            "pg": str(page),
            "ajax": "1"
        }
        response = session.get(ajax_url, params=params)
        get_books(response.content)

        params["isAboveTheFold"] = "0"
        response = session.get(ajax_url, params=params)
        get_books(response.content)

And don't forget to be a good web-scraping citizen and follow the Terms of Use.
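
For instance, a small check-and-throttle sketch (a hypothetical addition, not part of the original answer), assuming Python 2.7's robotparser module and the URLs used above:

import time
import robotparser

# consult robots.txt before crawling
rp = robotparser.RobotFileParser()
rp.set_url("http://www.amazon.com/robots.txt")
rp.read()

page_url = "http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/"
if rp.can_fetch("*", page_url):
    # ... issue the session.get() calls from the snippet above ...
    time.sleep(2)  # pause between pages to keep the request rate low
else:
    print "robots.txt disallows fetching this URL"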
