
How to scrape multiple pages with an unchanging URL - Python 3

I recently got into web scraping and have tried scraping various pages. Right now, I am trying to scrape the following site: http://www.pizzahut.com.cn/StoreList

So far I've used Selenium to scrape the longitude and latitude. However, my code right now only extracts the first page. I know there is dynamic web scraping that executes JavaScript and loads different pages, but I've had a hard time finding the right solution. I was wondering if there's a way to access the other 49 or so pages, because when I click next page the URL does not change, so I cannot just iterate over a different URL each time.

Here is my code so far:

import os
import requests
import csv
import sys
import time
from bs4 import BeautifulSoup

page = requests.get('http://www.pizzahut.com.cn/StoreList')

soup = BeautifulSoup(page.text, 'html.parser')

for row in soup.find_all('div', class_='re_RNew'):
    name = row.find('p', class_='re_NameNew').string
    # Each store's hidden input holds its metadata as 'lat,lng|...'
    info = row.find('input').get('value')
    location = info.split('|')
    location_data = location[0].split(',')
    # For these Shanghai stores the first value (~31) is the latitude
    # and the second (~121) the longitude, so name them accordingly
    latitude = location_data[0]
    longitude = location_data[1]
    print(latitude, longitude)
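
For context, the "dynamic web scraping" route mentioned above would mean driving the browser itself and clicking through the pagination. A minimal sketch of that idea, assuming a hypothetical '.next' button selector (the answer below sidesteps browser automation entirely):

# A rough Selenium sketch of the "click next page" idea. The '.next'
# selector is an assumption; the real page's markup would need inspecting.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('http://www.pizzahut.com.cn/StoreList')
for _ in range(49):
    driver.find_element(By.CSS_SELECTOR, '.next').click()  # assumed selector
    time.sleep(1)  # crude wait for the next page of stores to render
    # ... parse driver.page_source with BeautifulSoup here ...
driver.quit()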

Thank you so much for helping out. Much appreciated.

Steps to get the data:

Open the developer tools in your browser (for Google Chrome, it's Ctrl + Shift + I). Now, go to the XHR tab, which is located inside the Network tab.

[screenshot: the XHR tab inside the Network tab of the developer tools]

After doing that, click on the next page button. You'll see the following request appear.

[screenshot: the Index request showing up in the XHR list]

Click on that file. In the General block, you'll see the 2 things that we need.

[screenshot: the General block showing the Request URL (http://www.pizzahut.com.cn/StoreList/Index) and the Request Method (POST)]

Scrolling down, in the Form Data tab, you can see the 3 variables:

[screenshot: the Form Data block showing pageIndex, pageSize, and keyword]

Here, you can see that changing the value of pageIndex will give all the pages required.

Now that we've got all the required data, we can send a POST request to the URL http://www.pizzahut.com.cn/StoreList/Index with the above form data.

Code:

I'll show you the code to scrape the first 2 pages; you can scrape any number of pages you want by changing the range().

import requests
from bs4 import BeautifulSoup

for page_no in range(1, 3):
    # The form data captured from the XHR request; 'keyword' is just the
    # search box's placeholder text, sent as-is
    data = {
        'pageIndex': page_no,
        'pageSize': 10,
        'keyword': '输入餐厅地址或餐厅名称'
    }
    page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
    soup = BeautifulSoup(page.text, 'html.parser')

    print('PAGE', page_no)
    for row in soup.find_all('div', class_='re_RNew'):
        name = row.find('p', class_='re_NameNew').string
        # As above, the hidden input's value is 'lat,lng|...'
        info = row.find('input').get('value')
        location = info.split('|')
        location_data = location[0].split(',')
        latitude = location_data[0]
        longitude = location_data[1]
        print(latitude, longitude)

Output:

PAGE 1
31.085877 121.399176
31.271117 121.587577
31.098122 121.413396
31.331458 121.440183
31.094581 121.503654
31.270737000 121.481178000
31.138214 121.386943
30.915685 121.482079
31.279029 121.529255
31.168283 121.283322
PAGE 2
31.388674 121.35918
31.231706 121.472644
31.094857 121.219961
31.228564 121.516609
31.235717 121.478692
31.288498 121.521882
31.155139 121.428885
31.235249 121.474639
30.728829 121.341429
31.260372 121.343066
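
Since the question's imports include csv, here is a minimal sketch of persisting the scraped coordinates instead of printing them (the stores.csv file name and the header row are assumptions, not part of the original answer):

import csv
import requests
from bs4 import BeautifulSoup

with open('stores.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'latitude', 'longitude'])  # assumed header
    for page_no in range(1, 3):
        data = {
            'pageIndex': page_no,
            'pageSize': 10,
            'keyword': '输入餐厅地址或餐厅名称'
        }
        page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
        soup = BeautifulSoup(page.text, 'html.parser')
        for row in soup.find_all('div', class_='re_RNew'):
            name = row.find('p', class_='re_NameNew').string
            location_data = row.find('input').get('value').split('|')[0].split(',')
            writer.writerow([name, location_data[0], location_data[1]])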

Note: You can change the number of results per page by changing the value of pageSize (it's currently 10).
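
If you don't know the total page count up front, one hedged option is to keep requesting pages until one comes back empty. The sketch below assumes the endpoint simply returns no re_RNew divs past the last page, which I haven't verified:

import itertools
import requests
from bs4 import BeautifulSoup

for page_no in itertools.count(1):
    data = {
        'pageIndex': page_no,
        'pageSize': 10,
        'keyword': '输入餐厅地址或餐厅名称'
    }
    page = requests.post('http://www.pizzahut.com.cn/StoreList/Index', data=data)
    soup = BeautifulSoup(page.text, 'html.parser')
    rows = soup.find_all('div', class_='re_RNew')
    if not rows:  # assumption: an empty page marks the end of the listing
        break
    print('PAGE', page_no)
    for row in rows:
        location_data = row.find('input').get('value').split('|')[0].split(',')
        print(location_data[0], location_data[1])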
