
Web scraper last pagination with Python

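Is it possible to print all the pages on the site?

I want to print every page up to the last one, but the pagination bar only lists 9 links while the site has 24 pages.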

    import requests
    from bs4 import BeautifulSoup

    def login():
        print('test login')
        urls = "https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=0"
        headers = {
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'en-US,en;q=0.8',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
        }

        respon = requests.get(urls, headers=headers)
        f = open('./re.html', 'w+')
        f.write(respon.text)
        f.close()

        soup = BeautifulSoup(respon.text, 'html5lib')

        page_item = soup.find_all('div', attrs={'class': 'pagination-link-container__09f24__13AN7'})
        total_page = len(page_item)

        print(len(page_item))

        return total_page

    def get_url(page):
        print('test url ...')
        params = {
            'start': page
        }

        respon = requests.get('https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=0', params=params)

        suop = BeautifulSoup(respon.text, 'html5lib')

        titles = suop.find_all('span', attrs={'class': 'text__09f24__2tZKC text-color--black-regular__09f24__1QxyO text-align--left__09f24__3Drs0 text-weight--bold__09f24__WGVdT text-size--inherit__09f24__2rwpp'})

        urls = []
        for title in titles:
            url = title.find('a')['href']
            urls.append(url)

        return  urls


    def run():
        total_page = login()

        total_urls = []
        for i in range(total_page):
            page = i + 1

            # print(page)
            urls =  get_url(page)
            total_urls += urls

        # print(total_urls)
        print(len(total_urls))


    if __name__ == '__main__':
        run()

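After browsing through the pagination links, it is clear that the page can be selected with a parameter at the end of the URL. Clicking through the links, I found the following: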

| Pagination button | URL parameter |
|-------------------|---------------|
| 1                 | start=0       |
| 2                 | start=20      |
| 3                 | start=40      |

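The pattern is clear: the counter starts at 0 and increases by 20 for each page. Now all you need to do is extract the maximum number of pages and then use a simple for loop to build your URLs. Luckily, there is a small box at the right end of the page, next to the pagination bar, that shows how many pages there are.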
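As a minimal sketch of that pattern, the mapping from page count to `start` values can be written as a small helper. The name `start_offsets` and the constant `RESULTS_PER_PAGE` are hypothetical and only serve the illustration; the step of 20 is taken from the table above and may differ on other sites.

```python
# Assumption: 20 results per page, as observed in the pagination links.
RESULTS_PER_PAGE = 20

def start_offsets(total_pages, per_page=RESULTS_PER_PAGE):
    """Return the `start` value for every page, first page first."""
    return [page_index * per_page for page_index in range(total_pages)]

print(start_offsets(3))  # [0, 20, 40]
```

Note that the page number itself (1, 2, 3, ...) is not a valid `start` value here; each page has to be multiplied out to its offset.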

Here is my example. This code only prints out the URLs.


    import requests
    from bs4 import BeautifulSoup
    
    #Base url
    url = "https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA"
    
    #Just getting the soup from the page
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    
    #Extracting the maximum number of pages
    #I used a CSS selector but anything can be used to grab it
    
    selector = ".text-align--center__09f24__31irQ .text-align--left__09f24__3Drs0"
    num = soup.select(selector)[0].text #This gives the string "1 of 12"
    
    #Splitting the string to get just the maximum
    num = num.split("of ")[1]
    num = int(num) #Now num = 12
    
    #Using a for loop to count up and concat the value to the url:
    for i in range(0, num*20, 20):
        print(url + "&start=" + str(i))

Here is my output:


    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=0
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=20
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=40
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=60
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=80
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=100
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=120
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=140
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=160
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=180
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=200
    https://www.yelp.com/search?cflt=contractors&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=220
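The example relies on the label always reading like "1 of 12" and on the base URL not already carrying a `start` parameter (the URL in the question does end in `&start=0`). A slightly more defensive sketch, using only the standard library, could look like the following. The helper names `parse_total_pages` and `with_start` are hypothetical, and the step of 20 is the same assumption as before.

```python
import re
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

def parse_total_pages(label):
    """Extract N from a label shaped like '1 of N'; raise if the shape differs."""
    match = re.search(r"of\s+(\d+)", label)
    if match is None:
        raise ValueError(f"unexpected page label: {label!r}")
    return int(match.group(1))

def with_start(url, start):
    """Return `url` with its `start` query parameter set to `start`.

    An existing `start` parameter is replaced rather than duplicated.
    """
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "start"]
    query.append(("start", str(start)))
    # quote_via=quote keeps spaces encoded as %20 instead of '+'.
    return urlunsplit(parts._replace(query=urlencode(query, quote_via=quote)))

base = ("https://www.yelp.com/search?cflt=contractors"
        "&find_loc=St%20Francis%20Wood%2C%20San%20Francisco%2C%20CA&start=0")
total = parse_total_pages("1 of 12")
for start in range(0, total * 20, 20):
    print(with_start(base, start))
```

Replacing the parameter instead of appending it avoids ending up with two `start` values in one URL, which is what happens when a URL that already ends in `&start=0` is combined with `params={'start': ...}`.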

