
Beautiful Soup - Scrape multiple pages

How can I scrape multiple pages from a website? This code only works for the first page. Any advice would be appreciated. Thanks.

import csv
import requests
from bs4 import BeautifulSoup

import datetime

filename = "azet_" + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M")+".csv"
with open(filename, "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["Descriere","Pret","Data"])

    r = requests.get("https://azetshop.ro/12-extensa?page=1")

    soup = BeautifulSoup(r.text, "html.parser")
    x = soup.find_all("div", "thumbnail")

    for thumbnail in x:
        descriere = thumbnail.find("h3").text.strip()
        pret = thumbnail.find("span", "price").text.strip()

        writer.writerow([descriere, pret, datetime.datetime.now()]) 

For scraping multiple pages with BeautifulSoup, a while loop is commonly used:

import csv
import requests
from bs4 import BeautifulSoup    
import datetime

end_page_num = 50

filename = "azet_" + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M")+".csv"
with open(filename, "w+") as f:

    writer = csv.writer(f)
    writer.writerow(["Descriere","Pret","Data"])
    i = 1
    while i <= end_page_num:

        r = requests.get("https://azetshop.ro/12-extensa?page={}".format(i))

        soup = BeautifulSoup(r.text, "html5lib")
        x = soup.find_all("div", {'class': 'thumbnail-container'})

        for thumbnail in x:
            descriere = thumbnail.find('h1', {"class": "h3 product-title"}).text.strip()
            pret = thumbnail.find('span', {"class": "price"}).text.strip()
            writer.writerow([descriere, pret, datetime.datetime.now()])
        i += 1

Here, i is incremented by 1 each time a page finishes scraping. The loop continues until it reaches the end_page_num you define.
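Hardcoding end_page_num works, but when you don't know the page count in advance you can loop open-endedly and stop as soon as a page yields no thumbnails. A minimal sketch of that stop-when-empty pattern, where fetch_page is a stand-in for the requests/BeautifulSoup call shown above (here faked with a dict so the logic is self-contained):

```python
import itertools

def scrape_all_pages(fetch_page):
    """Iterate pages 1, 2, 3, ... until a page yields no items."""
    rows = []
    for i in itertools.count(1):
        items = fetch_page(i)
        if not items:       # empty page -> we went past the last page
            break
        rows.extend(items)
    return rows

# stand-in fetcher: three pages of two items each, then nothing
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e", "f"]}
print(scrape_all_pages(lambda i: fake_site.get(i, [])))
# → ['a', 'b', 'c', 'd', 'e', 'f']
```

With a real fetcher, items would be the soup.find_all(...) result for page i, so the loop ends on the first page that returns no thumbnails.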

This code also works well, using bs4's class attribute:

import csv
import requests
from bs4 import BeautifulSoup
import datetime

filename = "azet_" + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M")+".csv"
with open(filename, "w+") as f:
    writer = csv.writer(f)
    writer.writerow(["Descriere","Pret","Data"])

    for i in range(1,50):
        r = requests.get("https://azetshop.ro/12-extensa?page="+str(i))

        soup = BeautifulSoup(r.text, "html.parser")
        array_price = soup.find_all('span', class_='price')
        array_desc = soup.find_all('h1', class_='h3 product-title', text=True)
        for iterator in range(0, len(array_price)):
            descriere = array_desc[iterator].text.strip()
            pret = array_price[iterator].text.strip()

            writer.writerow([descriere, pret, datetime.datetime.now()])
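Indexing two parallel find_all lists works, but zip pairs them more idiomatically and simply stops at the shorter list if the counts ever differ. A stdlib-only sketch of that pairing plus the CSV writing, with hardcoded strings standing in for the scraped .text values:

```python
import csv
import io

# stand-ins for the .text.strip() results from the two find_all lists
descs = ["Extensa A", "Extensa B"]
prices = ["10 lei", "12 lei"]

buf = io.StringIO()          # in-memory file, so the sketch needs no disk I/O
writer = csv.writer(buf)
writer.writerow(["Descriere", "Pret"])
for descriere, pret in zip(descs, prices):  # pair description/price row by row
    writer.writerow([descriere, pret])

print(buf.getvalue())
```

In the answer above, the same loop would be `for desc, price in zip(array_desc, array_price):` with `.text.strip()` applied inside the loop.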
