
How to scrape data from multiple pages of one website using Python and BeautifulSoup

   # -*- coding: utf-8 -*-
"""
Created on Fri Jun 29 10:38:46 2018

@author: Cinthia
"""

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
array = ['146-face', '153-palettes-sets',
         'https://www.sociolla.com/147-eyes',
         'https://www.sociolla.com/150-lips',
         'https://www.sociolla.com/149-brows',
         'https://www.sociolla.com/148-lashes']
base_url='https://www.sociolla.com/142-face'
uClient = uReq(base_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grab the product
kosmetik = page_soup.findAll("div", {"class":"col-md-3 col-sm-6 ipad-grid col-xs-12 productitem"})
print(len(kosmetik))

I want to scrape data from that website; the code above only counts how many products are on the base URL. I don't know how that array should work so that it can collect product data (description, image, price) from all the pages I put in the array.

I'm new to Python and don't know much about loops yet.

You can find the root element of your table/grid, which has id=product-list-grid, and extract the attributes that hold all the information you need (brand, link, category) together with the first <img> tag.

For pagination, it seems you can get to the next page by adding p=<page number>, and when the page doesn't exist the site redirects to the first one. A workaround is to compare the response URL with the one you requested: if they match, increment the page number; otherwise you have scraped all the pages.

from bs4 import BeautifulSoup
import urllib.request

count = 1
url = "https://www.sociolla.com/142-nails?p=%d"

def get_url(url):
    req = urllib.request.Request(url)
    return urllib.request.urlopen(req)

expected_url = url % count
response = get_url(expected_url)

results = []

while response.url == expected_url:
    print("GET {0}".format(expected_url))
    soup = BeautifulSoup(response.read(), "html.parser")

    products = soup.find("div", attrs = {"id" : "product-list-grid"})

    results.append([
        (
            t["data-eec-brand"],    #brand
            t["data-eec-category"], #category
            t["data-eec-href"],     #product link
            t["data-eec-name"],     #product name
            t["data-eec-price"],    #price
            t.find("img")["src"]    #image link
        ) 
        for t in products.find_all("div", attrs = {"class" : "product-item"})
    ])

    count += 1
    expected_url = url % count
    response = get_url(expected_url)

print(results)

The scraped data ends up in results, which is a list of tuples (brand, category, link, name, price, image link), one tuple per product.
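To cover every category in the question's array, note that it mixes bare slugs (like '146-face') with full URLs, so the entries first need to be normalised into absolute page URLs before running the same pagination loop on each. A minimal sketch, assuming the p=<page number> parameter works the same way for every category (category_page_urls is a hypothetical helper, not part of the original code):

```python
from urllib.parse import urljoin

# Category list copied from the question: some entries are bare slugs,
# others are already full URLs.
categories = ['146-face', '153-palettes-sets',
              'https://www.sociolla.com/147-eyes',
              'https://www.sociolla.com/150-lips',
              'https://www.sociolla.com/149-brows',
              'https://www.sociolla.com/148-lashes']

base = "https://www.sociolla.com/"

def category_page_urls(cats, page):
    """Build the ?p=<page> URL for every category.

    urljoin leaves absolute URLs untouched and resolves bare slugs
    against the base, so both forms in the array are handled.
    """
    return [urljoin(base, c) + "?p=%d" % page for c in cats]

for url in category_page_urls(categories, 1):
    print(url)
```

Each URL produced this way can then be fetched and checked against the redirect workaround shown above, incrementing the page number per category until the response URL no longer matches.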
