How do I scrape data from multiple pages of a website using Python and BeautifulSoup?

Question

# -*- coding: utf-8 -*-
"""
Created on Fri Jun 29 10:38:46 2018

@author: Cinthia
"""

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
array = ['146-face', '153-palettes-sets', 'https://www.sociolla.com/147-eyes', 'https://www.sociolla.com/150-lips', 'https://www.sociolla.com/149-brows', 'https://www.sociolla.com/148-lashes']
base_url='https://www.sociolla.com/142-face'
uClient = uReq(base_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

#grab the product
kosmetik = page_soup.findAll("div", {"class":"col-md-3 col-sm-6 ipad-grid col-xs-12 productitem"})
print(len(kosmetik))

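I want to scrape data from this website. The code above only counts how many products are on the base URL. I don't know how to use the array so that the product data (for example description, image, price) is collected from every page I listed in it.

I'm new to Python and don't know much about loops.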

Answer 1

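You can find the root element of the table/grid, id=product-list-grid, and from each product inside it extract the attributes that hold the information you need (brand, link, category), plus the src attribute of the first <img> tag.

For pagination, it looks like you can append p=<page number> to move to the next page, and when that page does not exist the site redirects to the first page. One workaround is to compare the response URL with the URL you requested. If they are the same, increment the page number and continue; if they differ, all pages have been scraped.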
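As a minimal sketch of that extraction, the snippet below parses a small hand-written HTML fragment rather than the live site. The fragment's markup and values are made up for illustration, and extract_products is a hypothetical helper name; only the id product-list-grid, the product-item class, and the data-eec-* attribute names come from this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure described in this answer.
SAMPLE_HTML = """
<div id="product-list-grid">
  <div class="product-item" data-eec-brand="BrandA" data-eec-category="Nails"
       data-eec-href="/p/1" data-eec-name="Polish" data-eec-price="10000">
    <img src="/img/1.jpg">
  </div>
  <div class="product-item" data-eec-brand="BrandB" data-eec-category="Nails"
       data-eec-href="/p/2" data-eec-name="Remover" data-eec-price="25000">
    <img src="/img/2.jpg">
  </div>
</div>
"""


def extract_products(html):
    """Return one dict per product in the grid, or [] if the grid is missing."""
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find("div", attrs={"id": "product-list-grid"})
    if grid is None:
        return []

    products = []
    for item in grid.find_all("div", class_="product-item"):
        img = item.find("img")  # first <img> inside the product element
        products.append({
            "brand": item.get("data-eec-brand"),
            "category": item.get("data-eec-category"),
            "link": item.get("data-eec-href"),
            "name": item.get("data-eec-name"),
            "price": item.get("data-eec-price"),
            "image": img.get("src") if img is not None else None,
        })
    return products


products = extract_products(SAMPLE_HTML)
```

Using .get() instead of indexing means a product that lacks one of the attributes yields None for that field rather than raising KeyError.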

from bs4 import BeautifulSoup
import urllib.request

count = 1
url = "https://www.sociolla.com/142-nails?p=%d"

def get_url(url):
    req = urllib.request.Request(url)
    return urllib.request.urlopen(req)

expected_url = url % count
response = get_url(expected_url)

results = []

# A page that does not exist redirects to the first page, so stop as soon as
# the final URL differs from the one that was requested.
while response.url == expected_url:
    print("GET {0}".format(expected_url))
    soup = BeautifulSoup(response.read(), "html.parser")

    products = soup.find("div", attrs = {"id" : "product-list-grid"})

    # extend (rather than append) keeps results a flat list of tuples
    results.extend(
        (
            t["data-eec-brand"],    #brand
            t["data-eec-category"], #category
            t["data-eec-href"],     #product link
            t["data-eec-name"],     #product name
            t["data-eec-price"],    #price
            t.find("img")["src"]    #image link
        )
        for t in products.find_all("div", attrs = {"class" : "product-item"})
    )

    count += 1
    expected_url = url % count
    response = get_url(expected_url)

print(results)

The results end up in results, with each product stored as a tuple.
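To cover several categories, as the question asks, the same redirect check can be wrapped in a function and called once per category. This is a rough sketch under assumptions, not a tested scraper for the site: the helper names (category_url, scrape_category, scrape_all, fetch_with_urllib), the fetch and parse parameters, and the max_pages safety cap are all hypothetical, and it assumes every category paginates with p=<page number> in the same way. fetch(url) is expected to return the final URL after redirects together with the HTML; parse(html) is expected to return a list of product tuples, for example built with the tuple expression from the code above.

```python
import urllib.request

BASE = "https://www.sociolla.com/"


def category_url(entry):
    """Accept a bare slug ('150-lips') or a full URL and return a full URL."""
    return entry if entry.startswith("http") else BASE + entry


def fetch_with_urllib(url):
    """Fetch a URL and return (final URL after redirects, HTML bytes)."""
    with urllib.request.urlopen(url) as response:
        return response.geturl(), response.read()


def scrape_category(entry, fetch, parse, max_pages=100):
    """Collect products from every page of one category."""
    results = []
    for page in range(1, max_pages + 1):
        expected_url = "{0}?p={1}".format(category_url(entry), page)
        final_url, html = fetch(expected_url)
        if final_url != expected_url:
            break  # redirected, so this page does not exist
        results.extend(parse(html))
    return results


def scrape_all(entries, fetch, parse):
    """Map each category entry to the list of products scraped from it."""
    return {entry: scrape_category(entry, fetch, parse) for entry in entries}
```

With a suitable parse function, a call such as scrape_all(array, fetch_with_urllib, parse) would loop over the list from the question; category_url is there because that list mixes bare slugs with full URLs. Passing fetch and parse in as arguments also makes it possible to try the loop with stand-in functions before hitting the real site.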

Solution 1
0 Accepted 2018-06-30 22:07:51
