How to loop over and scrape data from multiple pages of one website using Python and BeautifulSoup4
# -*- coding: utf-8 -*-
"""
Created on Fri Jun 29 10:38:46 2018
@author: Cinthia
"""
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
array = [
    'https://www.sociolla.com/146-face',
    'https://www.sociolla.com/153-palettes-sets',
    'https://www.sociolla.com/147-eyes',
    'https://www.sociolla.com/150-lips',
    'https://www.sociolla.com/149-brows',
    'https://www.sociolla.com/148-lashes',
]
base_url='https://www.sociolla.com/142-face'
uClient = uReq(base_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grab the product
kosmetik = page_soup.findAll("div", {"class":"col-md-3 col-sm-6 ipad-grid col-xs-12 productitem"})
print(len(kosmetik))
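I want to scrape data from this website. The code above only counts how many products are on the base URL. I don't know how to make the array work so that the product data (for example description, image, price) is collected from every page I listed in the array.
I'm new to Python and don't know much about loops.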
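The root element of the table/grid is the element with id=product-list-grid. From each product inside it you can extract the attributes that hold the information you need (brand, link, category), along with the attributes of the first <img> tag.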
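The extraction step described above can be tried offline first. The following is a minimal sketch that runs against a small hand-written HTML sample rather than the live site. The sample markup, the `example.com` URLs, and the helper name `parse_products` are assumptions made for illustration; only the `product-list-grid` id, the `product-item` class, and the `data-eec-*` attribute names come from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the attribute names used in the answer;
# the real page may be structured differently.
SAMPLE_HTML = """
<div id="product-list-grid">
  <div class="product-item"
       data-eec-brand="BrandA"
       data-eec-category="Nails"
       data-eec-href="https://example.com/product-1"
       data-eec-name="Example Polish"
       data-eec-price="50000">
    <img src="https://example.com/img/1.jpg">
    <img src="https://example.com/img/1-hover.jpg">
  </div>
  <div class="product-item" data-eec-brand="BrandB" data-eec-name="No Image Item"></div>
</div>
"""

FIELDS = (
    "data-eec-brand",
    "data-eec-category",
    "data-eec-href",
    "data-eec-name",
    "data-eec-price",
)


def parse_products(html):
    """Return one dict per product found inside the grid element."""
    soup = BeautifulSoup(html, "html.parser")
    grid = soup.find("div", attrs={"id": "product-list-grid"})
    if grid is None:
        # No grid on this page, so there is nothing to extract.
        return []
    rows = []
    for item in grid.find_all("div", attrs={"class": "product-item"}):
        # .get() returns None instead of raising when an attribute is missing.
        row = {name: item.get(name) for name in FIELDS}
        img = item.find("img")  # first <img> inside the product, if any
        row["image"] = img.get("src") if img is not None else None
        rows.append(row)
    return rows


products = parse_products(SAMPLE_HTML)
for product in products:
    print(product)
```

Using `.get()` rather than `t["..."]` is a design choice for this sketch: a product that lacks one attribute yields `None` for that field instead of stopping the whole run with a `KeyError`.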
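For pagination, it seems you can append p=<page number> to move to the next page; when that page does not exist, the site redirects to the first page. One workaround is to compare the response URL with the URL you requested. If they match, increment the page number and continue; otherwise every page has been scraped.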
from bs4 import BeautifulSoup
import urllib.request

count = 1
url = "https://www.sociolla.com/142-nails?p=%d"

def get_url(url):
    req = urllib.request.Request(url)
    return urllib.request.urlopen(req)

expected_url = url % count
response = get_url(expected_url)
results = []
# Keep going while the server answers with the URL we asked for;
# a redirect to the first page means we ran past the last page.
while response.url == expected_url:
    print("GET {0}".format(expected_url))
    soup = BeautifulSoup(response.read(), "html.parser")
    products = soup.find("div", attrs={"id": "product-list-grid"})
    # extend (not append) keeps results a flat list of tuples
    results.extend(
        (
            t["data-eec-brand"],     # brand
            t["data-eec-category"],  # category
            t["data-eec-href"],      # product link
            t["data-eec-name"],      # product name
            t["data-eec-price"],     # price
            t.find("img")["src"],    # image link
        )
        for t in products.find_all("div", attrs={"class": "product-item"})
    )
    count += 1
    expected_url = url % count
    response = get_url(expected_url)
print(results)
The scraped data is stored in results as tuples of the form (brand, category, product link, product name, price, image link).
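The question also asked how to cover every category in the array, not just one. One way to combine that with the redirect check is sketched below. This is an illustration under assumptions, not a tested scraper for the site: the helper names (`page_url`, `scrape_category`, `scrape_all`), the `max_pages` safety cap, and the fake opener are all hypothetical, and the `<slug>?p=<n>` URL pattern is taken from the answer. The fake opener stands in for the network so the loop can be run offline.

```python
import urllib.request

BASE_URL = "https://www.sociolla.com/"


def page_url(slug, page):
    """Build the URL for one page of a category slug such as '147-eyes'."""
    return "{0}{1}?p={2}".format(BASE_URL, slug, page)


def scrape_category(slug, open_url=urllib.request.urlopen, max_pages=50):
    """Collect the raw HTML of each page in one category.

    Stops as soon as the response URL differs from the requested URL,
    which the answer treats as the sign that the page does not exist.
    """
    pages = []
    for page in range(1, max_pages + 1):
        expected = page_url(slug, page)
        response = open_url(expected)
        if response.url != expected:
            break
        pages.append(response.read())
    return pages


def scrape_all(slugs, open_url=urllib.request.urlopen):
    """Map each category slug to the list of its pages' HTML."""
    return {slug: scrape_category(slug, open_url) for slug in slugs}


# --- Offline stand-in for the website (hypothetical, for demonstration) ---
class FakeResponse:
    def __init__(self, url, body):
        self.url = url
        self._body = body

    def read(self):
        return self._body


def make_fake_open(page_counts):
    """Return an opener that redirects to the first page past the last one."""
    def fake_open(url):
        slug, _, page = url[len(BASE_URL):].partition("?p=")
        if int(page) <= page_counts.get(slug, 0):
            return FakeResponse(url, "<html>{0} page {1}</html>".format(slug, page))
        return FakeResponse(BASE_URL + slug, "<html>{0} page 1</html>".format(slug))
    return fake_open


fake_open = make_fake_open({"147-eyes": 2, "150-lips": 1})
pages = scrape_all(["147-eyes", "150-lips"], open_url=fake_open)
print({slug: len(html_pages) for slug, html_pages in pages.items()})
```

To run it against the real site, call `scrape_all` with the category slugs and leave `open_url` at its default, then pass each returned page to the BeautifulSoup extraction shown in the answer.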