Can't seem to scrape specific information from webpage?

I'm trying to scrape some information for each item shown on the following page: https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&

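However, I can't seem to access the item information. What I need is the name and link of each product. For example, the first item is contained in: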

<a class="catalog_item_name" aria-hidden="true" tabindex="-1" id="WC_CatalogEntryDBThumbnailDisplayJSPF_3074457345616901168_link_9b" href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New+Spirits&amp;categoryType=Spirits&amp;fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0">Woodford Reserve Master Collection Five Malt Stouted Mash</a>

So the information I'm trying to scrape is:

Woodford Reserve Master Collection Five Malt Stouted Mash

/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New+Spirits&amp;categoryType=Spirits&amp;fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0

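I'm trying to iterate over every item on the page. I'm definitely connecting to the page, but for some reason I can't scrape any information with for product in soup.select. Below is a simplified version of my script, in which I've been trying to collect the information from catalog_item_name above: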

import random

import requests
from bs4 import BeautifulSoup

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
# Pick a random user agent and send it with the request
user_agent = random.choice(user_agent_list)
headers = {'User-Agent': user_agent}

# Note: everything after '#' is a URL fragment and is never sent to the server
url = 'https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, features="html.parser")
link = []

# Collect every anchor that carries the product name and link
for product in soup.select('a.catalog_item_name'):
    link.append(product)

print(link)

Any help would be greatly appreciated!

Edit: I tested the script with two other websites and it works fine. There must be something on this website that is throwing it off?

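I think the best approach here is to inspect the network traffic and query the API directly. For example, for the URL above, there are some POST requests to an API at https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CategoryProductsListingView. Presumably the product list is not part of the initial HTML and is loaded afterwards by these requests, which would explain why soup.select finds nothing in the page returned by requests.get.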
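A quick way to check whether the product anchors are present in a given response is to test if the class name appears in the raw HTML at all. The sketch below is only illustrative: contains_class and the two sample strings are hypothetical stand-ins, not real output from the site.

```python
def contains_class(markup: str, class_name: str) -> bool:
    """Return True if class_name appears anywhere in the raw markup."""
    return class_name in markup


# Hypothetical stand-ins for the two kinds of response (not real site output):
# a page shell whose product list is filled in later by JavaScript ...
initial_page = (
    '<html><body><div id="results"></div>'
    '<script src="app.js"></script></body></html>'
)
# ... and an HTML fragment of the kind a listing endpoint might return.
ajax_fragment = '<a class="catalog_item_name" href="/example">Example product</a>'

print(contains_class(initial_page, "catalog_item_name"))   # False
print(contains_class(ajax_fragment, "catalog_item_name"))  # True
```

In the original script the same check would be contains_class(response.text, "catalog_item_name"). If it returns False for the page fetched with requests.get, no CSS selector can match, and the listing endpoint mentioned above is the place to look.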

I can use this endpoint to get the product list, i.e.:

from bs4 import BeautifulSoup
import requests
import urllib.parse  # the parse submodule has to be imported explicitly

base_url = 'https://www.finewineandgoodspirits.com'
path = '/webapp/wcs/stores/servlet/CategoryProductsListingView?sType=SimpleSearch&resultsPerPage=15&sortBy=0&disableProductCompare=false&ajaxStoreImageDir=%2fwcsstore%2fWineandSpirits%2f&variety=New+Spirits&categoryType=Spirits&ddkey=ProductListingView'
params = {
    'storeId': '10051',
    'categoryId': '1351370',
    'searchType': '1002'
}

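headers = {
    'Accept': '*/*',
    # 'br' (Brotli) responses are only decoded if the brotli package is installed;
    # drop 'br' here if response.text comes back garbled
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'some super fancy browser',
}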

request_url = base_url + path + '&' + urllib.parse.urlencode(params)
response = requests.post(request_url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# now, extract the content from the soup, e.g. like you did above
product_links: list[str] = [base_url + a['href'] for a in soup.select('a.catalog_item_name')]
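Since the question asks for both the name and the link of each product, the extraction step could be sketched as follows. This is a minimal example rather than the answer's exact method: extract_products is a hypothetical helper, and the sample markup is a shortened copy of the anchor quoted in the question (most of its query string is trimmed), so it runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.finewineandgoodspirits.com"


def extract_products(markup: str, base_url: str = BASE_URL) -> list[tuple[str, str]]:
    """Return (name, absolute_url) pairs for every a.catalog_item_name element."""
    soup = BeautifulSoup(markup, "html.parser")
    products = []
    for anchor in soup.select("a.catalog_item_name"):
        href = anchor.get("href")
        if not href:
            continue  # skip anchors that carry no link
        name = anchor.get_text(strip=True)
        # urljoin turns the site-relative href into an absolute URL
        products.append((name, urljoin(base_url, href)))
    return products


# Shortened version of the anchor from the question (query string trimmed).
sample = (
    '<a class="catalog_item_name" '
    'href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051">'
    "Woodford Reserve Master Collection Five Malt Stouted Mash</a>"
)

for name, url in extract_products(sample):
    print(name)
    print(url)
```

With a live response, the call would be extract_products(response.text). The parser decodes the &amp;amp; entities in the href, so the resulting URLs contain plain & separators and can be requested directly.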

