繁体   English   中英

似乎无法从网页中抓取特定信息?

[英]Can't seem to scrape specific information from webpage?

我正在尝试为以下页面上显示的每个项目抓取一些信息: https ://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+ Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&

但是,我似乎无法访问项目信息。 我需要的信息是每个产品的名称和链接,例如第一项包含在:

<a class="catalog_item_name" aria-hidden="true" tabindex="-1" id="WC_CatalogEntryDBThumbnailDisplayJSPF_3074457345616901168_link_9b" href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New+Spirits&amp;categoryType=Spirits&amp;fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0">Woodford Reserve Master Collection Five Malt Stouted Mash</a>

所以我试图抓取的信息是:

Woodford Reserve Master Collection Five Malt Stouted Mash

/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&amp;storeId=10051&amp;productId=3074457345616901168&amp;langId=-1&amp;partNumber=000086630prod&amp;errorViewName=ProductDisplayErrorView&amp;categoryId=1351370&amp;top_category=25208&amp;parent_category_rn=25208&amp;urlLangId=&amp;variety=New+Spirits&amp;categoryType=Spirits&amp;fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0

我正在尝试对页面上的每个项目进行迭代。 我肯定会连接到该页面,但由于某种原因,我无法使用for product in soup.select抓取任何信息 下面是我的脚本的简化版本,我一直在尝试从上面的 catalog_item_name 收集信息

import requests
import sys
import time
import smtplib
from email.message import EmailMessage
import hashlib
from urllib.request import urlopen
from datetime import datetime
import json
import random
import requests
from itertools import cycle
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from urllib3.exceptions import InsecureRequestWarning

from requests_html import HTMLSession
session = HTMLSession()


user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
for i in range(1,4):
    #Pick a random user agent
    user_agent = random.choice(user_agent_list)



url = []
url = 'https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&'

response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text,features="html.parser")
link = []

for product in soup.select('a.catalog_item_name'):
    link.append(product)

print(link)

任何帮助将不胜感激!

编辑:用另外两个网站测试了脚本,它工作得很好。 网站上一定有什么东西让你失望了?

我想这里最好的方法是检查网络流量并直接查询 API。 例如,对于上面的 url,在https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CategoryProductsListingView有一些针对 API 的POST请求。

我可以用它来获取产品列表,即:

from bs4 import BeautifulSoup
import requests
import urllib

base_url = 'https://www.finewineandgoodspirits.com'
path = '/webapp/wcs/stores/servlet/CategoryProductsListingView?sType=SimpleSearch&resultsPerPage=15&sortBy=0&disableProductCompare=false&ajaxStoreImageDir=%2fwcsstore%2fWineandSpirits%2f&variety=New+Spirits&categoryType=Spirits&ddkey=ProductListingView'
params = {
    'storeId': '10051',
    'categoryId': '1351370',
    'searchType': '1002'
}

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'some super fancy browser',
}

request_url = base_url + path + '&' + urllib.parse.urlencode(params)
response = requests.post(request_url, headers=headers)
soup = BeautifulSoup(response.text)

# now, extract the content form the soup, eg like you did above
product_links: list[str] = [base_url + a['href'] for a in soup.select('a.catalog_item_name')]

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM