Can't scrape category titles from a webpage
I've written a scraper in Python to get the different category names from a webpage, but it is unable to fetch anything from that page. I'm seriously confused and can't figure out where I'm going wrong. Any help would be vastly appreciated.
Here is the link to the webpage: https://www.therange.co.uk/
Here is what I've tried so far:
from bs4 import BeautifulSoup
import requests

# Fetch the page with only a User-Agent header (URL is the one linked above)
res = requests.get("https://www.therange.co.uk/", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

# Each category title sits in a .h3.standardTitle div inside a .slide_container
for items in soup.select('.slide_container .h3.standardTitle'):
    print(items.text)
Here is the element containing one such category name I'm after:
<div class="slide_container">
<a href="/offers/furniture/" tabindex="0">
<picture style="float: left; width: 100%;"><img style="width:100%" src="/_m4/9/8/1513184943_4413.jpg" data-w="270"></picture>
<div class="floated-details inverted" style="height: 69px;">
<div class="h3 margin-top-sm margin-bottom-sm standardTitle">
Furniture Offers <!-- This is the name I'm after -->
</div>
<p class="carouselDesc">
</p>
</div>
</a>
</div>
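A quick way to confirm the selector itself is fine is to parse the snippet above in isolation; it matches there, which suggests the problem is in the response being fetched rather than in the CSS selector. A minimal offline sketch (not part of the original post):

from bs4 import BeautifulSoup

# The markup from above, pasted as a literal string for an offline test
snippet = '''
<div class="slide_container">
 <a href="/offers/furniture/" tabindex="0">
  <div class="floated-details inverted">
   <div class="h3 margin-top-sm margin-bottom-sm standardTitle">
    Furniture Offers
   </div>
  </div>
 </a>
</div>
'''

soup = BeautifulSoup(snippet, "html.parser")
for item in soup.select('.slide_container .h3.standardTitle'):
    print(item.get_text(strip=True))  # prints: Furniture Offers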
from bs4 import BeautifulSoup
import requests

# Send the full set of headers a real browser would send, not just a User-Agent
headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',  # 'br' needs the brotli package installed; drop it otherwise
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    'referer': 'https://www.therange.co.uk/',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
}

res = requests.get("https://www.therange.co.uk/", headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')

for items in soup.select('.slide_container .h3.standardTitle'):
    print(items.text)
Try this.

A user-agent on its own is not enough, because the headers are the most important part of scraping; if you miss any of them, the server will treat you as a bot.

Use "html.parser" instead of "lxml":

soup = BeautifulSoup(res.text, "html.parser")
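Building on that, here is a minimal follow-up sketch (my own, assuming every slide on the page follows the markup structure shown in the question) that keeps the browser-like headers on a requests.Session and pairs each category title with the link it points to:

from bs4 import BeautifulSoup
import requests

headers = {
    'accept-language': 'en-US,en;q=0.9',
    'referer': 'https://www.therange.co.uk/',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
}

with requests.Session() as session:
    session.headers.update(headers)  # these headers are sent with every request on this session
    res = session.get("https://www.therange.co.uk/")
    soup = BeautifulSoup(res.text, "html.parser")

    # Each slide is an <a> inside .slide_container; the title div sits inside that <a>
    for anchor in soup.select('.slide_container > a'):
        title = anchor.select_one('.h3.standardTitle')
        if title:
            print(title.get_text(strip=True), '->', anchor.get('href'))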