Cannot get images from webpage with high resolution using BeautifulSoup and Python
I have a link that contains computer games, and for each game I want to extract only the highest-resolution product image, not every img tag. So far I have:
import re
import requests

# soup2 and site are defined earlier in the script

# get all img tags
img_tags = soup2.find_all('img')

# build a list of the src attribute of every img tag
urls_img = [img['src'] for img in img_tags]

# check each img url and download the jpg/png files
for murl in urls_img:
    filename = re.search(r'/([\w_-]+[.](jpg|png))$', murl)
    if filename is not None:
        # prepend the site root to relative urls
        if 'http' not in murl:
            murl = '{}{}'.format(site, murl)
        # fetch the image and only create the file on success
        response = requests.get(murl)
        if response.status_code == 200:
            with open(filename.group(1), 'wb') as f:
                f.write(response.content)
EDIT: Following discussion, the code below grabs the initial product URLs (excluding placeholders) and visits each product page to find the largest image. The largest image has the attribute data-large_image. I use a Session for the efficiency of connection re-use.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://zegetron.gr/b2b/product-category/pc/?products-per-page=all'
images = []

with requests.Session() as s:
    # gather product page links, skipping listings that only have a placeholder image
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    product_links = [item.select_one('a')['href'] for item in soup.select('.product-wrapper') if item.select_one('[src]:not(.woocommerce-placeholder)')]

    # visit each product page and pull the largest image url
    for link in product_links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        images.append(soup.select_one('[data-large_image]')['data-large_image'])
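To then save the collected images to disk, as the question originally wanted, a minimal sketch (my addition, reusing the filename regex from the question and assuming images was populated as above) could be:

import re
import requests

for murl in images:
    match = re.search(r'/([\w_-]+[.](jpg|png))$', murl)
    if match is None:
        continue  # skip urls that do not end in a jpg/png filename
    response = requests.get(murl)
    if response.status_code == 200:
        # write the file only after a successful download
        with open(match.group(1), 'wb') as f:
            f.write(response.content)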
Previous answer (based on the original single URL listing all products): Try the following, which looks for a srcset attribute in each listing. If present, it takes the last src link listed (since they are ordered by ascending size); otherwise, it falls back to the src attribute.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('http://zegetron.gr/b2b/product-category/pc/?products-per-page=all')
soup = bs(r.content, 'lxml')
listings = soup.select('.thumb-wrapper')

images = []
for listing in listings:
    link = ''
    if listing.select_one(':has([srcset])'):
        # srcset entries are listed in ascending size, so take the last one
        links = listing.select_one('[srcset]')['srcset']
        link = links.split(',')[-1]
        link = link.split(' ')[1]
    else:
        # fall back to the plain src, ignoring placeholder images
        if listing.select_one('[src]:not(.woocommerce-placeholder)'):
            link = listing.select_one('img[src]')['src']
    if link:
        images.append(link)
print(images)
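As a side note, relying on the srcset entries being in ascending order is a little fragile. A small sketch of my own (not part of the original answer) instead parses each candidate's w width descriptor and keeps the maximum:

def largest_from_srcset(srcset):
    # return the url with the largest 'w' width descriptor in a srcset string
    best_url, best_width = None, -1
    for candidate in srcset.split(','):
        parts = candidate.strip().split()
        if not parts:
            continue
        url = parts[0]
        # descriptors look like '600w'; treat a missing one as width 0
        width = 0
        if len(parts) > 1 and parts[1].endswith('w'):
            width = int(parts[1][:-1])
        if width > best_width:
            best_url, best_width = url, width
    return best_url

# hypothetical srcset value for illustration
print(largest_from_srcset('a.jpg 300w, b.jpg 600w, c.jpg 150w'))  # b.jpg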
I found the following, which is perhaps simpler, and it solved my problem:
urls_img = []
for each_img_tag in img_tags:
    # keep only images whose declared width is above 500 pixels
    width = each_img_tag.get('width')
    if width is not None and int(width) > 500:
        urls_img.append(each_img_tag['src'])
Though I don't know whether it is faster.
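If only the single highest-resolution image is wanted rather than every image wider than 500 pixels, a minimal sketch along the same lines (my assumption, not part of the original answer, and assuming the width attribute is a plain integer whenever present) would be:

# pick the single img tag with the largest declared width
sized = [img for img in img_tags if img.get('width') is not None]
if sized:
    largest = max(sized, key=lambda img: int(img['width']))
    print(largest['src'])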