Python: extracting attributes from HTML with Beautiful Soup

Question

I am trying to use the BeautifulSoup library in Python to extract the jpg image names from an HTML page. In the page source, wherever srcset appears it is always followed by a jpg file name. I want to extract all the jpg files this way, but whenever I run the following code it prints None. For example, srcset="https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg" can be found in the HTML.

import urllib2 
html = urllib2.urlopen("https://www.shopstyle.com/p/prada-notch-lapel-fitted-blazer/645742403").read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

print soup.find(attrs= {"img":"srcset"})

Answer 1

Try this:

soup.find('img')['srcset']
'https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg'
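
Note that find('img') returns the first img tag on the page whether or not it carries a srcset attribute, so the lookup above raises KeyError if that tag has none, and TypeError if the page has no img at all. A slightly more defensive variant might look like the sketch below. The HTML snippet is made up for illustration (it is not the real page), and the name first_srcset is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the first <img> has no srcset.
html = """
<img src="logo.png">
<img srcset="https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

# srcset=True matches only <img> tags that actually carry the attribute.
first = soup.find("img", srcset=True)

# find() returns None when nothing matches, so guard before indexing.
first_srcset = first["srcset"] if first is not None else None
print(first_srcset)
```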

Answer 2

I want to extract all the jpg files

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://www.shopstyle.com/p/prada-notch-lapel-fitted-blazer/645742403")
soup = BeautifulSoup(html_doc.content, 'html.parser')
imgs = [i.get('srcset') for i in soup.find_all('img', srcset=True)]

print(imgs)

The output:

['https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg', 'https://img.shopstyle-cdn.com/pim/16/c3/16c3e46d3547d6404ba29b61b8f229fd_best.jpg', 'https://img.shopstyle-cdn.com/pim/65/e6/65e6d0e3c0160f0aca361934b999f0c9_best.jpg', 'https://img.shopstyle-cdn.com/sim/31/94/3194ec1ca5e3a56cb83f708533b9084d/prada-notch-lapel-fitted-blazer.jpg', 'https://img.shopstyle-cdn.com/sim/16/c3/16c3e46d3547d6404ba29b61b8f229fd/prada-notch-lapel-fitted-blazer.jpg', 'https://img.shopstyle-cdn.com/sim/65/e6/65e6d0e3c0160f0aca361934b999f0c9/prada-notch-lapel-fitted-blazer.jpg', 'https://img.shopstyle-cdn.com/pim/73/76/737689fa284d6640f7619e5f2f3558a5_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/2c/b0/2cb0acb147bd20df78bc482d66d7218b_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/5c/20/5c20824543749df684f3264c5e976e8c_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/48/b8/48b81f60d61e5c23cdfa343940e43ce9_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/ff/08/ff081818581b0363d4c0ec02c2cba5d4_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/86/0a/860ae7abdde0bf40046d53668abbe126_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/2f/5c/2f5c78d017052b14fd2db0d886a2a326_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/49/d5/49d5de5b62e6ddc0864afee987dd5e67_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/50/04/5004bf25e97ac0e4564d8a219a3b34b4_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/a8/76/a876ac6696e140f34e4cf82b5dbcaadf_xlarge.jpg']
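
The list above holds full URLs, while the question asks for the jpg file names. One possible way to reduce each URL to its file name with the standard library is sketched below. The helper jpg_names is a hypothetical name, the two sample URLs are copied from the output above, and splitting on commas is a simplification that assumes the URLs themselves contain no commas.

```python
import posixpath
from urllib.parse import urlsplit


def jpg_names(srcset_values):
    """Return the .jpg file names found in a list of srcset strings."""
    names = []
    for value in srcset_values:
        # A srcset may hold several comma-separated candidates, each a URL
        # optionally followed by a descriptor such as "2x" or "480w".
        for candidate in value.split(","):
            parts = candidate.split()
            if not parts:
                continue
            # Keep only the last path segment of the candidate URL.
            name = posixpath.basename(urlsplit(parts[0]).path)
            if name.lower().endswith(".jpg"):
                names.append(name)
    return names


imgs = [
    "https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg",
    "https://img.shopstyle-cdn.com/sim/31/94/3194ec1ca5e3a56cb83f708533b9084d/prada-notch-lapel-fitted-blazer.jpg",
]
print(jpg_names(imgs))
```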

Answer 3

To find all URLs from srcset you can do this:

import urllib2 
html = urllib2.urlopen("https://www.shopstyle.com/p/prada-notch-lapel-fitted-blazer/645742403").read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for el in soup.findAll('img', attrs = {'srcset' : True}):
    print el['srcset']

Your query returns None because the attrs argument expects a dictionary that maps an attribute name (the key) to a filter (the value). {"img": "srcset"} therefore searches for a tag that has an attribute named img with the value srcset, and no such tag exists. See the explanation in the bs4 docs.
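
To make the difference concrete, here is a minimal sketch that runs both queries against a small inline document instead of the live page. The markup and the names wrong and right are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <img> with srcset, one without.
html = '<div><img srcset="a.jpg"><img src="b.png"></div>'
soup = BeautifulSoup(html, "html.parser")

# Searches for any tag whose attribute "img" equals "srcset": nothing matches.
wrong = soup.find(attrs={"img": "srcset"})

# Searches for <img> tags that have a srcset attribute, whatever its value.
right = soup.find_all("img", attrs={"srcset": True})

print(wrong)
print([tag["srcset"] for tag in right])
```

The tag name goes in the first positional argument, and the attrs dictionary only describes attributes of that tag.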

3 answers

Answer 1
2 2017-10-05 09:55:03

Answer 2
2 2017-10-05 10:00:31

Answer 3
1 (accepted) 2017-10-05 09:57:50
