
Python using Beautiful Soup to extract attribute from HTML

I am trying to use the BeautifulSoup library in Python to extract the jpg image names from an HTML page. Wherever srcset appears in the page's HTML, it is always followed by a jpg file name, for example srcset="https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg". I want to extract all the jpg files this way, but whenever I run the following code it prints None.

import urllib2 
html = urllib2.urlopen("https://www.shopstyle.com/p/prada-notch-lapel-fitted-blazer/645742403").read()

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

print soup.find(attrs= {"img":"srcset"})

Try this:

soup.find('img')['srcset']
'https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg'
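
As a small sketch of how this lookup behaves, assuming Python 3 and a made-up HTML snippet (the `sample_html` string and the example.com URLs are illustrative and not taken from the page): `find` returns only the first matching tag, and `.get` returns None instead of raising a KeyError when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = """
<div>
  <img src="logo.png">
  <img srcset="https://example.com/a_best.jpg">
  <img srcset="https://example.com/b_best.jpg">
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first <img>, which has no srcset in this snippet.
first_img = soup.find("img")
missing = first_img.get("srcset")  # None; first_img["srcset"] would raise KeyError

# Passing srcset=True restricts the search to tags that have the attribute.
first_with_srcset = soup.find("img", srcset=True)
first_url = first_with_srcset["srcset"]

print(missing)
print(first_url)
```

If the first img tag on a real page lacks srcset, filtering with srcset=True as above is one way to avoid the KeyError.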

You said: "I want to extract all the jpg files". To do that:

from bs4 import BeautifulSoup
import requests

html_doc = requests.get("https://www.shopstyle.com/p/prada-notch-lapel-fitted-blazer/645742403")
soup = BeautifulSoup(html_doc.content, 'html.parser')
imgs = [i.get('srcset') for i in soup.find_all('img', srcset=True)]

print(imgs)

The output:

['https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg', 'https://img.shopstyle-cdn.com/pim/16/c3/16c3e46d3547d6404ba29b61b8f229fd_best.jpg', 'https://img.shopstyle-cdn.com/pim/65/e6/65e6d0e3c0160f0aca361934b999f0c9_best.jpg', 'https://img.shopstyle-cdn.com/sim/31/94/3194ec1ca5e3a56cb83f708533b9084d/prada-notch-lapel-fitted-blazer.jpg', 'https://img.shopstyle-cdn.com/sim/16/c3/16c3e46d3547d6404ba29b61b8f229fd/prada-notch-lapel-fitted-blazer.jpg', 'https://img.shopstyle-cdn.com/sim/65/e6/65e6d0e3c0160f0aca361934b999f0c9/prada-notch-lapel-fitted-blazer.jpg', 'https://img.shopstyle-cdn.com/pim/73/76/737689fa284d6640f7619e5f2f3558a5_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/2c/b0/2cb0acb147bd20df78bc482d66d7218b_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/5c/20/5c20824543749df684f3264c5e976e8c_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/48/b8/48b81f60d61e5c23cdfa343940e43ce9_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/ff/08/ff081818581b0363d4c0ec02c2cba5d4_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/86/0a/860ae7abdde0bf40046d53668abbe126_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/2f/5c/2f5c78d017052b14fd2db0d886a2a326_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/49/d5/49d5de5b62e6ddc0864afee987dd5e67_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/50/04/5004bf25e97ac0e4564d8a219a3b34b4_xlarge.jpg', 'https://img.shopstyle-cdn.com/pim/a8/76/a876ac6696e140f34e4cf82b5dbcaadf_xlarge.jpg']
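
Since the question asks for the jpg file names rather than full URLs, here is a minimal sketch of one way to reduce srcset values to file names, assuming Python 3. The helper name `jpg_names` is hypothetical, and splitting on commas is a simplification: a srcset can list several candidates, and a URL that itself contains a comma would be split incorrectly.

```python
import posixpath
from urllib.parse import urlparse


def jpg_names(srcset_values):
    """Return the .jpg file names found in a list of srcset strings."""
    names = []
    for value in srcset_values:
        # A srcset may hold several comma-separated candidates,
        # each a URL optionally followed by a descriptor such as "2x".
        for candidate in value.split(","):
            parts = candidate.split()
            if not parts:
                continue
            # Keep only the last path segment of the URL.
            name = posixpath.basename(urlparse(parts[0]).path)
            if name.lower().endswith(".jpg"):
                names.append(name)
    return names


# Two of the URLs from the output above.
urls = [
    "https://img.shopstyle-cdn.com/pim/31/94/3194ec1ca5e3a56cb83f708533b9084d_best.jpg",
    "https://img.shopstyle-cdn.com/sim/31/94/3194ec1ca5e3a56cb83f708533b9084d/prada-notch-lapel-fitted-blazer.jpg",
]
print(jpg_names(urls))
# ['3194ec1ca5e3a56cb83f708533b9084d_best.jpg', 'prada-notch-lapel-fitted-blazer.jpg']
```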

To find all URLs from srcset you can do this:

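import urllib2  # Python 2 standard library, as in the question
from bs4 import BeautifulSoup

# Download and parse the page
html = urllib2.urlopen("https://www.shopstyle.com/p/prada-notch-lapel-fitted-blazer/645742403").read()
soup = BeautifulSoup(html, 'html.parser')

# Print the srcset value of every <img> tag that has one
for el in soup.find_all('img', attrs={'srcset': True}):
    print(el['srcset'])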

Your query returns None because the attrs argument expects a dictionary that maps an attribute name to a filter value. {"img": "srcset"} therefore searches for a tag that has an attribute named img with the value srcset, and no such tag exists. See the explanation in the bs4 docs.
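
The difference between the two attrs dictionaries can be sketched on a tiny inline document, assuming Python 3. The `sample_html` string and its example.com URL are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
sample_html = '<img srcset="https://example.com/a_best.jpg"><img src="b.png">'
soup = BeautifulSoup(sample_html, "html.parser")

# Searches for any tag with an attribute named "img" whose value is "srcset".
# No tag has such an attribute, so the result is None.
wrong = soup.find(attrs={"img": "srcset"})

# Searches for <img> tags that have a srcset attribute, whatever its value.
right = soup.find("img", attrs={"srcset": True})

print(wrong)
print(right["srcset"])
```

The tag name goes in the first positional argument, while the keys of attrs are attribute names. Using True as the value matches any tag where the attribute is present.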

