Python Beautiful Soup meta content tag
I'm trying to extract the price from a web site that includes the following HTML:
<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
$ 29.99 </div>
I'm using the following Beautiful Soup code:
book_prices = soup_packtpage.find_all(class_="book-block-price ")
print(book_prices)
for book_price in book_prices:
    printable_version_price = book_price.meta.string
    print(printable_version_price)
print(book_prices) yields:
[<div class="book-block-price " itemprop="offers" itemscope="" itemtype="http://schema.org/Offer">
<meta content="29.99" itemprop="price"/>
<meta content="" itemprop="price"/>
$ 29.99
print(printable_version_price) yields "None".
How do I deal with meta tags? Or do I have other problems?
book_price.meta matches the first meta tag inside the book price block. That first meta tag has no text, only attributes, so its .string is None, which is why None gets printed:
<meta itemprop="price" content="29.99"/>
Instead, get the content attribute value:
book_price.meta["content"]
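For completeness, here is a minimal self-contained sketch of the same idea, run against the HTML from the question. Looking the tags up by itemprop and skipping the meta tag whose content is empty are my own assumptions for illustration, not something the original code did; the variable names are hypothetical too.

```python
from bs4 import BeautifulSoup

# The HTML fragment from the question, inlined so the example runs on its own.
html = """
<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
$ 29.99 </div>
"""

soup = BeautifulSoup(html, "html.parser")

prices = []
# Assumption: match the block by its itemprop attribute instead of its class.
for block in soup.find_all("div", itemprop="offers"):
    for meta in block.find_all("meta", itemprop="price"):
        # A meta tag has no text, so read the content attribute instead.
        content = meta.get("content", "")
        if content:  # skip the second meta tag, whose content is empty
            prices.append(content)
            break

print(prices)  # ['29.99']
```

Using meta.get("content", "") rather than meta["content"] avoids a KeyError if some meta tag lacks the attribute.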
You could probably do it with lxml's etree (pseudo-code, but should be enough to get you going):
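from lxml import etree
# x is a file-like object; for a string, use etree.fromstring(x, etree.HTMLParser())
doc = etree.parse(x, etree.HTMLParser())
# select the content attribute of every meta tag whose itemprop is "price"
print(doc.xpath('//meta[@itemprop="price"]/@content'))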
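A runnable version of the lxml approach might look like the sketch below. It assumes the page is available as a string (wrapped here in html and body tags purely for illustration) and adds an XPath predicate, which is my own addition, to drop the meta tag with the empty content attribute.

```python
from lxml import etree

# The HTML from the question, wrapped in a minimal document for illustration.
page = """
<html><body>
<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
$ 29.99 </div>
</body></html>
"""

# HTMLParser tolerates real-world HTML that is not well-formed XML.
doc = etree.fromstring(page, etree.HTMLParser())

# @itemprop tests the attribute; /@content returns attribute values, not text.
prices = doc.xpath('//meta[@itemprop="price"][@content!=""]/@content')

print(prices)  # ['29.99']
```

Note the @ in front of itemprop: without it, XPath looks for a child element named itemprop rather than an attribute, and the query matches nothing.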