Python Beautiful Soup meta content tag

Question

I'm trying to extract the price from a website that includes the following HTML:

<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
    $ 29.99         </div>

I'm using the following Beautiful Soup code:

book_prices = soup_packtpage.find_all(class_="book-block-price ")
print(book_prices)
for book_price in book_prices:
    printable_version_price = book_price.meta.string
    print(printable_version_price)

print(book_prices) yields:

[<div class="book-block-price " itemprop="offers" itemscope=""    itemtype="http://schema.org/Offer">
<meta content="29.99" itemprop="price"/>
<meta content="" itemprop="price"/>
            $ 29.99

print(printable_version_price) yields "None".

How do I deal with meta tags? Or do I have other problems?

Answer 1

book_price.meta matches the first meta tag inside the book price block. That tag has no text content (its value is stored in the content attribute), so .string returns None rather than the price:

<meta itemprop="price" content="29.99"/>

Instead, get the content attribute value:

book_price.meta["content"]
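
As a minimal sketch of this approach, the snippet below parses the markup from the question and reads the content attribute of every price meta tag, skipping the empty one. The variable names (html, block, prices, first_meta_string) and the choice of the built-in "html.parser" are assumptions made for illustration; they are not part of the original code.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question
html = """
<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
    $ 29.99         </div>
"""

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", class_="book-block-price")

# A meta tag has no children, so .string is None
first_meta_string = block.meta.string

# Collect non-empty content attributes from every price meta tag
prices = [
    meta["content"]
    for meta in block.find_all("meta", itemprop="price")
    if meta.get("content")
]

print(first_meta_string)
print(prices)
```

Using meta.get("content") in the filter avoids a KeyError if a meta tag lacks the attribute entirely, and it also drops the second tag, whose content is an empty string.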

Answer 2

You could probably do it with lxml's etree (a rough sketch, but it should be enough to get you going):

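from lxml import etree
# x is a file-like object; for an HTML string, use etree.HTML(x) instead
doc = etree.parse(x, etree.HTMLParser())
# select the content attribute of every meta tag with itemprop="price"
print(doc.xpath('//meta[@itemprop="price"]/@content'))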
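
For a self-contained version of the XPath idea, here is a hedged sketch that assumes lxml is installed and that the HTML is already available as a string. It uses etree.HTML to parse the string and selects the content attribute with @content; the names html, contents, and prices are illustrative only.

```python
from lxml import etree

# Sample markup copied from the question
html = """
<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
    $ 29.99         </div>
"""

# etree.HTML parses an HTML string with lxml's lenient HTML parser
doc = etree.HTML(html)

# @itemprop tests the attribute; /@content returns the attribute values
contents = doc.xpath('//meta[@itemprop="price"]/@content')

# Drop empty values such as the second meta tag's content=""
prices = [c for c in contents if c]

print(prices)
```

Note that text() would return nothing here, because meta tags carry their value in an attribute and have no text nodes.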
