
Python Beautiful Soup meta content tag

I'm trying to extract the price from a web site that includes the following HTML:

<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
    $ 29.99         </div>

I'm using the following Beautiful Soup code:

book_prices = soup_packtpage.find_all(class_="book-block-price ")
print(book_prices)
for book_price in book_prices:
    printable_version_price = book_price.meta.string
    print(printable_version_price)

print(book_prices) yields:

[<div class="book-block-price " itemprop="offers" itemscope=""    itemtype="http://schema.org/Offer">
<meta content="29.99" itemprop="price"/>
<meta content="" itemprop="price"/>
            $ 29.99     

print(printable_version_price) yields "None".

How do I deal with meta tags? Or do I have other problems?

book_price.meta matches the first meta tag inside the book price block. That tag has no text of its own, because the price is stored in its content attribute rather than between an opening and a closing tag. This is why .string returns None:

<meta itemprop="price" content="29.99"/>

Instead, get the content attribute value:

book_price.meta["content"]
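For a fuller picture, here is a minimal sketch that runs against the HTML from the question. It assumes the markup is available as a string (html_doc is a stand-in name, not from the original code) and uses Python's built-in "html.parser". Since the block contains a second meta tag with an empty content attribute, the sketch loops over all price meta tags and keeps only the non-empty values.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; html_doc is a hypothetical name.
html_doc = """
<div class="book-block-price " itemprop="offers" itemtype="http://schema.org/Offer" itemscope>
<meta itemprop="price" content="29.99"/>
<meta itemprop="price" content=""/>
    $ 29.99         </div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

prices = []
for block in soup.find_all("div", class_="book-block-price"):
    # Look at every price meta tag in the block, not just the first one.
    for meta in block.find_all("meta", itemprop="price"):
        value = meta.get("content")  # None if the attribute is missing
        if value:  # skip missing or empty content attributes
            prices.append(value)

print(prices)
```

Using meta.get("content") instead of meta["content"] avoids a KeyError if some meta tag happens to lack the attribute.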

You could probably also do it with lxml's etree and an XPath query (a sketch, but it should be enough to get you going):

from lxml import etree
# x is a file-like object; for a string, use etree.fromstring(x, etree.HTMLParser())
doc = etree.parse(x, etree.HTMLParser())
# The price is in the content attribute, so select @content rather than text()
print(doc.xpath('//meta[@itemprop="price"]/@content'))
