What am I doing wrong? Parsing HTML using lxml

I'm trying to parse a webpage using lxml, and I'm having trouble retrieving all the text within a div. Here's what I have so far:

import requests
from lxml import html
page = requests.get("https://www.goodeggs.com/sfbay/missionheirloom/seasonal-chicken-stew-16oz/53c68de974e06f020000073f",verify=False)
tree = html.fromstring(page.text)
foo = tree.xpath('//section[@class="product-description"]/div[@class="description-body"]/text()')

As of now, foo comes back as an empty list []. Some other pages return part of the content, but not the text that sits inside tags within the <div>. Still other pages return all of the content, because it is at the top level of the div.

How do I get all of the text content within that div? Thanks!

The text is inside two <p> tags, so part of the text is in each p.text rather than in div.text. However, you can pull the text from everything nested inside the <div> by calling the text_content method instead of using the XPath text():

import requests
import lxml.html as LH
url = ("https://www.goodeggs.com/sfbay/missionheirloom/" 
       "seasonal-chicken-stew-16oz/53c68de974e06f020000073f")
page = requests.get(url, verify=False)
root = LH.fromstring(page.text)

path = '//section[@class="product-description"]/div[@class="description-body"]'
for div in root.xpath(path):
    print(div.text_content())

yields

We’re super excited about the changing seasons! Because the new season brings wonderful new ingredients, we’ll be changing the flavor profile of our stews. Starting with deliveries on Thursday October 9th, the Chicken and Wild Rice stew will be replaced with a Classic Chicken Stew. We’re sure you’ll love it!Mission: Heirloom is a food company based in Berkeley. All of our food is sourced as locally as possible and 100% organic or biodynamic. We never cook with refined oils, and our food is always gluten-free, grain-free, soy-free, peanut-free, legume-free, and added sugar-free.
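The same behavior can be reproduced offline on a small fragment. This is only a minimal sketch: the SAMPLE markup is hypothetical and merely mimics the structure described above (text inside <p> children of the div), not the real page. It also shows one possible way to keep the paragraphs apart, since text_content() concatenates them with no separator (note "love it!Mission:" in the output above).

```python
import lxml.html as LH

# Hypothetical markup (an assumption, not the real page): the text lives
# in <p> children, not directly under the div.
SAMPLE = (
    '<html><body><section class="product-description">'
    '<div class="description-body"><p>First paragraph.</p>'
    '<p>Second paragraph.</p></div>'
    '</section></body></html>'
)

root = LH.fromstring(SAMPLE)
path = '//section[@class="product-description"]/div[@class="description-body"]'

# text() only selects text nodes that are direct children of the div,
# and here the div has none, so the result is an empty list.
direct = root.xpath(path + '/text()')

# text_content() gathers the text of the div and all of its descendants.
div = root.xpath(path)[0]
whole = div.text_content()

# Joining per-paragraph text keeps a separator between the <p> blocks.
separated = "\n".join(p.text_content() for p in div.xpath('./p'))

print(direct)
print(whole)
print(separated)
```

Joining per paragraph is just one option; it assumes the text you care about sits in direct <p> children of the div.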

PS. dfsq has already suggested using the XPath ...//text(). That also works, but in contrast to text_content, the pieces of text are returned as separate items:

In [256]: root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')

In [257]: root.xpath('//a//text()')
Out[257]: ['FOO ', 'BAR ', 'QUX', ' ', ' BAZ']

In [258]: [a.text_content() for a in root.xpath('//a')]
Out[258]: ['FOO BAR QUX  BAZ']

I think the XPath expression should be:

//section[@class="product-description"]/div[@class="description-body"]//text()

UPD. As @unutbu pointed out, the expression above returns the text nodes as a list, so you will have to loop over them. If you need the entire text content as a single string, see unutbu's answer for other options.
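If you do go the //text() route, the pieces can be joined back into one string. A small sketch, reusing the toy fragment from unutbu's answer; the whitespace normalization at the end is an optional extra, not something either answer prescribes:

```python
import lxml.html as LH

root = LH.fromstring('<a>FOO <b>BAR <c>QUX</c> </b> BAZ</a>')

# //text() returns each text node as a separate list item.
pieces = root.xpath('//a//text()')

# Joining the items reproduces what text_content() gives for this fragment.
joined = "".join(pieces)

# Optional: collapse runs of whitespace into single spaces.
normalized = " ".join(joined.split())

print(repr(joined))
print(repr(normalized))
```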
