Trouble with scraping text from site using lxml / xpath()

Quick one. I'm new to using lxml and have spent quite a while trying to scrape text data from a particular site. The element structure is as shown below:

http://tinypic.com/r/2iw7zaa/8

What I want to do is extract the 100,100 that is shown within the highlighted area. The statements I've tried are below (I saved the source of the site into a text file, test.txt, to test; I also tried with an .html extension):

from lxml import html

# Parse the saved page; the file name must be passed as a string
tree = html.parse("test.txt")
#value = tree.xpath('//*[@id="content"]/table[4]/tbody/tr[1]/td[2]')
#value = tree.xpath('//*[@id="content"]/table[4]/tbody/tr[1]/td[2]/text()')

All I seem to get as a result is an empty list []. Any help would be greatly appreciated.

P.S. I commented out the two value statements because I'm only showing what I tried. I tried a bunch of other XPath statements similar to the ones above, but they were lost when the Python shell crashed on me.

P.P.S. Apologies for the link to the pic; due to rep I can't post the pic directly.

Try removing '/tbody' from the XPath.

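The browser may be inserting the 'tbody' element when it builds the page, so it shows up in the inspector even though it does not appear in the raw HTML. lxml parses the raw HTML, so a path that goes through 'tbody' matches nothing.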

Read more here and here.
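
To see the effect in isolation, here is a minimal sketch. The HTML string is a made-up stand-in for the saved page (one table instead of four, so the path uses table[1] rather than table[4]), and it assumes the raw source has no tbody element, which may or may not be true of the real site.

```python
from lxml import html

# Hypothetical stand-in for the saved page: raw HTML without any <tbody>.
RAW_HTML = """
<html><body>
  <div id="content">
    <table>
      <tr><td>Value</td><td>100,100</td></tr>
    </table>
  </div>
</body></html>
"""

tree = html.fromstring(RAW_HTML)

# Path in the style copied from a browser inspector: it goes through tbody,
# which does not exist in the tree lxml built from the raw HTML.
with_tbody = tree.xpath('//*[@id="content"]/table[1]/tbody/tr[1]/td[2]/text()')

# Same path with the tbody step removed.
without_tbody = tree.xpath('//*[@id="content"]/table[1]/tr[1]/td[2]/text()')

# Variant that matches whether or not a tbody sits between table and tr.
either_way = tree.xpath('//*[@id="content"]/table[1]//tr[1]/td[2]/text()')

print(with_tbody)     # []
print(without_tbody)  # ['100,100']
print(either_way)     # ['100,100']
```

The '//tr[1]' form is handy when you are not sure whether the source contains tbody, but it would also reach into nested tables if the page has any, so check the result against the real file.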
