
XPath for LXML with Intermediary Element

I'm trying to scrape some pages with Python and lxml. My test page is http://www.sarpy.com/oldterra/prop/PDisplay3.asp?ParamValue1=010558233

I'm having good luck with most of the XPaths. For example,

tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../tr[3]/td[1]/text()')

successfully gets me the date of the first sale listed. I have several other pieces working too. However, I cannot get the B&P listed under the sale date. For example, the B&P of the first sale is 200639333.

I notice in the page structure that there is a form element preceding the tr of the B&P item. Since it's the next table row, I tried incrementing the tr index as follows:

tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../tr[4]/td[1]/text()')

That returns:

['\r\n           ']

Because of the line breaks and the br and input sub-elements within the field, I tried changing text() to text()[1], text()[2], etc., but no luck.

I tried to base the path off of the adjacent form like this:

tree.xpath('/html/body/table[7]/form[@action="../rod/ImageDisplay.asp"]/following-sibling::tr/td[1]/text()')

No luck.

I figure there are two potential issues: the intermediary form elements that may be breaking the indexing patterns, and the whitespace. I'd appreciate any help in correcting this XPath.

The <tr> you are looking for is a child of the <form>, not its sibling. Try:

tree.xpath('/html/body/table/tr[1]/td[contains(text(), "Sales Information")]/../../form[1]/tr[1]/td[1]/text()')

This may get you 200639333 surrounded by a lot of whitespace.
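If the surrounding whitespace is a problem, one way to deal with it is to filter out the whitespace-only text nodes in the XPath and strip the rest in Python. Here is a minimal sketch; the cell markup is invented for illustration (the real page may differ), and it is parsed as XML so the tree is exactly what is written:

```python
from lxml import etree

# Hypothetical cell, made up for this example: leading whitespace,
# an input and a br element, then the value we actually want.
cell = etree.fromstring(
    '<td>\r\n           <input type="image" name="img"/><br/>200639333\r\n   </td>'
)

# text() selects every text-node child, including whitespace-only ones.
all_text = cell.xpath('text()')

# normalize-space() is empty (false) for whitespace-only nodes, so this
# predicate keeps only the text nodes that carry real content.
non_blank = cell.xpath('text()[normalize-space()]')

# Strip the remaining leading/trailing whitespace in Python.
values = [t.strip() for t in non_blank]

print(all_text)
print(values)
```

This also shows why text()[1] can come back as nothing but whitespace: the first text node is whatever sits before the first child element.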

Alternatively, select the form by its action attribute:

tree.xpath('/html/body/table[7]/form[@action="../rod/ImageDisplay.asp"]/tr[1]/td[1]/text()')

The second expression applies to all such elements.
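To see the child-versus-sibling difference in isolation, here is a small sketch on a stand-in table. The markup below is an assumption that only mirrors the structure described in the question (it is not copied from the page), and it is parsed as XML so the form stays exactly where it is written:

```python
from lxml import etree

# Hypothetical stand-in for the sales table: the B&P row is nested
# inside the form rather than sitting next to it.
doc = etree.fromstring("""
<table>
  <tr><td>Sales Information</td></tr>
  <tr><td>(column headers)</td></tr>
  <tr><td>(sale date)</td></tr>
  <form action="../rod/ImageDisplay.asp">
    <tr><td>
      200639333
    </td></tr>
  </form>
</table>
""")

form = '/table/form[@action="../rod/ImageDisplay.asp"]'

# The table has only three tr children, so tr[4] matches nothing here.
fourth_row = doc.xpath('/table/tr[4]')

# The row is inside the form, so searching the form's siblings finds nothing.
as_sibling = doc.xpath(form + '/following-sibling::tr/td[1]/text()')

# Stepping into the form as a parent reaches the cell.
as_child = doc.xpath(form + '/tr[1]/td[1]/text()')
bp_numbers = [t.strip() for t in as_child if t.strip()]

print(fourth_row, as_sibling, bp_numbers)
```

On the real page the tree depends on how the HTML parser handles a form placed directly inside a table, so it is worth inspecting the parsed tree (for example with etree.tostring) before relying on either path.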
