
XPath expression for text that contains a certain string

On the website http://www.apkmirror.com/apk/redditinc/reddit/reddit-1-5-5-release/reddit-1-5-5-android-apk-download/ , I'm trying to extract the lines containing the Min: and Target: versions of Android (see screenshot below).

[Image: screenshot of the page's Android version fields]

In the Scrapy shell, so far I've come up with the XPath expression

In [1]: android_version = response.xpath('//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]')

such that if I append .//text() and call extract() , I get several lines, including the ones I want:

In [2]: android_version_text = android_version.xpath('.//text()').extract()

In [3]: android_version_text
Out[3]: 
[u'\n',
 u'Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) ',
 u'\n',
 u'Target: Android 6.0 (Marshmallow, API 23)',
 u'\n']

I would now like to refine the XPath expression to get only the fields whose text() contains "Min:" or "Target:". Following the question "XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode", I've tried

In [7]: android_version.xpath('.//*[contains(text(), "Min:"]')

but this raises a

ValueError: XPath error: Invalid expression in .//*[contains(text(), "Min:"]

How could I construct an XPath expression to get only the Min: line, for example?

Following https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/ , I came up with the following:

In [12]: android_min_version = response.xpath('//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]//text()[starts-with(., "Min:")]')

In [13]: android_min_version.extract()
Out[13]: [u'Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) ']

In short, to filter for the text you want, use an ordinary //text() step followed by the predicate [contains(., "target_string")] , where "target_string" is the string you are searching for. (Here I used starts-with instead of contains .)
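For anyone who wants to try this outside the Scrapy shell, here is a minimal sketch using lxml directly. The SAMPLE markup is hypothetical: it is only a guess at a structure that the XPath from the question would match, and the real page may be laid out differently. The sketch also shows the expression from the question with its closing parenthesis restored; the ValueError above appears to come from contains( never being closed before the ] .

```python
from lxml import html

# Hypothetical markup (an assumption, not copied from the real page):
# a node with title="Android version" followed by a sibling with
# class="appspec-value" that holds the two lines of interest.
SAMPLE = """
<html><body>
  <div class="appspec-icon" title="Android version"></div>
  <div class="appspec-value">
    <div>Min: Android 4.0.3 (Ice Cream Sandwich MR1, API 15) </div>
    <div>Target: Android 6.0 (Marshmallow, API 23)</div>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)
base = '//*[@title="Android version"]/following-sibling::*[@class="appspec-value"]'

# Filter text nodes directly; "." inside the predicate is the text node itself.
min_line = tree.xpath(base + '//text()[starts-with(., "Min:")]')
target_line = tree.xpath(base + '//text()[contains(., "Target:")]')

# Both lines in one query, using "or" inside a single predicate.
both = tree.xpath(base + '//text()[contains(., "Min:") or contains(., "Target:")]')

# The expression from the question with the missing ")" added.
# It selects elements rather than text nodes, and contains(text(), ...)
# only looks at the first text child of each element.
elements = tree.xpath(base + '//*[contains(text(), "Min:")]')

print([t.strip() for t in both])
print([el.text.strip() for el in elements])
```

In Scrapy the same expressions can be passed to response.xpath(...) and followed by extract() . Filtering on //text() is usually the more robust choice when an element has several text children, since each text node is tested on its own.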
