
XPath: how to extract text before, AND within, AND after the <strong> element

I am working on a Scrapy spider that uses XPath to extract the information I need. The source page is generated by the website's search function. For example, I want to get the items with "computer" in the title. On the source page, every occurrence of "computer" is in bold because of the search highlighting. "computer" can appear at the beginning, in the middle, or at the end of a title, and some items don't have "computer" in the title at all. See the examples below:

Example 1: ("computer" at the beginning)
<a class="title" href="whatever1">
<strong> Computer </strong>
, used
</a>  

Example 2: ("computer" in the middle)
<a class="title" href="whatever2">
Low price
<strong> computer </strong>
, great deal
</a> 

Example 3: ("computer" at the end)
<a class="title" href="whatever3">
Don't miss this
<strong> Computer </strong>
</a>

Example 4: (no keyword of "computer")
<a class="title" href="whatever4">
Best laptop deal ever!      
</a>

The XPath expression I tried, .//a[@class="title"]/text(), only returns the portion AFTER the strong element. For the four examples above, I get the following results:

Example 1:
, used

Example 2:
, great deal

Example 3: (Nothing)


Example 4:
Best laptop deal ever!

I need an XPath expression that covers all four situations and collects the full title of each item.

The simplest approach would be to select all the text nodes and join them:

"".join(response.xpath('.//a[@class="title"]//text()').extract())

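Note the double slash before text(): this is the key fix here. A single slash (/text()) selects only the text nodes that are direct children of the a element, while a double slash (//text()) also selects the text nodes nested inside descendants such as strong.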
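
To see what "all text nodes, joined" gives for the four sample titles, here is a minimal sketch that runs without Scrapy. It is a stand-in, not the answer's exact code: it uses the standard library's xml.etree.ElementTree, whose itertext() method walks an element's own and descendant text in document order, much like .//text(). The wrapping <div>, the SAMPLE constant and the full_titles helper are assumptions made for illustration. It joins the text per link, so each item keeps its own title, and collapses the leftover whitespace.

```python
import xml.etree.ElementTree as ET

# The four sample titles from the question, wrapped in a hypothetical <div>
# root so the fragment parses as a single document (an assumption).
SAMPLE = """<div>
<a class="title" href="whatever1">
<strong> Computer </strong>
, used
</a>
<a class="title" href="whatever2">
Low price
<strong> computer </strong>
, great deal
</a>
<a class="title" href="whatever3">
Don't miss this
<strong> Computer </strong>
</a>
<a class="title" href="whatever4">
Best laptop deal ever!
</a>
</div>"""


def full_titles(markup):
    """Return one whitespace-normalised title per <a class="title"> element."""
    root = ET.fromstring(markup)
    titles = []
    for link in root.iterfind('.//a[@class="title"]'):
        # itertext() yields the link's own text and every descendant's text
        # in document order, so the <strong> content is included.
        raw = "".join(link.itertext())
        # split() + join collapses newlines and repeated spaces.
        titles.append(" ".join(raw.split()))
    return titles


titles = full_titles(SAMPLE)
for title in titles:
    print(title)
```

In a Scrapy callback the same per-item idea would presumably look like looping over response.xpath('.//a[@class="title"]') and joining link.xpath('.//text()').extract() for each link, rather than joining the text of every link on the page into one string.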
