xpath: how to extract text before, AND within, AND after the <strong> element

I am working on a Scrapy spider that uses XPath to extract the information I need. The source page is generated by the website's search function. For example, I want the items with "computer" in the title. On the source page, every occurrence of "computer" is in bold because of the search highlighting, and it can appear at the beginning, in the middle, or at the end of a title. Some items don't have "computer" in the title at all. See the examples below:

Example 1: ("computer" at the beginning)
<a class="title" href="whatever1">
<strong> Computer </strong>
, used
</a>  

Example 2: ("computer" in the middle)
<a class="title" href="whatever2">
Low price
<strong> computer </strong>
, great deal
</a> 

Example 3: ("computer" at the end)
<a class="title" href="whatever3">
Don't miss this
<strong> Computer </strong>
</a>

Example 4: (no keyword of "computer")
<a class="title" href="whatever4">
Best laptop deal ever!      
</a>

The XPath I tried, .//a[@class="title"]/text(), only returns the portion AFTER the strong element. For the four examples above, I get the following results:

Example 1:
, used

Example 2:
, great deal

Example 3: (Nothing)


Example 4:
Best laptop deal ever!

I need an XPath expression that covers all four situations and collects the full title of each item.

The simplest approach would be to search for all "text" nodes and "join" them:

"".join(response.xpath('.//a[@class="title"]//text()').extract())

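Note the double slash before text(): this is the key fix here. A single slash (/text()) selects only the text nodes that are direct children of the a element, so the text inside strong is skipped. A double slash (//text()) selects every descendant text node, including the one inside strong.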
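Two practical details follow from the join above. Called on the whole response, it concatenates the text of every matching link into one string, so it is usually run once per a element. The joined text also keeps the line breaks and padding spaces from the HTML. Below is a minimal sketch of a whitespace-collapsing helper. The name join_title is hypothetical, and the fragment lists are written out by hand from the four examples in the question rather than produced by Scrapy.

```python
def join_title(fragments):
    """Concatenate text-node strings and collapse runs of whitespace."""
    return " ".join("".join(fragments).split())


# Text nodes that //text() would select for each example in the question,
# transcribed by hand from the HTML shown there (not real scraper output).
examples = {
    "Example 1": ["\n", " Computer ", "\n, used\n"],
    "Example 2": ["\nLow price\n", " computer ", "\n, great deal\n"],
    "Example 3": ["\nDon't miss this\n", " Computer ", "\n"],
    "Example 4": ["\nBest laptop deal ever!      \n"],
}

titles = {name: join_title(parts) for name, parts in examples.items()}
for name, title in titles.items():
    print(f"{name}: {title}")

# Inside a spider callback, a per-item loop might look like this
# (untested sketch, assuming the page layout from the question):
#     for a in response.xpath('//a[@class="title"]'):
#         title = join_title(a.xpath('.//text()').getall())
```

XPath's own normalize-space(.) function performs a similar cleanup on the selector side, for example a.xpath('normalize-space(.)').get(), which avoids the helper entirely.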
