
How to extract the text that follows a <strong> tag inside an element

I'm trying to extract text from an element that looks like this:

<div><strong>"Beginning_of_text"</strong>"Rest_of_text"</div>

When I try to extract "Rest_of_text" using Scrapy shell with

response.css("div::text").extract()

It gives me nothing. Do I have to use some special command to get the text that lies after a <strong> tag inside an element?

To get only "Rest_of_text", you can use response.xpath('//div/strong/following-sibling::text()').get()

Given the text you provided, the command you've mentioned should've returned the following:

['"Rest_of_text"']

The problem may occur if there is whitespace before the <strong> tag, e.g.:

<div>   <strong>"Beginning_of_text"</strong>"Rest_of_text"</div>

In this case, if you execute the same command, you'll get this:

['   ', '"Rest_of_text"']

But if there's nothing after the <strong> tag, you'll get this:

['   ']
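One lightweight way to cope with these three result shapes is to skip whitespace-only text nodes before using the list. The sketch below is plain Python: `text_nodes` stands in for whatever `response.css("div::text").extract()` returned, and `first_non_blank` is a hypothetical helper name, not part of Scrapy. It assumes the div has no meaningful text before the <strong> tag; if it does, that text would be returned instead.

```python
def first_non_blank(text_nodes, default=None):
    """Return the first text node that is not whitespace-only, stripped.

    `text_nodes` is assumed to be a list of strings such as the one
    returned by response.css("div::text").extract().
    """
    for node in text_nodes:
        stripped = node.strip()
        if stripped:
            return stripped
    return default


# The three result shapes discussed above:
print(first_non_blank(['"Rest_of_text"']))
print(first_non_blank(['   ', '"Rest_of_text"']))
print(first_non_blank(['   ']))
```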

The best way I know to handle all these cases is the following:

>>> full_text = ''.join(response.xpath('//div//text()').extract())
>>> # maxsplit=1 keeps the unpacking to two values even if the <strong> text occurs again later
>>> before_strong, after_strong = full_text.split(response.css('strong::text').extract_first(), 1)

So for the text you've provided, before_strong will be equal to '' and after_strong will be equal to '"Rest_of_text"', which seems to be what you want to get.
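The unpacking above assumes the <strong> text is actually found in the joined text. As a hedged variation, `str.partition` always returns three parts, so a missing <strong> tag does not raise. In this sketch `split_around` is a hypothetical helper, `text_nodes` stands in for the result of `response.xpath('//div//text()').extract()`, and `marker` for the result of `response.css('strong::text').extract_first()`:

```python
def split_around(text_nodes, marker):
    """Join the text nodes and split them around the first `marker`.

    Returns (before, after). If `marker` is None, empty, or not found,
    the whole text is returned as `before` and `after` is ''.
    """
    full_text = ''.join(text_nodes)
    if not marker:
        # extract_first() returns None when there is no <strong> element
        return full_text, ''
    before, _sep, after = full_text.partition(marker)
    return before, after


nodes = ['   ', '"Beginning_of_text"', '"Rest_of_text"']
before_strong, after_strong = split_around(nodes, '"Beginning_of_text"')
print(repr(before_strong), repr(after_strong))
```

Calling `.strip()` on either part removes the stray whitespace if it is not wanted.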
