How to extract text that lies after a <strong> tag in an element
I'm trying to extract text from an element that looks like this:
<div><strong>"Beginning_of_text"</strong>"Rest_of_text"</div>
When I try to extract "Rest_of_text" in the Scrapy shell with
response.css("div::text").extract()
It gives me nothing. Do I have to use some special command to get the text that lies after a <strong> tag inside an element?
For just "Rest_of_text", you can use response.xpath('//div/strong/following-sibling::text()').get()
Given the text you provided, the command you mentioned should have returned the following:
['"Rest_of_text"']
The problem may occur if there is whitespace before the strong tag, e.g.:
<div> <strong>"Beginning_of_text"</strong>"Rest_of_text"</div>
In this case, if you execute the same command, you'll get this:
[' ', '"Rest_of_text"']
But if there's nothing after the strong tag, you'll get this:
[' ']
The best way I know to handle all these cases is the following:
>>> full_text = ''.join(response.xpath('//div//text()').extract())
>>> before_strong, after_strong = full_text.split(response.css('strong::text').extract_first())
So for the text you've provided, before_strong will be equal to '' and after_strong will be equal to '"Rest_of_text"', which seems to be what you want.
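One caveat with the two-value unpacking above: str.split returns more than two pieces if the strong text occurs more than once in the div, and extract_first() returns None when there is no strong tag at all, in which case split(None) splits on whitespace instead. A slightly more defensive sketch uses str.partition, which always splits at the first occurrence only. The helper name split_around is hypothetical, not part of Scrapy.

```python
def split_around(full_text, marker):
    """Return (before, after) around the first occurrence of marker.

    If marker is None, empty, or not present, the whole text is
    returned as 'before' and 'after' is an empty string.
    """
    if not marker:
        return full_text, ""
    # partition() returns (head, separator, tail) and never raises.
    before, _sep, after = full_text.partition(marker)
    return before, after


# Stand-ins for the values the Scrapy shell would produce:
#   full_text   = ''.join(response.xpath('//div//text()').extract())
#   strong_text = response.css('strong::text').extract_first()
full_text = ' "Beginning_of_text""Rest_of_text"'
strong_text = '"Beginning_of_text"'
before_strong, after_strong = split_around(full_text, strong_text)
print(repr(before_strong), repr(after_strong))
```

Calling .strip() on each part afterwards removes stray whitespace if it is not wanted.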