I am trying to extract a reference id from the following HTML with Scrapy:
<div class="col" itemprop="description">
<p>text Ref. <span>220.20.34.20.53.001</span></p>
<p>more text</p>
</div>
The span and p tags are not always present.
Using an XPath selector:
text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
except IndexError:  # findall returned no match
    ref_id = None
In this case it returns only an empty string, because the id is wrapped in HTML inside the p tag.
Next I tried to extract the text with a CSS selector so that I could use remove_tags:
>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])
This returns an empty result, as I somehow cannot grab the element.
How can I extract the ref_id regardless of whether there are HTML <p> tags within the div? Some items of the crawl have no <p> tag and no <span>; for those, my first attempt with XPath works.
Try removing ::text from your last expression:
''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()])
But if you only need to extract 220.20.34.20.53.001 from your HTML, why not use response.css('div[itemprop=description] p span::text').extract()?
Or even response.css('div[itemprop=description]').re(r'([.\d]+)'), which returns a list of every run of digits and dots found in the element.
You don't need remove_tags, as you can get the text directly with the selectors:
sel.css('div[itemprop=description] ::text')
That will get all the inner text from the div tag with itemprop="description", and you can then extract your information with a regex:
sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')