How to extract text in Python from a div tag if other HTML is within the tag?

Question

I am trying to extract a reference ID from HTML with Scrapy:

<div class="col" itemprop="description">
  <p>text Ref.&nbsp;<span>220.20.34.20.53.001</span></p>
  <p>more text</p>
</div>

The <span> and <p> tags are not always present.

Using an XPath selector:

import re
text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
except IndexError:  # findall returned no match
    ref_id = None

In this case it returns only an empty string, because there is HTML inside the tag.

Next I tried to extract the text with a CSS selector in order to use remove_tags:

>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])

This returns an empty result; somehow I cannot grab the item.

How can I extract the ref_id regardless of whether there are HTML <p> tags within the div? Some items in the crawl have no <p> tag and no <span>, and for those my first attempt with XPath works.

Answer 1

Try removing ::text from your last expression:

''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()])

But if you only need to extract 220.20.34.20.53.001 from your HTML, why not use response.css('div[itemprop=description] p span::text').extract()?

Or even response.css('div[itemprop=description]').re(r'([.\d]+)').
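To see what a bare character class of dots and digits picks up, here is a minimal sketch using only the standard re module. The sample string is a hypothetical stand-in for the text of the div from the question, and the stricter pattern is just one possible alternative, not necessarily what the answer intended.

```python
import re

# Hypothetical sample: the text of the div, with &nbsp; already decoded to "\xa0".
sample = "text Ref.\xa0220.20.34.20.53.001 more text"

# A bare character class also matches the lone dot that ends "Ref."
loose = re.findall(r"[.\d]+", sample)

# Requiring groups of digits joined by dots keeps only the ID itself.
strict = re.findall(r"(?:\d+\.)+\d+", sample)

print(loose)
print(strict)
```

If the loose pattern is used, the caller has to pick the right match out of the list, so the stricter pattern may be easier to work with.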

Answer 2

You don't need remove_tags, since you can get the text directly with selectors:

sel.css('div[itemprop=description] ::text')

The space before ::text matters: it selects the text of every descendant of the div with itemprop="description", not only the div's own direct text nodes. You can then extract your information with a regex:

sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')
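The same idea (collect all descendant text of the div, then apply the regex with the dot escaped) can be tried without a crawler. The following is a rough sketch using only the standard library rather than Scrapy selectors. The names DescriptionText, extract_ref_id and VOID_TAGS are hypothetical helpers introduced for illustration, and the sketch assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

# Tags without a closing tag; they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DescriptionText(HTMLParser):
    """Collect every text node nested inside the itemprop="description" div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &nbsp; to "\xa0"
        self.depth = 0    # greater than 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif tag == "div" and dict(attrs).get("itemprop") == "description":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_ref_id(html):
    """Return the first dotted number found in the div's text, or None."""
    parser = DescriptionText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).replace("\xa0", " ")
    match = re.search(r"(?:\d+\.)+\d+", text)
    return match.group(0) if match else None


nested = ('<div class="col" itemprop="description">\n'
          '  <p>text Ref.&nbsp;<span>220.20.34.20.53.001</span></p>\n'
          '  <p>more text</p>\n</div>')
flat = '<div itemprop="description">text Ref. 220.20.34.20.53.001</div>'

print(extract_ref_id(nested))
print(extract_ref_id(flat))
```

Both layouts from the question (with and without the <p> and <span> tags) go through the same code path, because the text is gathered from all nesting levels before the regex runs.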

2 answers

solution1
1 2018-12-22 14:02:36

solution2
1 ACCEPTED 2018-12-22 14:18:06
