How to extract text from a div tag in Python if other HTML is inside the tag?

Question

I am trying to extract a ref. id from HTML with scrapy:

<div class="col" itemprop="description">
  <p>text Ref.&nbsp;<span>220.20.34.20.53.001</span></p>
  <p>more text</p>
</div>

The span and p tags are not always present.

Using an XPath selector:

import re
text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
except IndexError:
    ref_id = ''  # no match found

In this case it returns only an empty string, because there is HTML inside the tag.

Now I am trying to extract the text with a CSS selector so that I can use remove_tags:

>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])

This returns an empty result, as I somehow cannot grab the item.

How can I extract the ref_id regardless of whether there are HTML tags within the div? Some items of the crawl have no span tag and no p tag, and for those my first attempt with XPath works.

Answer 1

Try removing ::text from your last expression:

''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()])

But if you only need to extract 220.20.34.20.53.001 from your HTML, why not use response.css('div[itemprop=description] p span::text').extract()?

Or even response.css('div[itemprop=description]').re(r'([\.\d]+)').
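
To illustrate the "strip the tags, then apply the regex" idea outside of a running spider, here is a minimal sketch that uses only the standard library. The strip_tags helper is a hypothetical, crude stand-in for remove_tags (it is not the w3lib implementation), SAMPLE is the markup from the question, and the pattern is a simplified form of the one in the question.

```python
import html
import re

# Markup copied from the question.
SAMPLE = '''<div class="col" itemprop="description">
  <p>text Ref.&nbsp;<span>220.20.34.20.53.001</span></p>
  <p>more text</p>
</div>'''


def strip_tags(markup):
    """Crude stand-in for remove_tags: drop anything that looks like a tag."""
    without_tags = re.sub(r"<[^>]+>", "", markup)
    # Decode entities such as &nbsp; and turn non-breaking spaces into plain ones.
    return html.unescape(without_tags).replace("\xa0", " ")


text = strip_tags(SAMPLE)
# "Ref", an optional dot, an optional space, then the id itself.
match = re.search(r"Ref\.? ?([A-Z\d.]+)", text)
ref_id = match.group(1) if match else None
print(ref_id)
```

Because the tags are removed before the regex runs, the same code should work whether or not the span and p tags are present.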

Answer 2

You don't need remove_tags, as you can get the text directly with selectors:

sel.css('div[itemprop=description] ::text')

That gets all the inner text from the div tag with itemprop="description"; you can then extract your information with a regex:

sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')
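
The regex step can be checked on its own, without Scrapy installed. The sketch below assumes the descendant selector yields roughly the text nodes listed in with_span for the sample markup; extract_ref_id and both sample lists are illustrative names, not part of Scrapy.

```python
import re

# Dotted numeric id such as 220.20.34.20.53.001 (the dot is escaped so it matches literally).
REF_ID_RE = re.compile(r"(?:\d+\.)+\d+")


def extract_ref_id(text_nodes):
    """Return the first dotted numeric id found in a list of text nodes, or None."""
    # Join the nodes and normalise non-breaking spaces, which &nbsp; decodes to.
    text = " ".join(text_nodes).replace("\xa0", " ")
    match = REF_ID_RE.search(text)
    return match.group(0) if match else None


# Text nodes roughly as the descendant selector would yield them for the sample markup.
with_span = ["\n  ", "text Ref.\xa0", "220.20.34.20.53.001", "\n  ", "more text", "\n"]
# Hypothetical variant of the item without the span and p tags.
without_span = ["text Ref.\xa0220.20.34.20.53.001 more text"]

print(extract_ref_id(with_span))
print(extract_ref_id(without_span))
print(extract_ref_id(["no id here"]))
```

In a spider, the list would come from response.css('div[itemprop=description] ::text').extract(). The space before ::text matters: it selects text from all descendants, not only the direct children of the div.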

Answer 1
1 2018-12-22 14:02:36

Answer 2
1 Accepted 2018-12-22 14:18:06
