How to extract text from a div tag in Python when other HTML is nested inside the tag?

Question

I am trying to extract a reference ID (the "Ref." value) from HTML with Scrapy:

<div class="col" itemprop="description">
  <p>text Ref.&nbsp;<span>220.20.34.20.53.001</span></p>
  <p>more text</p>
</div>

The span and p tags are not always present.

Using an XPath selector:

text = ' '.join(response.xpath('//div[@itemprop="description"]/p/text()').extract()).replace(u'\xa0', u' ')
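try:
    ref_id = re.findall(r"Ref\.? ?((?:[A-Z\d\.]+)|(?:[\d.]+))", text)[0].strip()
except IndexError:  # assumed handler; the except clause was cut off in the original post
    ref_id = None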

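In this case it returns only an empty string, because the ID sits inside nested HTML (the span) that /p/text() does not descend into.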

Now I am trying to extract the text with a CSS selector so that I can use remove_tags:

>>> ''.join([remove_tags(w).strip() for w in response.css('div[itemprop="description"]::text').extract()])

This returns an empty result; for some reason the selector does not pick up the element's text.

How can I extract ref_id regardless of whether the div contains HTML <p> tags? In my first attempt with XPath, some of the crawled items had neither a <p> tag nor a <span>.

Answer 1

Try removing ::text from the last expression:

''.join([remove_tags(w).strip() for w in response.css('div[itemprop=description]').extract()])

However, if you only need to extract 220.20.34.20.53.001 from the HTML, why not use response.css('div[itemprop=description] p span::text').extract()?

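Or even response.css('div[itemprop=description]').re(r'([\.\d]+)'). Note that .re() returns a list of every run of digits and dots, so a stray "." (such as the one after "Ref") can show up alongside the ID.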

Answer 2

You don't need remove_tags, because you can get the text directly with a selector:

sel.css('div[itemprop=description] ::text')

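This gets all the text inside the div tag with itemprop="description", including text in nested tags. You can then extract the information with a regular expression: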

sel.css('div[itemprop=description] ::text').re_first(r'(?:\d+\.)+\d+')

Answer 1
1 2018-12-22 14:02:36

Answer 2
1 Accepted 2018-12-22 14:18:06
