XPath Scrapy: joining text nodes separated by br tags inside a class

Question

I am learning web scraping with Python, XPath, and Scrapy. I am stuck on the following problem and would appreciate any help.

Here is the HTML:

<div class="discussionpost">
“This is paragraph one.”
<br>
<br>
“This is paragraph two."'
<br>
<br>
"This is paragraph three.”
</div>

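This is the output I want: "This is paragraph one. This is paragraph two. This is paragraph three." I want to merge all the paragraphs separated by <br> into one string. There are no <p> tags.

However, the output I get is a list of separate strings: "This is paragraph one.", "This is paragraph two.", "This is paragraph three."

This is the code I am using: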

sentences = response.xpath('//div[@class="discussionpost"]/text()').extract()

I understand why the code above behaves this way, but I can't work out how to change it to do what I need. Any help is greatly appreciated.

Answer 1

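To get every text node under the div, including those nested in child elements, use //text() rather than /text(), and join the resulting list into a single string: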

sentences = ' '.join(response.xpath('//div[@class="discussionpost"]//text()').extract()).strip()

Verified in the Scrapy shell:

>>> from scrapy import Selector
>>> html_doc = '''
... <html>
...  <body>
...   <div class="discussionpost">
...    “This is paragraph one.”
...    <br/>
...    <br/>
...    “This is paragraph two."'
...    <br/>
...    <br/>
...    "This is paragraph three.”
...   </div>
...  </body>
... </html>
...
... '''
>>> res = Selector(text=html_doc)
>>> res
<Selector xpath=None data='<html>\n <body>\n  <div class="discussi...'>
>>> sentences = ''.join(res.xpath('//div[@class="discussionpost"]//text()').extract())
>>> sentences
'\n   “This is paragraph one.”\n   \n   \n   “This is paragraph two."\'\n   \n   \n   "This is paragraph three.”\n  '
>>> txt = sentences
>>> txt
'\n   “This is paragraph one.”\n   \n   \n   “This is paragraph two."\'\n   \n   \n   "This is paragraph three.”\n  '
>>> txt = sentences.replace('\n','').replace("\'",'').replace('    ','').replace("“",'').replace('”','').replace('"','').strip()
>>> txt
'This is paragraph one. This is paragraph two. This is paragraph three.'
>>>
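
The chain of replace() calls above depends on the exact indentation of the sample HTML (it removes runs of four spaces). As a hedged alternative, here is a minimal sketch that deletes quote characters and then collapses any run of whitespace. The names clean_text and QUOTE_TABLE are made up for this illustration, and the set of quote characters is an assumption; raw is the string printed in the session above.

```python
# Characters to delete: straight and curly quotes (an assumed set; adjust as needed).
QUOTE_TABLE = str.maketrans('', '', '"\'“”‘’')

# The joined string from the shell session above.
raw = ('\n   “This is paragraph one.”\n   \n   \n   “This is paragraph two."\'\n'
       '   \n   \n   "This is paragraph three.”\n  ')


def clean_text(raw):
    """Remove quote characters, then collapse every whitespace run to one space."""
    no_quotes = raw.translate(QUOTE_TABLE)
    # str.split() with no argument splits on any whitespace and drops empty pieces.
    return ' '.join(no_quotes.split())


txt = clean_text(raw)
print(txt)
```

Because this also deletes apostrophes, words such as "don't" would lose theirs; drop ' from the table if the real posts contain contractions.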

Update:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.ibsgroup.org/threads/hemorrhoids-as-cause-of-pain.363290/']

    def parse(self, response):
        # Iterate over every element whose class attribute is exactly "bbWrapper".
        for p in response.xpath('//*[@class="bbWrapper"]'):
            yield {
                # Join all descendant text nodes of this element into one string.
                'comment': ''.join(p.xpath('.//text()').getall()).strip(),
            }
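
Joining with '' keeps the newlines that sit inside each text node and can glue neighbouring nodes together. One possible refinement, sketched here under assumptions, is to strip each node first and join only the non-empty pieces with a single space. The helper name join_text_nodes is hypothetical, and the nodes list is a hand-written stand-in for what .getall() or .extract() might return for the question's HTML; the exact whitespace depends on the page source.

```python
# Stand-in for the list returned by .getall() on the question's HTML
# (an assumption: real whitespace depends on the page source).
nodes = [
    '\n“This is paragraph one.”\n',
    '\n',
    '\n“This is paragraph two."\'\n',
    '\n',
    '\n"This is paragraph three.”\n',
]


def join_text_nodes(nodes):
    """Strip each node, skip whitespace-only ones, and join the rest with one space."""
    stripped = (node.strip() for node in nodes)
    return ' '.join(part for part in stripped if part)


merged = join_text_nodes(nodes)
print(merged)
```

In the spider, the same helper could be applied as join_text_nodes(p.xpath('.//text()').getall()) in place of the ''.join(...).strip() expression. The quote characters are kept here; removing them is a separate step.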
