How do I get all the text between br tags from an HtmlResponse with XPath?

Question

I am using Scrapy and get an object of type HtmlResponse, for example:

<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>

I want to get all the text inside each p tag, so I tried:

response.xpath('//p[@class="post"]/text()').extract()

But the result is six separate items, line1 through line6. I want the text grouped per p tag, for example: #first p line1 line2 line3 #second p line4 line5 line6. How can I do that?

Answer 1

If you are using XSLT 2.0, you can use the string-join function. In XPath 2.0 it takes the separator as a second argument:

string-join(//p[@class="post"]/text(), ' ')

Answer 2

You can also use BeautifulSoup to parse the HTML (pip install beautifulsoup4):

from bs4 import BeautifulSoup

html = """
<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
"""
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

Result:

line1
line2
line3


line4
line5
line6
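If each paragraph should come back as a single string, as the question asks, one option is get_text() with a separator and strip=True. This is a small sketch of that variation, not part of the original answer; the class_="post" filter and the variable names are assumptions chosen to match the sample HTML.

```python
from bs4 import BeautifulSoup

html = """
<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(" ", strip=True) strips every text fragment inside the tag,
# drops the empty ones, and joins the rest with a single space.
joined = [
    p.get_text(" ", strip=True)
    for p in soup.find_all("p", class_="post")
]

print(joined)
```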

Answer 3

Why does it have to be XPath? BS4 is a good solution, and so is SimplifiedDoc:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
'''
doc = SimplifiedDoc(html)
posts = doc.getElementsByClass('post')
for x in posts:
    print(x.html)
    print(doc.removeHtml(x.html, ' '))

Result:

line1<br />line2<br />line3<br />
line1 line2 line3
line4<br />line5<br />line6<br />
line4 line5 line6
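In the same spirit, the grouping can be done without any third-party package. The sketch below uses only the standard library's html.parser; the PostTextParser class and its attributes are hypothetical names introduced for illustration, and it assumes the p tags are not nested.

```python
from html.parser import HTMLParser


class PostTextParser(HTMLParser):
    """Collect the text fragments of each <p class="post"> separately."""

    def __init__(self):
        super().__init__()
        self.posts = []        # one list of text fragments per matching <p>
        self._in_post = False  # True while inside a <p class="post">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "post" in classes:
            self._in_post = True
            self.posts.append([])

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_post = False

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that appears inside a matching <p>.
        if self._in_post and text:
            self.posts[-1].append(text)


html = """
<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
"""

parser = PostTextParser()
parser.feed(html)
parser.close()

# Join the fragments of each paragraph into one string.
joined = [" ".join(lines) for lines in parser.posts]
print(joined)
```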

Answer 4

Just write response.css('p ::text').extract(). You can also select by the class attribute: response.css('.post ::text').extract()

Answer 5

With Scrapy selectors, you need something like this:

result = [[line.strip("\n ") for line in p_tag.css("*::text").extract() if line.strip("\n ")]
          for p_tag in response.css("p.post")]

# result == [['line1', 'line2', 'line3'], ['line4', 'line5', 'line6']]

5 answers

Answer 1
0 2019-12-20 02:56:42

Answer 2
0 2019-12-20 03:04:15

Answer 3
0 2019-12-20 06:24:01

Answer 4
0 2019-12-20 09:42:03

Answer 5
0 2019-12-21 19:02:50
