How to get the text of all br tags from an HtmlResponse via XPath?

Question

I use Scrapy to fetch an object of type HtmlResponse, for example:

<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>

I want to get all the text inside each p tag, so I tried:

response.xpath('//p[@class="post"]/text()').extract()

But the result is a flat list of 6 lines, from line1 to line6. I want the text grouped by p tag, for example:

#first p
line1
line2
line3
#second p
line4
line5
line6

How can I do it?

Answer 1

If you are using XSLT 2.0, you can use the string-join function.

string-join(//p[@class="post"]/text(), ' ')
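Scrapy's selectors are built on lxml, which implements XPath 1.0, so string-join is not available on an HtmlResponse. A minimal sketch of an XPath 1.0 alternative is to select each p first and then run a relative query on it, which keeps the lines grouped per paragraph. The helper names clean_lines and lines_per_paragraph are hypothetical, and response is assumed to be the HtmlResponse from the question:

```python
def clean_lines(text_nodes):
    """Strip whitespace from each text node and drop the ones that end up empty."""
    return [node.strip() for node in text_nodes if node.strip()]


def lines_per_paragraph(response):
    """Return one list of lines per <p class="post"> element.

    `response` is assumed to be a Scrapy HtmlResponse (or any selector
    object exposing the same .xpath() API).
    """
    return [
        # "./text()" is relative to the current <p>, so text from
        # different paragraphs is never mixed together.
        clean_lines(p.xpath("./text()").extract())
        for p in response.xpath('//p[@class="post"]')
    ]
```

Called as lines_per_paragraph(response) inside a spider callback, this should give one inner list per p tag for the markup in the question, i.e. line1 to line3 in the first list and line4 to line6 in the second.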

Answer 2

You can also use BeautifulSoup to parse the HTML (pip install beautifulsoup4):

from bs4 import BeautifulSoup

html = """
<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
"""
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all('p')
# Print the text content of each <p> element
for p in paragraphs:
    print(p.text)

Result:

line1
line2
line3


line4
line5
line6

Answer 3

Why do you have to use XPath? BS4 is a good solution. So is SimplifiedDoc:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
'''
doc = SimplifiedDoc(html)
paragraphs = doc.getElementsByClass('post')
for p in paragraphs:
    print(p.html)                       # inner HTML, <br /> tags included
    print(doc.removeHtml(p.html, ' '))  # same content with the tags stripped

Result:

line1<br />line2<br />line3<br />
line1 line2 line3
line4<br />line5<br />line6<br />
line4 line5 line6

Answer 4

Simply write response.css('p ::text').extract(). You can also select by the class attribute: response.css('.post ::text').extract()

Answer 5

With Scrapy selectors you need something like this:

result = [[line.strip("\n ") for line in p_tag.css("*::text").extract() if line.strip("\n ")]
          for p_tag in response.css("p.post")]

# result == [['line1', 'line2', 'line3'], ['line4', 'line5', 'line6']]
