
How can I get the text around all br tags from an HtmlResponse with XPath?

I use Scrapy to get an object of type HtmlResponse, for example:

<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>

I want to get all the text inside each p tag, but when I try:

response.xpath('//p[@class="post"]/text()').extract()

the result is a flat list of 6 lines, from line1 to line6. I want the text grouped per p tag instead, for example:

#first p
line1 line2 line3
#second p
line4 line5 line6

How can I do it?

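If you are using XPath 2.0 (for example through XSLT 2.0), you can use the string-join function. Note that Scrapy's selectors are built on lxml, which implements only XPath 1.0, so this expression cannot be passed to response.xpath():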

string-join(//p[@class="post"]/text(), " ")
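With an XPath 1.0 engine, which has no string-join, the usual workaround is to select each p element first and then join its text nodes in the host language. The following is a minimal sketch of that two-step idea using only the standard library's xml.etree.ElementTree. It is not Scrapy code: the wrapping div element and the variable names are assumptions made for illustration, and it relies on the sample markup being well-formed XML. In Scrapy the same pattern would be a loop over response.xpath('//p[@class="post"]') with a relative .xpath('text()') call on each selected element.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a root <div> so that it
# parses as well-formed XML (an assumption made only for this sketch).
html = """<div>
<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
</div>"""

root = ET.fromstring(html)

groups = []
# Step 1: select each <p class="post"> element on its own.
for p in root.findall(".//p[@class='post']"):
    # Step 2: collect the text inside this element only, dropping
    # whitespace-only pieces. itertext() also descends into child
    # elements; for this markup that matches the direct text nodes.
    lines = [t.strip() for t in p.itertext() if t.strip()]
    groups.append(" ".join(lines))

for group in groups:
    print(group)
```

Each entry of groups holds the joined text of one p element, which is the per-paragraph grouping the question asks for.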

You can also use BeautifulSoup to parse the HTML (pip install beautifulsoup4):

from bs4 import BeautifulSoup

html = """
<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
"""
soup = BeautifulSoup(html, "html.parser")
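# Collect every <p> element, then print the text inside each one.
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.text)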

Result:

line1
line2
line3


line4
line5
line6
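The text of each p element still carries the indentation and newlines of the source HTML. If one clean line per p is wanted, the whitespace can be normalised afterwards. Below is a minimal sketch; the helper name clean_lines is hypothetical, and raw_text is an assumed stand-in for the text of the first p in the sample.

```python
def clean_lines(raw_text):
    """Return the non-empty lines of raw_text with surrounding whitespace removed."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


# Assumed shape of the text of the first <p> in the sample above.
raw_text = "\n        line1\n        line2\n        line3\n"

lines = clean_lines(raw_text)
print(lines)            # the individual lines as a list
print(" ".join(lines))  # the same lines joined into a single string
```

Applying the helper to the text of each p in turn gives one list (or one joined string) per paragraph.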

Why do you have to use XPath? BS4 is a good solution. So is SimplifiedDoc:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<p class="post">
        line1<br />
        line2<br />
        line3<br />
</p>
<p class="post">
        line4<br />
        line5<br />
        line6<br />
</p>
'''
doc = SimplifiedDoc(html)
p = doc.getElementsByClass('post')
for x in p:
    print(x.html)
    print(doc.removeHtml(x.html, ' '))

Result:

line1<br />line2<br />line3<br />
line1 line2 line3
line4<br />line5<br />line6<br />
line4 line5 line6

Simply write response.css('p ::text').extract(). You can also select by the class attribute: response.css('.post ::text').extract()

With Scrapy selectors you need something like this:

result = [[line.strip("\n ") for line in p_tag.css("*::text").extract() if line.strip("\n ")]
          for p_tag in response.css("p.post")]

# result == [['line1', 'line2', 'line3'], ['line4', 'line5', 'line6']]
