How can I get text from an HtmlResponse for all br tags with XPath?
I use Scrapy to get an object of type HtmlResponse, for example:
<p class="post">
line1<br />
line2<br />
line3<br />
</p>
<p class="post">
line4<br />
line5<br />
line6<br />
</p>
I want to get all the text inside each p tag. I tried:
response.xpath('//p[@class="post"]/text()').extract()
But the result is a flat list of six lines, line1 through line6. I want the text grouped by p tag, for example:
#first p
line1 line2 line3
#second p
line4 line5 line6
How can I do it?
If you are using XSLT 2.0, you can use the string-join function. In XPath 2.0 it takes the sequence and a separator:
string-join(//p[@class="post"]/text(), ' ')
You can also use BeautifulSoup to parse the HTML (pip install beautifulsoup4):
from bs4 import BeautifulSoup
html = """
<p class="post">
line1<br />
line2<br />
line3<br />
</p>
<p class="post">
line4<br />
line5<br />
line6<br />
</p>
"""
soup = BeautifulSoup(html, "html.parser")
p = soup.find_all('p')
for x in p:
    print(x.text)
Result:
line1
line2
line3
line4
line5
line6
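If one string per p tag is wanted instead of newline-separated text, a possible variation is get_text with a separator and strip=True, or stripped_strings for a list per tag. This is a sketch that assumes beautifulsoup4 is installed; the names joined and grouped are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<p class="post">
line1<br />
line2<br />
line3<br />
</p>
<p class="post">
line4<br />
line5<br />
line6<br />
</p>
"""

soup = BeautifulSoup(html, "html.parser")
posts = soup.find_all("p", class_="post")

# One space-joined string per p tag; strip=True trims each text node
# and drops the ones that are empty after trimming.
joined = [p.get_text(" ", strip=True) for p in posts]

# One list of lines per p tag.
grouped = [list(p.stripped_strings) for p in posts]

print(joined)
print(grouped)
```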
Why do you have to use XPath? BS4 is a good solution. So is SimplifiedDoc:
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''<p class="post">
line1<br />
line2<br />
line3<br />
</p>
<p class="post">
line4<br />
line5<br />
line6<br />
</p>
'''
doc = SimplifiedDoc(html)
p = doc.getElementsByClass('post')
for x in p:
    print(x.html)
    print(doc.removeHtml(x.html, ' '))
Result:
line1<br />line2<br />line3<br />
line1 line2 line3
line4<br />line5<br />line6<br />
line4 line5 line6
Simply write:
response.css('p ::text').extract()
You can also use the class attribute:
response.css('.post ::text').extract()
With Scrapy selectors you need something like this:
result = [[line.strip("\n ") for line in p_tag.css("*::text").extract() if line.strip("\n ")]
          for p_tag in response.css("p.post")]
# result = [['line1', 'line2', 'line3'], ['line4', 'line5', 'line6']]