Scrapy: extract HTML from a div without the wrapping parent tag
I use Scrapy to crawl a website.
I want to extract the contents of a certain div.
<div class="short-description">
{some mess with text, <br>, other html tags, etc}
</div>
loader.add_xpath('short_description', "//div[@class='short-description']/div")
With that code I get what I need, but the result includes the wrapping HTML (<div class="short-description">...</div>).
How can I get rid of that parent HTML tag?
Note: selectors like text() and node() cannot help me, because my div contains <br>, <p>, other divs, whitespace, etc., and I need to keep them.
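from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)
# text() returns only the div's direct text nodes, not its child tags
for text in hxs.select("//div[@class='short-description']/text()").extract():
    print text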
Try node() in combination with Join():
loader.get_xpath('//div[@class="short-description"]/node()', Join())
The results look something like this:
>>> from scrapy.contrib.loader import XPathItemLoader
>>> from scrapy.contrib.loader.processor import Join
>>> from scrapy.http import HtmlResponse
>>>
>>> body = """
... <html>
... <div class="short-description">
... {some mess with text, <br>, other html tags, etc}
... <div>
... <p>{some mess with text, <br>, other html tags, etc}</p>
... </div>
... <p>{some mess with text, <br>, other html tags, etc}</p>
... </div>
... </html>
... """
>>> response = HtmlResponse(url='http://example.com/', body=body)
>>>
>>> loader = XPathItemLoader(response=response)
>>>
>>> print loader.get_xpath('//div[@class="short-description"]/node()', Join())
{some mess with text, <br> , other html tags, etc}
<div>
<p>{some mess with text, <br>, other html tags, etc}</p>
</div>
<p>{some mess with text, <br>, other html tags, etc}</p>
>>>
>>> loader.get_xpath('//div[@class="short-description"]/node()', Join())
u'\n {some mess with text, <br> , other html tags, etc}\n
<div>\n <p>{some mess with text, <br>, other html tags, etc}</p>\n
</div> \n <p>{some mess with text, <br>, other html tags, etc}</p> \n'
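One detail in the output above: the top-level `<br>` comes back as `<br> ,` with an extra space. This is most likely because Join() uses a single space as its default separator, so a space is inserted between consecutive nodes. If the markup has to be preserved exactly, passing an empty separator, Join(u''), should avoid that. The effect can be sketched with plain str.join. The node list below is a hypothetical stand-in for what /node() extracts.

```python
# Hypothetical list of serialized child nodes, similar to what
# '//div[@class="short-description"]/node()' would extract.
nodes = ["{some mess with text, ", "<br>", ", other html tags, etc}"]

# Join() joins with a single space by default, which adds
# a space on each side of the <br> node.
with_space = " ".join(nodes)

# An empty separator, as with Join(u''), leaves the markup unchanged.
exact = "".join(nodes)

print(repr(with_space))
print(repr(exact))
```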