Scrapy: extracting HTML from a div without the wrapping parent tag

Question

I use Scrapy to crawl a website.

I want to extract the contents of a certain div.

<div class="short-description">
{some mess with text, <br>, other html tags, etc}
</div>

loader.add_xpath('short_description', "//div[@class='short-description']/div")

With this code I get what I need, but the result includes the wrapping HTML ( <div class="short-description">...</div> ).

How can I get rid of that parent HTML tag?

Note: selectors like text() and node() cannot help me, because my div contains <br>, <p>, other divs, etc., as well as whitespace, and I need to keep all of them.

Answer 1

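from scrapy.selector import HtmlXPathSelector  # legacy Scrapy selector API

hxs = HtmlXPathSelector(response)
# text() selects only the text nodes directly inside the div
for text in hxs.select("//div[@class='short-description']/text()").extract():
    print text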

Answer 2

Try node() in combination with Join():

loader.get_xpath('//div[@class="short-description"]/node()', Join())

The results look something like this:

>>> from scrapy.contrib.loader import XPathItemLoader
>>> from scrapy.contrib.loader.processor import Join
>>> from scrapy.http import HtmlResponse
>>>
>>> body = """
...     <html>
...         <div class="short-description">
...             {some mess with text, <br>, other html tags, etc}
...             <div>
...                 <p>{some mess with text, <br>, other html tags, etc}</p>
...             </div>
...             <p>{some mess with text, <br>, other html tags, etc}</p>
...         </div>
...     </html>
... """
>>> response = HtmlResponse(url='http://example.com/', body=body)
>>>
>>> loader = XPathItemLoader(response=response)
>>>
>>> print loader.get_xpath('//div[@class="short-description"]/node()', Join())

            {some mess with text,  <br> , other html tags, etc}
             <div>
                <p>{some mess with text, <br>, other html tags, etc}</p>
            </div>
             <p>{some mess with text, <br>, other html tags, etc}</p>
>>>
>>> loader.get_xpath('//div[@class="short-description"]/node()', Join())
u'\n            {some mess with text,  <br> , other html tags, etc}\n
   <div>\n         <p>{some mess with text, <br>, other html tags, etc}</p>\n
   </div> \n     <p>{some mess with text, <br>, other html tags, etc}</p> \n'

Answer 1
2 2013-03-26 01:35:55

Answer 2
2 2013-03-26 03:33:00
