Scrapy cannot extract text from <a>

On http://web.unep.org/inquiry/news, I want to get the headlines. According to Firefox's XPath Checker, the XPath is //div[@class='highlighter']/a (see http://i.stack.imgur.com/DeuG5.png).

But my code below gives me empty lines:

    import scrapy
    from unepinquiry.items import unepinquiryitem

    class unepInquirySpider(scrapy.Spider):
        name = "unepinquiry"
        allowed_domains = ["web.unep.org"]
        start_urls = ["http://web.unep.org/inquiry/news"]

        def parse(self, response):
            for sel in response.xpath('//div[@class="highlighter"]'):
                item = unepinquiryitem()
                item['title'] = sel.xpath('/a/text()').extract()
                yield item

The website's HTML markup is incorrect for the links inside those <div class="highlighter"> elements:

<div class="highlighter">
<a target="_blank" href="http://unep.org/newscentre/Default.aspx?DocumentID=27071&amp;ArticleID=36155&amp;l=en" />
New Report Shows How India Can Scale up Sustainable Finance</a>
<p><strong>Mumbai, 29 April 2016</strong> - India has set ambitious goals for inclusive and sustainable development, which require the mobilization of additional low-cost, long-term capital. A new report launched today by the United Nations Environment Programme (UNEP) and the Federation of Indian Chambers of Commerce and Industry (FICCI) shows how the country is already introducing innovative approaches to attract private capital for green assets - and outlines a number of key steps to deepen this process in India.</p>
<p>The report, entitled <a href="http://unepinquiry.org/wp-content/uploads/2016/04/Delivering_a_Sustainable_Financial_System_in_India.pdf">Delivering a Sustainable Financial System in India</a> profiles the actions that have been taken to advance environmental and social factors as a core part of India's banking, capital markets, investment and insurance sectors. It was jointly produced by FICCI and the UNEP Inquiry into the Design of a Sustainable Financial System, backed by a high-level India advisory council.</p>

<div class="clear"></div>
</div>

The link tags are self-closed (scroll to the right):

<a target="_blank"
   href="http://unep.org/newscentre/Default.aspx?DocumentID=27071&amp;ArticleID=36155&amp;l=en" />
                                                                                                ^
                                                                                                |
                                                                                              here

And the title has a trailing closing </a> tag:

New Report Shows How India Can Scale up Sustainable Finance</a>

This confuses the lxml parser (which Scrapy uses under the hood).

You can check how Scrapy "sees" the HTML by printing each <div> (calling .extract() on them to have the HTML serialization):

>>> for div in response.xpath('//div[@class="highlighter"]'):
...     print("-------------")
...     print(div.extract())
... 
-------------
<div class="highlighter">
<a target="_blank" href="http://unep.org/newscentre/Default.aspx?DocumentID=27071&amp;ArticleID=36155&amp;l=en"></a>
New Report Shows How India Can Scale up Sustainable Finance
<p><strong>Mumbai, 29 April 2016</strong> - India has set ambitious goals for inclusive and sustainable development, which require the mobilization of additional low-cost, long-term capital. A new report launched today by the United Nations Environment Programme (UNEP) and the Federation of Indian Chambers of Commerce and Industry (FICCI) shows how the country is already introducing innovative approaches to attract private capital for green assets - and outlines a number of key steps to deepen this process in India.</p>
<p>The report, entitled <a href="http://unepinquiry.org/wp-content/uploads/2016/04/Delivering_a_Sustainable_Financial_System_in_India.pdf">Delivering a Sustainable Financial System in India</a> profiles the actions that have been taken to advance environmental and social factors as a core part of India's banking, capital markets, investment and insurance sectors. It was jointly produced by FICCI and the UNEP Inquiry into the Design of a Sustainable Financial System, backed by a high-level India advisory council.</p>

<div class="clear"></div>
</div>
-------------
<div class="highlighter">
<a target="_blank" href="http://unep.org/newscentre/default.aspx?DocumentID=27071&amp;ArticleID=36139"></a>
Green Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth
<p><strong>Washington, D.C., 16 April 2016</strong> – The Paulson Institute and the Green Finance Committee of China Society for Finance and Banking convened a half-day symposium of global finance leaders and experts to discuss recommendations for the development of robust global green finance mechanisms and markets. The recommendations coming out of the meetings will be provided to the G20 Green Finance Study Group, which is chaired by the People’s Bank of China and the Bank of England. The study group will finalize a synthesized report for the G20. SIFMA, Bloomberg Philanthropies and United Nations Environment Programme also co-hosted the event.</p>

<div class="clear"></div>
</div>
(...)

So you can see that the title you're after becomes a text node just after the <a>.
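To make that tree shape concrete, here is a minimal sketch using only Python's standard library. It is an analogy, not what Scrapy runs: it feeds an already-repaired, shortened copy of the first <div> to xml.etree.ElementTree (a strict XML parser, so it would reject the original broken markup). ElementTree, like lxml, stores the text that follows an element in that element's .tail attribute.

```python
import xml.etree.ElementTree as ET

# Shortened, already-repaired markup, shaped like the serialization Scrapy
# printed for the first <div>. The single <p> stands in for the real paragraphs.
fragment = (
    '<div class="highlighter">'
    '<a target="_blank" href="http://unep.org/newscentre/Default.aspx'
    '?DocumentID=27071&amp;ArticleID=36155&amp;l=en"></a>'
    '\nNew Report Shows How India Can Scale up Sustainable Finance\n'
    '<p>Mumbai, 29 April 2016</p>'
    '</div>'
)

div = ET.fromstring(fragment)
link = div.find("a")

inside_link = link.text   # text between <a> and </a>: the element is empty
after_link = link.tail    # text between </a> and the next sibling element

print(repr(inside_link))
print(repr(after_link))
```

The empty .text is why a/text() finds nothing, and the headline sitting in .tail is the sibling text node that the title has turned into.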

In XPath, this is accessed using the following-sibling axis. So for each <a>, following-sibling::text() will select the text nodes that come after it (text() being a "node test") at the same level in the HTML tree ("sibling"):

>>> for div in response.xpath('//div[@class="highlighter"]'):
...     item = {}
...     item['title'] = div.xpath('./a/following-sibling::text()').extract()
...     print(item)
... 
{'title': ['\nNew Report Shows How India Can Scale up Sustainable Finance\n', '\n', '\n\n', '\n']}
{'title': ['\nGreen Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth\n', '\n\n', '\n']}
(...)
{'title': ['\nThe Inquiry Speaks at PRI in Person in London, UK\n', '\n\n', '\n']}
{'title': ['\nReshaping Finance for Sustainability\n', '\n\n', '\n']}
>>> 

You can see that following-sibling::text() is also matching some other text nodes: '\n\n', '\n'.

You can get rid of them using a [1] predicate at the end of the XPath expression to select the 1st match:

>>> for div in response.xpath('//div[@class="highlighter"]'):
...     item = {}
...     item['title'] = div.xpath('./a/following-sibling::text()[1]').extract()
...     print(item)
... 
{'title': ['\nNew Report Shows How India Can Scale up Sustainable Finance\n']}
{'title': ['\nGreen Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth\n']}
(...)
{'title': ['\nThe Inquiry Speaks at PRI in Person in London, UK\n']}
{'title': ['\nReshaping Finance for Sustainability\n']}
>>> 

You can also use .extract_first() to get only the 1st element, and not a list:

>>> for div in response.xpath('//div[@class="highlighter"]'):
...     item = {}
...     item['title'] = div.xpath('./a/following-sibling::text()').extract_first()
...     print(item)
... 
{'title': '\nNew Report Shows How India Can Scale up Sustainable Finance\n'}
{'title': '\nGreen Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth\n'}
(...)
{'title': '\nThe Inquiry Speaks at PRI in Person in London, UK\n'}
{'title': '\nReshaping Finance for Sustainability\n'}
