
XPath returns an empty list. Why is it ignoring the targeted div element?

I'm a newbie to XPath and Scrapy. I'm trying to target a node that does not have a unique class (i.e. class="pubBody").

Already tried: xpath not contains A and B

This should be a simple task, but XPath just misses the second item. I am doing this from the Scrapy shell. On the command prompt:

scrapy shell "http://www.sciencedirect.com/science/journal/00221694/"

I am looking for the second div:

<div id="issueListHeader" class="pubBody">...</div>

<div class="pubBody">...</div>

I can only get the first but not the second. The best answers to similar questions suggested trying something like:

hxs.xpath('//div[contains(@class,"pubBody") and not(contains(@id,"issueListHeader"))]') 

but this returns an empty list for some reason. Any help, please? I must be missing something silly; I've tried this for days!

Other details:

Once in the Scrapy shell:

import scrapy

hxs = scrapy.Selector(response)

hxs.xpath('//div[@class="pubBody"]')

Which works only for the first div element:

[<Selector xpath='//div[@class="pubBody"]' data='<div id="issueListHeader" class="pubBody'>]

For the second div element, which fails to match, I've also tried:

hxs.xpath('//div[@class="pubBody" and not(@id="issueListHeader")]').extract_first()

hxs.xpath('//div[starts-with(@class, "pubBody") and not(re:test(@id, "issueListHeader"))]')

I also copied the XPath directly from Chrome, but it too returns []:

hxs.xpath('//*[@id="issueList"]/div/form/div[2]')

The problem is that the HTML on this page is very far from well-formed. To demonstrate, see how the exact same CSS selector produces 0 results with Scrapy and 94 with BeautifulSoup:

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup(response.body, 'html5lib')  # note: "html5lib" has to be installed

In [3]: len(soup.select(".article h4 a"))
Out[3]: 94

In [4]: len(response.css(".article h4 a"))
Out[4]: 0

Same goes for the pubBody element you are trying to locate:

In [6]: len(response.css(".pubBody"))
Out[6]: 1

In [7]: len(soup.select(".pubBody"))
Out[7]: 2

So, try hooking up BeautifulSoup to fix/clean up the HTML, ideally through a middleware.
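For a quick check in the shell, you can also re-parse the response with BeautifulSoup and feed the repaired HTML back into a fresh Selector. A minimal sketch, assuming html5lib is installed:

from bs4 import BeautifulSoup
from scrapy import Selector

# let html5lib repair the malformed markup, then re-serialize it
soup = BeautifulSoup(response.body, 'html5lib')
fixed = Selector(text=str(soup))

# both pubBody divs should now be visible to XPath
fixed.xpath('//div[@class="pubBody" and not(@id="issueListHeader")]')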


I've created a simple scrapy_beautifulsoup middleware that is easy to hook into a project:

  • install it via pip:

     pip install scrapy-beautifulsoup
  • configure the middleware in settings.py:

     DOWNLOADER_MIDDLEWARES = {
         'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 543,
     }
     BEAUTIFULSOUP_PARSER = "html5lib"

Profit.
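For reference, the core idea of such a middleware is small. The following is only an illustrative sketch of the approach, not the actual scrapy_beautifulsoup source:

from bs4 import BeautifulSoup

class BeautifulSoupMiddleware:

    def __init__(self, crawler):
        # parser name comes from the BEAUTIFULSOUP_PARSER setting
        self.parser = crawler.settings.get('BEAUTIFULSOUP_PARSER', 'html.parser')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        # re-serialize the HTML through BeautifulSoup so that Scrapy's
        # lxml-based selectors see the repaired markup
        body = str(BeautifulSoup(response.body, self.parser))
        return response.replace(body=body)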

I suspect the problem is that the source of the page you're trying to parse (http://www.sciencedirect.com/science/journal/00221694/) is not valid XML, because the <link ...> elements do not have closing tags. There may be other problems, but those are the first ones I found.

I'm rusty on JavaScript, but you might try navigating down the DOM to a lower level in the page (i.e. body or some other node closer to the elements you're trying to target) and then running the XPath from that level.
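In Scrapy terms that would mean selecting a nearby container first and then querying relative to it. A small sketch, reusing the issueList id from the Chrome-copied XPath above (it still depends on the parser having kept that part of the tree):

# anchor on a container closer to the target, then query relative to it
container = hxs.xpath('//*[@id="issueList"]')
container.xpath('.//div[contains(@class, "pubBody")]')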

UPDATE: I just tried removing the <head> of the document and passing the rest through an XML parser, and it still breaks on several <input> nodes that are not closed. Unless I'm forgetting some special JavaScript XML/XPath rules that forgive missing closing tags, I suspect you might be better off using something like jQuery to find the elements you're looking for.
