Scrapy xpath not working (maybe something with parbase?)

This is the URL I was trying it out on. I was trying to get the body content of the article "Co-viewing in television...". I've tried the following expressions:

[In 1]:response.xpath("//*[contains(@class, 'text parbase')]//text()").extract()
[Out 1]:[]

[In 2]:response.xpath("//*[contains(@class, 'text')]//text()").extract()
[Out 2]: [u'\n',
 u'\n',
 u'\n\n',
 u'\n    $CQ(function() {\n        CQ_Analytics.SegmentMgr.loadSegments("/etc/segmentation");\n         CQ_Analytics.ClientContextUtils.init("","/content/corporate/us/en/insights/journal-of-measurement/volume-1-issue-2/nott-alone-is-ott-making-it-cool-again-to-watch-tv-together");\n\n        \n    });\n',
 u'\n']

[In 3]:response.xpath("//p//text()").extract()
[Out 3]:[u'X']

None of them returns the content I'm after. Am I doing something wrong here? If this has already been answered, I'm sorry; I've done my best to find an answer but haven't found anything yet. Any help would be greatly appreciated. Thanks!

There seems to be a problem in the HTML output of the website, and the Scrapy parser is unable to parse that section. As a workaround, you could extract the content with a regular expression:

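import re
from scrapy import Selector

# Grab the target <div> from the raw body with a regex, then parse only that fragment.
# On Python 3, response.body is bytes; pass the decoded text (response.text) instead.
section = re.match(r'.*(<div.*?parbase toptext.*?)</div>', response.body, re.DOTALL).group(1)
Selector(text=section).xpath('//text()').extract()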

From what I can see, that page contains the following line:

<li><script src="https://apis.google.com/js/platform.js" asyncdefer=[NULL][NULL]

where [NULL] stands for a null byte.

This seems to throw off the parser. If I construct a selector from the response body with the null bytes removed, it works.
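
The null-byte removal described above could look roughly like the sketch below. The helper names strip_null_bytes and build_selector are made up for illustration, and the code assumes the page body is already available as decoded text (for example response.text in a reasonably recent Scrapy); it is not necessarily how the answerer did it.

```python
def strip_null_bytes(html):
    """Return the HTML text with every NUL character removed."""
    return html.replace("\x00", "")


def build_selector(html):
    """Build a Scrapy Selector from HTML after stripping NUL characters.

    Scrapy is imported here so that strip_null_bytes can be used on its own.
    """
    from scrapy import Selector

    return Selector(text=strip_null_bytes(html))
```

In the Scrapy shell, something like build_selector(response.text).xpath("//*[contains(@class, 'text parbase')]//text()").extract() should then be able to reach the article body, assuming the null bytes were the only thing stopping the parser.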
