Scrapy xpath not working (maybe something with parbase?)

Question

This is the URL I was testing against. I'm trying to get the body text of the article "Co-viewing in television...". I've tried the following expressions:

[In 1]:response.xpath("//*[contains(@class, 'text parbase')]//text()").extract()
[Out 1]:[]

[In 2]:response.xpath("//*[contains(@class, 'text')]//text()").extract()
[Out 2]: [u'\n',
 u'\n',
 u'\n\n',
 u'\n    $CQ(function() {\n        CQ_Analytics.SegmentMgr.loadSegments("/etc/segmentation");\n         CQ_Analytics.ClientContextUtils.init("","/content/corporate/us/en/insights/journal-of-measurement/volume-1-issue-2/nott-alone-is-ott-making-it-cool-again-to-watch-tv-together");\n\n        \n    });\n',
 u'\n']

[In 3]:response.xpath("//p//text()").extract()
[Out 3]:[u'X']

None of them seem to return the content I want. Am I doing something wrong here? If this has already been answered, I'm sorry; I tried my best to find an answer but haven't found anything yet. Any help would be greatly appreciated. Thanks!

Answer 1

There seems to be a problem in the site's HTML output, and the parser Scrapy uses is unable to parse that section. As a workaround, you could pull the section out with a regular expression and parse it separately:

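import re
from scrapy import Selector

# Grab the "parbase toptext" div as raw markup, bypassing the HTML parser.
# response.text is the decoded body; a str pattern cannot match the bytes in response.body on Python 3.
section = re.match(r'.*(<div.*?parbase toptext.*?)</div>', response.text, re.DOTALL).group(1)
# Parse only the extracted fragment and collect its text nodes.
Selector(text=section).xpath('//text()').extract()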

Answer 2

From what I can see that page contains the following line:

<li><script src="https://apis.google.com/js/platform.js" asyncdefer=[NULL][NULL]

where [NULL] stands for a null byte.

This seems to throw off the parser. If I construct a selector using the response body with null bytes removed then it works.
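
A minimal sketch of that approach, in case it helps. The helper names strip_null_bytes and selector_without_nulls are made up for illustration, and UTF-8 is only an assumed encoding for the page; adjust it if the site declares something else.

```python
def strip_null_bytes(body, encoding="utf-8"):
    """Return the decoded page source with every NUL byte removed.

    `body` is the raw bytes of the response (response.body in Scrapy).
    The encoding is an assumption; undecodable bytes are replaced.
    """
    return body.replace(b"\x00", b"").decode(encoding, errors="replace")


def selector_without_nulls(body, encoding="utf-8"):
    """Build a Scrapy Selector from a body that has been cleaned of NUL bytes."""
    # Imported here so the cleaning helper above works without Scrapy installed.
    from scrapy import Selector

    return Selector(text=strip_null_bytes(body, encoding))
```

In the Scrapy shell this would be used as sel = selector_without_nulls(response.body), after which the original expression can be tried again on sel, e.g. sel.xpath("//*[contains(@class, 'text parbase')]//text()").extract().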

solution1
1 ACCEPTED 2017-09-20 08:26:51

solution2
1 2017-09-20 08:30:05
