Scraping node with specific text using scrapy and xpath

Question

I don't understand why the below doesn't work. I know there are related answers, but they didn't help me.

$ scrapy shell "http://edition.cnn.com"

There is an h2 tag with "CNN Money" as text inside. Why doesn't the below work?

>>> response.xpath('//h2[contains(string(), "CNN Money")]')
[]

I also tried text():

>>> response.xpath('//h2[contains(text(), "CNN Money")]')
[]

Answer 1

The XPath expression is not the problem. The page content is supplied dynamically, e.g. by JavaScript. Check for yourself: search for "CNN Money" in the page source and you won't find a match. You need to render the page and parse the rendered output. I suggest using Splash together with the scrapy-splash library for that.
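
The check described above can be sketched in a few lines of standard-library Python. This is only an illustration: the helper names (fetch_source, appears_in_source) and the two sample HTML strings are made up for this example, and a real page may behave differently (for instance, the text could appear inside an inline script even though no h2 contains it).

```python
from urllib.request import Request, urlopen


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Download the raw HTML as the server sends it (no JavaScript is executed)."""
    # Hypothetical helper; some sites reject requests without a User-Agent header.
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def appears_in_source(html: str, needle: str) -> bool:
    """Return True if the text is present in the unrendered page source."""
    return needle in html


# Made-up samples: one page ships the heading in its HTML, the other
# leaves an empty placeholder that a script would fill in later.
STATIC_HTML = "<html><body><h2>CNN Money</h2></body></html>"
DYNAMIC_HTML = (
    "<html><body><h2 id='money'></h2>"
    "<script src='/bundle.js'></script></body></html>"
)

if __name__ == "__main__":
    print(appears_in_source(STATIC_HTML, "CNN Money"))
    print(appears_in_source(DYNAMIC_HTML, "CNN Money"))
```

If appears_in_source(fetch_source(url), "CNN Money") comes back False for the real page, no XPath run against the plain Scrapy response can match that text, which is the situation described here.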

EDIT:

Run Splash using this command:

docker run -d -p 8050:8050 --restart=always scrapinghub/splash --max-timeout 3600

The --max-timeout option raises the maximum timeout Splash allows for requests. (See the Splash documentation for other options for running it in production.) You also need to raise the timeout field in the args parameter of SplashRequest, e.g.:

yield scrapy_splash.SplashRequest(url, self.parse, endpoint='render.json', args={'timeout': 3600})
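
For SplashRequest to go through Splash at all, scrapy-splash also has to be enabled in the project's settings.py. The fragment below is a sketch that follows the setup described in the scrapy-splash README; the SPLASH_URL value is an assumption that matches the docker command above (Splash on localhost, port 8050), so adjust it if Splash runs elsewhere.

```python
# settings.py (fragment): enable scrapy-splash for the project.

# Assumed address of the Splash container started with the docker command above.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares and their order, as given in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Reduces duplicated Splash arguments in stored requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```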
