
Scraping node with specific text using scrapy and xpath

I don't understand why the below doesn't work. I know there are related answers, but they didn't help me.

$ scrapy shell "http://edition.cnn.com"

There is an h2 tag with "CNN Money" as text inside. Why doesn't the below work?

>>> response.xpath('//h2[contains(string(), "CNN Money")]')
[]

I also tried text():

>>> response.xpath('//h2[contains(text(), "CNN Money")]')
[] 

The XPath expression is not the problem. The page content is generated dynamically, e.g. by JavaScript, so it is not in the HTML that Scrapy downloads. Check for yourself: search for "CNN Money" in the page source and you won't find a single hit. You need to render the page and parse the rendered output. I suggest using Splash together with the scrapy-splash library for that.
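As a quick way to run the "search the page source" check, here is a minimal sketch. The helper name appears_in_source and both HTML strings are made up for illustration; in the Scrapy shell the same check would be "CNN Money" in response.text.

```python
def appears_in_source(html: str, needle: str) -> bool:
    """Return True if `needle` occurs in the raw, unrendered HTML."""
    return needle in html


# Hypothetical stand-in for what Scrapy downloads: the heading is missing
# because a script would only add it later, in the browser.
raw_html = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

# Hypothetical stand-in for the page after JavaScript has run.
rendered_html = "<html><body><div id='app'><h2>CNN Money</h2></div></body></html>"

print(appears_in_source(raw_html, "CNN Money"))       # text absent from the raw source
print(appears_in_source(rendered_html, "CNN Money"))  # text present after rendering
```

If the check fails on the raw source, no XPath expression will match it, whichever text function you use.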

EDIT:

Run Splash using this command:

docker run -d -p 8050:8050 --restart=always scrapinghub/splash --max-timeout 3600

This raises the maximum timeout that Splash allows for requests. (See the Splash documentation for other options for running it in production.) You also need to raise the timeout field in the args parameter of SplashRequest, e.g.

yield scrapy_splash.SplashRequest(url, self.parse, endpoint='render.json', args={'timeout': 3600})
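For SplashRequest to work, the project also has to be pointed at the running Splash instance. Below is a sketch of the settings.py entries as I remember them from the scrapy-splash README. The URL assumes Splash is listening on localhost:8050, as in the docker command above; verify the middleware names and priorities against the README for your installed version.

```python
# settings.py (sketch; assumes Splash is reachable on localhost:8050)
SPLASH_URL = "http://localhost:8050"

# Middlewares that route requests through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Saves disk space when Splash arguments repeat across requests.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter and cache storage that take Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```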


 