scrapy shell xpath returns empty list from itunes.apple.com

scrapy shell 'https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4'

I want to get the album name "no tears left to cry - Single" from this page:

iTunes chart, music preview page: "no tears left to cry - Single / Ariana Grande"

The album name's XPath is: //*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1

So I tried:

response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1')

but the result was [].

How can I get album information from this website?

This is because Scrapy doesn't wait for JavaScript to load, so the element is not in the response yet. You need to use scrapy-splash; see my answer on how to set up your Scrapy project with scrapy-splash.

If I use scrapy-splash, I get the result:
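The linked setup is not reproduced in this thread. As a rough sketch, the scrapy-splash README describes project settings along these lines. SPLASH_URL assumes a Splash instance running locally on port 8050, which matches the log output in this answer; adjust it to wherever your Splash actually runs.

```python
# settings.py (sketch; values follow the scrapy-splash README, adjust to your setup)

# Assumption: a Splash instance is listening locally, e.g. started via Docker.
SPLASH_URL = "http://localhost:8050"

# Route requests through Splash and keep cookies and compression working.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments for every request.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make the duplicate filter aware of Splash request arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

With these in place, a SplashRequest is sent to Splash, which renders the page before the response reaches your callback.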

2018-06-30 20:50:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27 via http://localhost:8050/render.html> (referer: None)
2018-06-30 20:50:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27>
{'title': 'no tears left to cry - Single'}

Here is my simple spider:

import scrapy
from scrapy_splash import SplashRequest


class TestSpider(scrapy.Spider):
    name = "test"

    start_urls = ['https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4%27']

    def start_requests(self):
        for url in self.start_urls:
            # Send the request through Splash so the page's JavaScript is executed
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='render.html',
                                )

    def parse(self, response):
        # response now holds the rendered HTML, so the h1 exists in the DOM
        yield {'title': response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1//text()').extract_first()}

You can also do this with scrapy shell:

scrapy shell 'http://localhost:8050/render.html?url=https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4'

In [2]: response.xpath('//*[@id="ember653"]/section[1]/div/div[2]/div[1]/div[2]/header/h1//text()').extract_first()
Out[2]: 'no tears left to cry - Single'
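One thing to watch in the shell command above: the target URL contains its own `&`-separated query string, so everything after the first `&` is read as an argument to render.html rather than as part of the iTunes URL. It works here, but a safer habit is to URL-encode the target. A minimal sketch, where `splash_render_url` is a hypothetical helper and the `wait` value of 0.5 seconds is an arbitrary assumption:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: Splash is reachable at this address, as in the command above.
SPLASH_RENDER = "http://localhost:8050/render.html"


def splash_render_url(target_url, wait=0.5):
    """Build a render.html URL with the target page URL-encoded as one parameter."""
    return SPLASH_RENDER + "?" + urlencode({"url": target_url, "wait": wait})


target = ("https://itunes.apple.com/us/album/no-tears-left-to-cry/1374085537"
          "?i=1374087460&v0=WWW-NAUS-ITSTOP100-SONGS&l=en&ign-mpt=uo%3D4")
shell_url = splash_render_url(target)

# Round-trip check: the full target URL survives as a single 'url' argument.
assert parse_qs(urlsplit(shell_url).query)["url"] == [target]

print(shell_url)
```

The printed URL can be passed to scrapy shell in single quotes, the same way as the command above.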

You'd better avoid JS rendering, which is slow, heavy and buggy. Spend 5 minutes in Chrome's "Network" tab to find the source of the data. It is usually built into the page source or delivered via XHR requests.

In this case, all the data you want can be found on the page itself, but you should check its source code, not the rendered version. Use Ctrl+U in Chrome and then Ctrl+F to find all the parts you need.

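import json

    # Inside your spider class:
    def parse(self, response):
        # The album metadata is embedded as JSON in a <script> tag of the raw page source
        track_data = response.xpath('//script[@name="schema:music-album"]/text()').extract_first()
        track_json = json.loads(track_data)
        track_title = track_json['name']
        yield {'title': track_title}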

This will do the trick in this case, and it works about 5-7 times faster than Splash.
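To try the idea without Scrapy or a network connection, here is a small standard-library sketch. `SAMPLE_HTML` is a made-up stand-in for the raw page source (not Apple's real markup), and `ScriptJsonExtractor` and `extract_album_title` are hypothetical names for illustration. It pulls the JSON out of the named script tag and reads the album name from it, the same way the XPath version does.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collect the text of <script> tags whose name attribute matches."""

    def __init__(self, script_name):
        super().__init__()
        self.script_name = script_name
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("name") == self.script_name:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


# Made-up sample standing in for the unrendered page source.
SAMPLE_HTML = """
<html><head>
<script name="schema:music-album" type="application/ld+json">
{"@type": "MusicAlbum", "name": "no tears left to cry - Single"}
</script>
</head><body><div id="ember653"></div></body></html>
"""


def extract_album_title(html):
    """Return the 'name' field from the embedded album JSON."""
    parser = ScriptJsonExtractor("schema:music-album")
    parser.feed(html)
    data = json.loads("".join(parser.chunks))
    return data["name"]


title = extract_album_title(SAMPLE_HTML)
print(title)
```

Note that the sample's `div` with id `ember653` is empty: the title comes from the embedded JSON, so no JavaScript has to run.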
