
Crawl a music website to get its lyrics

I want to crawl a lyrics website, http://mp3.zing.vn/bai-hat/Vi-Anh-La-Soai-Ca-Dam-Vinh-Hung/ZW78EUE8.html, to get the song's name, artist, genre and lyrics. I wrote the following code and saved it as mp3_spider.py:

import scrapy

class MP3Spider(scrapy.Spider):
    name = "mp3"
    start_urls = ['http://mp3.zing.vn/bai-hat/Vi-Anh-La-Soai-Ca-Dam-Vinh-Hung/ZW78EUE8.html']

def parse(self, response):
    yield
    {
        'song': response.css('.txt-primary h1::text').extract()[0],
        'artist': response.css('.artist-track-log a::text').extract()[0],
        'genre': response.css('.genre-track-log::text').extract()[0],
        'lyrics': response.css('.fn-content::text').extract()[0]
    }

I ran it from the command line:

scrapy runspider mp3_spider.py -o mp3.json

but it returns nothing. Can anyone show me how to make it work? Thank you very much for your help.

Your class, MP3Spider, doesn't actually do anything because parse is a standalone function there, defined at module level rather than inside the class. If you indent parse so that it sits inside the class body, like this, it'll at least run.

import scrapy

class MP3Spider(scrapy.Spider):
    name = "mp3"
    start_urls = ['http://mp3.zing.vn/bai-hat/Vi-Anh-La-Soai-Ca-Dam-Vinh-Hung/ZW78EUE8.html']

    # Indented one level so that parse is a method of MP3Spider.
    def parse(self, response):
        # The dict opens on the same line as yield; a bare yield would emit None.
        yield {
            'song': response.css('.txt-primary h1::text').extract()[0],
            'artist': response.css('.artist-track-log a::text').extract()[0],
            'genre': response.css('.genre-track-log::text').extract()[0],
            'lyrics': response.css('.fn-content::text').extract()[0]
        }
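Two plain-Python details decide whether a spider shaped like this emits anything: whether parse is indented into the class, and whether the dict opens on the same line as yield. Here is a minimal sketch of both that runs without Scrapy. BaseSpider, Broken, Fixed, bare_yield and dict_yield are hypothetical names made up for the illustration; BaseSpider is only a stand-in and not Scrapy's real class.

```python
class BaseSpider:
    """Hypothetical stand-in for a spider base class; this is not Scrapy."""

    def parse(self, response):
        raise NotImplementedError("parse callback is not defined")


class Broken(BaseSpider):
    name = "broken"


def parse(self, response):  # module level: NOT a method of Broken
    yield {"song": response}


class Fixed(BaseSpider):
    name = "fixed"

    def parse(self, response):  # indented: overrides BaseSpider.parse
        yield {"song": response}


def bare_yield(value):
    yield            # yields None
    {"song": value}  # a dict expression that is built and then discarded


def dict_yield(value):
    yield {"song": value}  # yields the dict itself


print(Broken.parse is BaseSpider.parse)  # True: the override never happened
print(Fixed.parse is BaseSpider.parse)   # False: Fixed has its own parse
print(list(bare_yield("x")))             # [None]
print(list(dict_yield("x")))             # [{'song': 'x'}]
```

So even with the indentation fixed, a yield that sits alone on its line hands back None instead of the item.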

I took the liberty of recreating the scenario. To add to the previous answer: indentation levels are extremely important to how Python interprets your code, because they determine which block each statement belongs to. In addition:

    def parse(self, response):
        yield {
            'song': response.css('.txt-primary h1::text').extract()[0],  # here
            'artist': response.css('.artist-track-log a::text').extract()[0],  # here
            'genre': response.css('.genre-track-log::text').extract()[0],  # here
            'lyrics': response.css('.fn-content::text').extract()[0]  # here
        }

May I ask how you came up with your extract values? I assume you did not use "scrapy shell 'your.com...". I assume so because trying your selectors there would have told you that index [0] is out of range, at least for the paths you selected.
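To make the [0] problem concrete: when a selector matches nothing, .extract() returns an empty list, and indexing an empty list raises IndexError. Here is a minimal sketch with plain Python lists standing in for selector results; first_or_default is a hypothetical helper and not part of Scrapy (extract_first(), used further down, plays a similar role by returning None when nothing matches).

```python
def first_or_default(values, default=None):
    """Return the first item of a list, or default when the list is empty."""
    return values[0] if values else default


matches = []  # stands in for what .extract() returns when nothing matches

try:
    matches[0]
except IndexError as exc:
    print("IndexError:", exc)

print(first_or_default(matches))           # None instead of an exception
print(first_or_default(["sample title"]))  # sample title
```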

I took the liberty of fixing your code up. But since I don't know Vietnamese, you might have to mess around with some of the regex.

Tips:

  1. Though not strictly necessary, when you are scraping content that has paragraphs, it's best to use itemized selections. In my experience they make grouping large bodies of text easier and need less regex.

  2. Get used to using the Scrapy shell and do all your path selections in there. This will save you a lot of time, especially if you make a habit of typing view(response) as the very first thing. Dynamically loaded pages, or pages that block Scrapy's default user agent header, won't be as easy as a regular page (still pretty easy; there are always ways around that).
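One way to read tip 1: keep the selected text as a list of fragments and clean each item, rather than pulling lines out of one big string with a regex. This is a rough sketch under that reading. clean_fragments and the sample fragments are hypothetical and only mimic what a ::text selection over a lyrics block might return.

```python
def clean_fragments(fragments):
    """Strip each text fragment and drop the ones that end up empty."""
    stripped = (fragment.strip() for fragment in fragments)
    return [line for line in stripped if line]


# Hypothetical sample: text nodes separated by <br> tags usually arrive
# with stray newlines and padding around them.
fragments = ["\n  First line of the song", "\n", "  Second line  ", "\n\n"]

lyrics = "\n".join(clean_fragments(fragments))
print(lyrics)
```

Each list item stays a separate lyric line, so grouping them is a join rather than another pattern.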

__author__ = 'Tô Vạn Hưng'
__credits__ = 'scriptso'  # just helping a brother from the far east
import scrapy

class MP3Spider(scrapy.Spider):
    name = "mp3"
    start_urls = ['http://mp3.zing.vn/bai-hat/Vi-Anh-La-Soai-Ca-Dam-Vinh-Hung/ZW78EUE8.html']

    def parse(self, response):
        # The dict opens on the same line as yield; a bare yield would emit None.
        yield {
            # extract() returns a list; extract_first() returns one value or None.
            'song': response.css('s.fn-name::text').extract(),
            'artist': response.css('.inline h2::text').extract_first(),
            # Raw strings keep the regex escapes (\n, \w) intact.
            'genre': response.xpath("//div[@class='inline']/h2/a[contains(font.font,'')]//text()").re(r'[^\n].*\w')[2:],
            'lyrics': response.css('.fn-wlyrics.fn-content::text').re(r'[^\n].*\w+'),
        }
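If you do need to tweak the regex, it helps to see what the pattern in the .re(...) calls keeps and drops. This sketch uses the standard re module on a made-up sample string, on the assumption that .re() behaves like re.findall applied to each selected text node. LINE_PATTERN and sample are hypothetical names.

```python
import re

# Same pattern as in the spider: a non-newline character, then anything up
# to the last word character on that line.
LINE_PATTERN = r'[^\n].*\w'

# Made-up text with blank lines, leading spaces and trailing punctuation.
sample = "\n  First line, with comma!\n\nSecond line...\n"

print(re.findall(LINE_PATTERN, sample))
```

On this sample the result is ['  First line, with comma', 'Second line']: blank lines and trailing punctuation are dropped, while leading spaces are kept. A line consisting of a single character would not match at all, since the pattern needs at least two characters.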
