

Web Scraping Rap lyrics on Rap Genius w/ Python

I am somewhat of a coding novice, and I have been trying to scrape Andre 3000's lyrics off Rap Genius, http://genius.com/artists/Andre-3000 , using Beautiful Soup (a Python library for pulling data out of HTML and XML files). My end goal is to have the data in string format. Here is what I have so far:

from bs4 import BeautifulSoup
from urllib2 import urlopen

BASE_URL = "http://rapgenius.com"
artist_url = "http://rapgenius.com/artists/Andre-3000"

def get_song_links(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    container = soup.find("div", "container")
    # Build absolute links from the <dd> entries in the song list
    song_links = [BASE_URL + dd.a["href"] for dd in container.findAll("dd")]
    print song_links
    # Also print every href on the page
    for link in soup.find_all('a'):
        print(link.get('href'))

get_song_links(artist_url)

So I need help with the rest of the code. How do I get his lyrics into string format? And then how do I use the Natural Language Toolkit (NLTK) to tokenize the sentences and words?

Here's an example of how to grab all of the song links on the page, follow them, and get the song lyrics:

from urlparse import urljoin
from bs4 import BeautifulSoup
import requests


BASE_URL = "http://genius.com"
artist_url = "http://genius.com/artists/Andre-3000/"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36'}
response = requests.get(artist_url, headers=headers)

soup = BeautifulSoup(response.text, "lxml")
for song_link in soup.select('ul.song_list > li > a'):
    link = urljoin(BASE_URL, song_link['href'])
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, "lxml")
    lyrics = soup.find('div', class_='lyrics').text.strip()

    # tokenize `lyrics` with nltk

Note that the requests module is used here. Also note that the User-Agent header is required, since the site returns 403 Forbidden without it.
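As an aside, the urljoin call in the loop is what turns the page's relative hrefs into absolute links. A quick sanity check (the song slug below is hypothetical, and this uses Python 3's urllib.parse rather than Python 2's urlparse):

```python
from urllib.parse import urljoin  # Python 3 location of urljoin

BASE_URL = "http://genius.com"
# "/Andre-3000-hey-ya-lyrics" is a made-up slug for illustration
link = urljoin(BASE_URL, "/Andre-3000-hey-ya-lyrics")
print(link)
# http://genius.com/Andre-3000-hey-ya-lyrics
```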

First, for each link you will need to download that page and parse it with BeautifulSoup. Then look for a distinguishing attribute on that page that separates lyrics from other page content. I found <a data-editorial-state="accepted" data-classification="accepted" data-group="0"> to be helpful. Then run a .find_all on the lyrics page content to get all lyric lines. For each line you can call .get_text() to get the text.
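As a sketch of that approach, here is the find_all / get_text step run against a small inline snippet that mimics the markup described above (the real page structure may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet resembling the lyric markup on the page
html = """
<div class="lyrics">
  <a data-editorial-state="accepted" data-classification="accepted" data-group="0">First lyric line</a>
  <a data-editorial-state="accepted" data-classification="accepted" data-group="1">Second lyric line</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Match the distinguishing attribute, then pull the text of each line
lines = [a.get_text() for a in soup.find_all("a", attrs={"data-editorial-state": "accepted"})]
print(lines)
```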

As for NLTK, once it is installed you can import it and parse sentences like so:

from nltk.tokenize import word_tokenize, sent_tokenize
words = [word_tokenize(t) for t in sent_tokenize(lyric_text)]

This will give you a list of all words in each sentence.
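As a rough illustration of the shape of that output, here is a simple regex stand-in that runs without NLTK's punkt data (real word_tokenize output can differ on punctuation and contractions):

```python
import re

lyric_text = "Alright, alright, alright. So fresh and so clean."
# Naive sentence split on terminal punctuation, then a regex word tokenizer;
# a crude stand-in for nltk's sent_tokenize / word_tokenize.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", lyric_text) if s]
words = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]
print(words)
```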

GitHub / jashanj0tsingh / LyricsScraper.py provides basic scraping of lyrics off genius.com into a text file, where each row represents a song. It takes the artist's name as input. The generated text file can then easily be fed to your custom NLTK code or a general parser to do what you want.

The code is below:

# A simple script to scrape lyrics from genius.com based on artist name.

import re
import requests
import time
import codecs

from bs4 import BeautifulSoup
from selenium import webdriver

mybrowser = webdriver.Chrome("path\to\chromedriver\binary")  # Path to the web driver binary you wish to automate

user_input = input("Enter Artist Name = ").replace(" ", "+")  # Artist name, URL-encoded for the search query
base_url = "https://genius.com/search?q=" + user_input  # Append the artist name to the search query
mybrowser.get(base_url)  # Open in browser

t_sec = time.time() + 60 * 20  # seconds * minutes
while time.time() < t_sec:  # Scroll for a fixed time. TODO: better condition to detect the end of the page.
    mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    html = mybrowser.page_source
    soup = BeautifulSoup(html, "html.parser")
    time.sleep(5)

pattern = re.compile(r"[\S]+-lyrics$")  # Match links that end with "-lyrics"
pattern2 = re.compile(r"\[(.*?)\]")  # Strip tags such as [Intro], [Chorus] etc. from the lyrics

with codecs.open('lyrics.txt', 'a', 'utf-8-sig') as myfile:
    for link in soup.find_all('a', href=True):
        if pattern.match(link['href']):
            f = requests.get(link['href'])
            lyricsoup = BeautifulSoup(f.content, "html.parser")
            # lyrics = lyricsoup.find("lyrics").get_text().replace("\n", "")  # Each song on one line
            lyrics = lyricsoup.find("lyrics").get_text()  # Line by line
            lyrics = re.sub(pattern2, "", lyrics)
            myfile.write(lyrics + "\n")
mybrowser.close()
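The pattern2 substitution above, which strips section tags like [Intro] and [Chorus], can be checked in isolation (the sample text here is made up for illustration):

```python
import re

pattern2 = re.compile(r"\[(.*?)\]")  # Same pattern as in the script above

sample = "[Chorus]\nAlright, alright, alright\n[Verse 1]\nSo fresh and so clean"
cleaned = re.sub(pattern2, "", sample)
print(cleaned)
```

The bracketed tags are removed while the lyric lines themselves are left intact.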

Hope this is still relevant! I'm doing the same thing with Eminem's lyrics, but from lyrics.com. Does it have to be from Rap Genius? I found lyrics.com easier to scrape.

To get Andre 3000's, just change the code accordingly.

Here's my code; it gets the song links, then scrapes those pages for the lyrics and appends them to a list:

import re
import requests
import nltk
from bs4 import BeautifulSoup

url = 'http://www.lyrics.com/eminem'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
gdata = soup.find_all('div', {'class': 'row'})

eminemLyrics = []

for item in gdata:
    title = item.find_all('a', {'itemprop': 'name'})[0].text
    lyricsdotcom = 'http://www.lyrics.com'
    for link in item('a'):
        try:
            lyriclink = lyricsdotcom + link.get('href')
            req = requests.get(lyriclink)
            lyricsoup = BeautifulSoup(req.content, "html.parser")
            lyricdata = lyricsoup.find_all('div', {'id': re.compile('lyric_space|lyrics')})[0].text
            eminemLyrics.append([title, lyricdata])
            print title
            print lyricdata
            print
        except Exception:
            pass  # Skip links that don't resolve to a lyrics page

This will give you the lyrics in a list. To print all titles:

titles = [i[0] for i in eminemLyrics]
print titles

To get a specific song:

titles.index('Cleaning out My Closet')
120

To tokenize the song, plug that value (120) in:

song = nltk.word_tokenize(eminemLyrics[120][1])
nltk.pos_tag(song)

Even if you can scrape the site, that doesn't mean you should. Instead, you can use the Genius API; just create an access token on the Genius API site:

import lyricsgenius as genius  # Client library for the Genius API

api = genius.Genius('youraccesstokenhere12345678901234567890isreallylongiknow')
artist = api.search_artist('The artist name here')
# You can change the parameters according to your needs. I don't recommend
# using the saved file directly, because it contains a lot of data you might
# not need and it will take more time to clean.
aux = artist.save_lyrics(format='json', filename='artist.txt', overwrite=True,
                         skip_duplicates=True, verbose=True)

titles = [song['title'] for song in aux['songs']]  # In this case I just want title and lyrics
lyrics = [song['lyrics'] for song in aux['songs']]
thingstosave = []
for i in range(len(titles)):
    thingstosave.append(titles[i])
    thingstosave.append(lyrics[i])
with open("C:/whateverfolder/alllyrics.txt", "w") as output:
    output.write(str(thingstosave))
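Once the JSON file exists, the title/lyrics extraction above amounts to walking a 'songs' list. A minimal sketch of that cleanup step, assuming the saved structure is a dict with a 'songs' list of entries containing 'title' and 'lyrics' keys (field names can vary between lyricsgenius versions, and the sample data here is made up):

```python
import json

# Stand-in for reading the file save_lyrics() wrote; hypothetical contents
raw = '{"songs": [{"title": "Hey Ya!", "lyrics": "Shake it like a Polaroid picture"}, {"title": "Roses", "lyrics": "Caroline! Caroline!"}]}'
data = json.loads(raw)

# Keep only the fields we care about, paired per song
pairs = [(s["title"], s["lyrics"]) for s in data["songs"]]
for title, lyrics in pairs:
    print(title, "->", lyrics)
```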
