Retrieve the whole lyrics from a URL

Question

I am trying to retrieve all the lyrics of a band from the web. I have noticed that the site builds its URLs using the pattern ".../firstletter/bandname/songname.html".

Here is an example:

http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html

I was thinking about creating a function that would call read.csv on the URLs. Getting the titles is the easy part, because I can copy and paste them and save them as a .csv file. I would then call the function on each value of that vector to construct the URL.

But I tried to read the first one just to see what it looks like, and I found that there would be too much "cleaning the data" if my goal is to build a csv file with each lyric.
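The URL-building step described above could be sketched roughly as follows. The helper names `slugify` and `build_lyrics_url` are hypothetical, and the rule of lowercasing and dropping every character that is not a letter or digit is only inferred from the single example URL, so it may not hold for every title on the site.

```r
# Hypothetical helper: reduce a band or song name to the lowercase,
# letters-and-digits-only form seen in the example URL.
slugify <- function(x) {
  tolower(gsub("[^A-Za-z0-9]", "", x))
}

# Hypothetical helper: build one URL per song title (vectorised over `song`).
# The base path is copied from the example URL in the question.
build_lyrics_url <- function(band, song,
                             base = "http://www.azlyrics.com/lyrics") {
  paste0(base, "/", slugify(band), "/", slugify(song), ".html")
}

titles <- c("It's a Long Way to the Top (If You Wanna Rock 'n' Roll)")
urls <- build_lyrics_url("AC/DC", titles)
print(urls)
```

In practice `titles` would come from the saved .csv file rather than being typed inline.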

x <-read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))

I think my approach is not the best (or maybe I need a better data cleaning strategy).

Answer 1

The HTML page has a tell for where the lyrics begin:

Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.

Taking advantage of that, you can detect this string and then read everything up to the end of the div:

m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")

# Marker sentence that precedes the lyrics. Use the full line instead
# if you think one of the lyrics might contain this sentence.
giveaway <- "Sorry about that."

start <- grep(giveaway, m, fixed = TRUE)[1] + 1 # Line where the lyrics start
# Take the first </div> after the start of the lyrics. grep() returns a
# position relative to `start`, so convert it back to an index into m.
end <- grep("</div>", m[start:length(m)], fixed = TRUE)[1] + start - 1

# This is just an example of how to clear the remaining tags and join the text.
lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")

And then:

> cat(lyrics) #using cat() prints the line breaks
Ridin' down the highway
Goin' to a show
Stop in all the byways
Playin' rock 'n' roll 
.
.
.
Well it's a long way
It's a long way, you should've told me
It's a long way, such a long way
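To apply this to many songs, the same steps could be wrapped in a function that takes the lines of a page and returns the lyrics. The sketch below is one possible shape, not a tested scraper: the name `extract_lyrics` is hypothetical, and the sample `page` vector is made up to mimic the structure described above so the function can be tried without a network connection.

```r
# Hypothetical wrapper around the marker-based extraction.
# `lines` is a character vector such as the result of readLines(url).
extract_lyrics <- function(lines, giveaway = "Sorry about that.") {
  hit <- grep(giveaway, lines, fixed = TRUE)
  if (length(hit) == 0 || hit[1] == length(lines)) return(NA_character_)
  start <- hit[1] + 1

  # Position of the first </div> at or after `start`, relative to `start`
  close <- grep("</div>", lines[start:length(lines)], fixed = TRUE)
  if (length(close) == 0) return(NA_character_)
  end <- start + close[1] - 1

  # Drop any remaining tags, trim each line, and join with line breaks
  text <- trimws(gsub("<[^>]+>", "", lines[start:end]))
  sub("\\s+$", "", paste(text, collapse = "\n"))
}

# Made-up sample that imitates the page layout (assumption, not real HTML)
page <- c(
  "<div>",
  "<!-- Usage of azlyrics.com content ... Sorry about that. -->",
  "Ridin' down the highway<br>",
  "Goin' to a show<br>",
  "</div>",
  "<div>footer</div>"
)
lyrics <- extract_lyrics(page)
cat(lyrics)
```

With real pages, something like `sapply(urls, function(u) extract_lyrics(readLines(u)))` would give one string per song, which can then go into a data frame and be written out with `write.csv`. Returning `NA` when the marker is missing makes pages with a different layout easy to spot.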

Answer 2

I assume that "cleaning the data" means parsing through HTML tags. I recommend using a DOM scraping library that extracts only the lyrics text from the page, and then saving those lyrics to CSV, a database, or wherever you like. That way you wouldn't have to do any data cleaning. I don't know what programming language you're using, but a simple Google search will show you plenty of DOM querying and parsing libraries for any language. Here is an example with PHP:

http://simplehtmldom.sourceforge.net/manual.htm

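include 'simple_html_dom.php'; // Simple HTML DOM library from the manual linked above

$html = file_get_html('http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html');

// Take the element that follows the second div.ringtone (index 1),
// which is assumed here to hold the lyrics
$lyrics = $html->find('div.ringtone', 1)->next_sibling();
print($lyrics->innertext);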

Now you have the lyrics. Save them. (Code not tested.)

If you're using R, use this library: https://github.com/hadley/rvest. You will be able to query the DOM and extract the lyrics easily.
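For R, a rough rvest equivalent of the PHP idea might look like the sketch below. It is untested against the real site: the HTML fragment is made up, and the assumption that the lyrics sit in the div right after a `div.ringtone` is carried over from the PHP example rather than verified. It relies on `minimal_html()`, `html_element()` and `html_text2()` from a recent rvest (1.0 or later).

```r
library(rvest)

# Made-up fragment standing in for the real page (assumption)
page <- minimal_html("
  <div class='ringtone'>ringtone link</div>
  <div>Ridin' down the highway<br>Goin' to a show</div>
")

# Select the first div that follows div.ringtone
lyrics_node <- html_element(
  page,
  xpath = "//div[@class='ringtone']/following-sibling::div[1]"
)

# html_text2() turns <br> into line breaks, so no manual tag cleanup is needed
lyrics <- html_text2(lyrics_node)
cat(lyrics)
```

For a live page, `read_html(url)` would replace `minimal_html(...)`, and the XPath would need to be checked against the page's actual markup.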

2 answers

Answer 1: score 2, accepted, 2015-05-04 01:32:04

Answer 2: score 1, 2015-05-03 23:48:15
