
Reading Audioscrobbler data into R

I'm having problems reading a medium-sized data set into R.

The data set is a database published by Audioscrobbler (now merged with Last.fm) about what music users listened to. The data set is available here, and it contains three files: the main (and biggest) one, with user ID, artist ID, and how many times the user listened to a given artist. The second one has two columns: artist ID and the name of the artist. That's the file I'm having problems with.

The file seems to be badly formatted and I don't know what to do to read it.

I tried this:

test <- scan("artist_data.txt", what=list("numeric", "character"), fill=T)

However, it returns a list in which the data are not properly separated, and it says "Read 18996 records", when I suspect there are more records (though I'm not sure, since I can't read the data!).

Any ideas?

Sorry for not giving an easily reproducible example, but since I can't read the data, I don't know how to provide one. I know this makes it harder for you to answer. You can download the data set, though it may take some time. Sorry again.

This dataset is a total mess!

A few of the problems (listed for anyone more ambitious or knowledgeable who is able to answer this question):

  • strange characters and symbols in the artist names (you'll need to use encoding="UTF-8" when you read the file in)
  • some items even read right-to-left (not sure how to fix that!)
  • several of the artist names have actual tabs in them
  • several of the items have "\\t" in their names, making it hard to do basic searching without first searching for and replacing all of those
  • some of the artist names are spread over more than one line, leading to a line that has only the last part of the artist name (and, yes, word-wrap is OFF)
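
Some of the problems listed above can be located from within R before any manual cleaning. The following is only a rough sketch: `sample_lines` is a made-up stand-in for a few lines of the file, not real data from it. It counts the tabs on each line and flags lines that do not begin with a numeric ID followed by a tab.

```r
# Hypothetical sample standing in for a few lines of artist_data.txt
sample_lines <- c(
  "1\tArtist A",
  "2\tArtist B",
  "3\tArtist C, first part",
  "second part",        # continuation of the previous name
  "4\tArtist\tD"        # extra tab inside the name
)

# Number of tabs on each line (0 for lines without any tab)
tab_hits <- gregexpr("\t", sample_lines, fixed = TRUE)
n_tabs <- lengths(regmatches(sample_lines, tab_hits))

# Lines that do not start with a numeric ID followed by a tab
bad_start <- !grepl("^[0-9]+\t", sample_lines)

odd_tab_lines <- which(n_tabs != 1)   # too few or too many tabs
orphan_lines  <- which(bad_start)     # likely continuation lines
```

For the real file, `sample_lines` could presumably be replaced by `readLines("artist_data.txt", encoding = "UTF-8")`, so that the flagged line numbers can be looked up in a text editor.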

My suggestion is to first do a lot of cleaning up with a good text editor (I used SciTE without any problems). The basic cleaning I had to do to get the entire file to load included removing the extra tabs (there should be just one tab separating the artist ID and the artist name), using some regular expressions to remove lines that did not start with a number, and making sure that all the line endings were the same (the source file has different line endings in certain places).
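
The same three cleaning steps could also be approximated in R rather than in a text editor. This is a minimal sketch, not what I actually ran: `clean_artist_lines` is a hypothetical helper, the sample input is made up, and replacing tabs inside a name with a space is an assumption about what "removing the extra tabs" should mean.

```r
# Hypothetical helper: applies the cleaning steps described above
clean_artist_lines <- function(lines) {
  # Drop stray carriage returns so all line endings are the same
  lines <- gsub("\r", "", lines, fixed = TRUE)
  # Remove lines that do not start with a numeric ID and a tab
  lines <- lines[grepl("^[0-9]+\t", lines)]
  # Split on the first tab only
  id   <- sub("\t.*$", "", lines)
  name <- sub("^[^\t]*\t", "", lines)
  # Assumption: tabs remaining inside the name become a single space
  name <- gsub("\t+", " ", name)
  data.frame(id = as.numeric(id), name = name, stringsAsFactors = FALSE)
}

# Made-up input with a carriage return, an orphan line, and an extra tab
raw <- c("1\tArtist A\r", "2\tArtist B, part one", "part two", "3\tArtist\tC")
cleaned <- clean_artist_lines(raw)
```

Note that dropping lines without a leading number discards the tail of any artist name that was split across lines, just as the regular-expression step in the editor does.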

After that, your best bet might be an iterative loop: load the data, find the problem rows (R should tell you when it encounters an error), fix them in your text editor, and reload, repeating until you get no errors, using:

artist.data <- read.delim("artist_data.txt", header=FALSE, sep="\t", encoding="UTF-8")

I was actually able to open my semi-cleaned text file in Gnumeric, where I spotted a few more problems after sorting the lines in ascending order, but I don't think that step is required.

Even after doing all of this, your dataset will still be a mess, if only because not all the artist names were recorded correctly in the Audioscrobbler database due to poor tag management. Thus, you will likely have artists such as "02Nine ihch Nalis-Heard like".

If anyone can suggest an efficient way of cleaning this data, I'd love to learn it! It seems like it would be useful to know.

This should (might) work:

ad <- readLines(pipe("sed artist_data.txt -e 's!\\x0D!!g'", open="rb"))
library("gsubfn")
addf <- strapply(ad, "^([^\\t]*)\\t(.*)$", c, simplify=rbind)

The first part takes care of the embedded control-M's (carriage returns), and the second tries to split on just the first tab (but not any subsequent ones).

It is not fast at all.
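
If speed matters, the split on the first tab can also be sketched with vectorised base R functions, which is likely to be faster than matching a pattern line by line. This is an untimed sketch under assumptions: `split_first_tab` is a hypothetical helper and the input below is made up; in practice it would be applied to the `ad` vector read above.

```r
# Hypothetical helper: split each line at its first tab only
split_first_tab <- function(lines) {
  # Position of the first tab on each line, -1 if there is none
  pos <- as.integer(regexpr("\t", lines, fixed = TRUE))
  has_tab <- pos > 0
  cbind(
    # Everything before the first tab (the whole line if no tab)
    id   = ifelse(has_tab, substr(lines, 1, pos - 1), lines),
    # Everything after the first tab, later tabs included (NA if no tab)
    name = ifelse(has_tab, substr(lines, pos + 1, nchar(lines)), NA_character_)
  )
}

# Made-up input: an extra tab, a normal line, and a line with no tab
addf2 <- split_first_tab(c("1\tA\tB", "2\tC", "orphan"))
```

The result is a two-column character matrix. Lines without any tab end up with NA in the name column, so they are easy to find afterwards.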
