reading Audioscrobbler data into R

I'm having problems reading a medium-sized data set into R.

The data set is a database published by Audioscrobbler, now merged with Last.fm, about which music users listened to. It is available for download and contains three files. The main (and biggest) one has the user id, the artist id and how many times the user listened to a given artist. The second one has two columns: the artist id and the name of the artist. That's the file I'm having problems with.

The file seems to be badly formatted and I don't know what to do to read it.

I tried this:

test <- scan("artist_data.txt", what=list("numeric", "character"), fill=T)

However, it returns a list, with data not well separated and it says "Read 18996 records", when I suspect there are more records (though I'm not sure, since I can't read the data!).

Any ideas?

Sorry for not giving an easy reproducible example, but since I can't read the data, I don't know how to build one. I know this makes it harder for you to give an answer. You can download the data set yourself, though it may take some time. Sorry again.

This dataset is a total mess!

A few of the problems are (for anyone more ambitious or knowledgeable who is able to answer this question):

  • strange characters and symbols in the artist names (you'll need to use encoding="UTF-8" when you read the file in)
  • some items even read right-to-left (not sure how to fix that!)
  • several of the artist names have actual tabs in them
  • several of the items have "\\t" in their names, making it hard to do basic searching without first searching for and replacing all of those
  • some of the artist names are spread over more than one line, leading to a line that has only the last part of the artist name (and, yes, word-wrap is OFF)

My suggestion is to first do a lot of cleaning up with a good text editor (I used SciTE without any problems). Some of the basic cleaning up that I had to do to get the entire file to load included removing the extra tabs (there should just be one tab separating the artist ID and the artist name), using some regular expressions to remove lines that did not start with a number, and making sure that all the line-endings were the same (the source file has different line-endings in certain places).
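The same clean-up steps could also be scripted in R rather than done by hand in an editor. What follows is only a rough sketch: the helper name `clean_artist_lines` and the sample lines are made up for illustration, and it assumes that every valid record starts with a numeric ID followed by a tab.

```r
# Hypothetical helper (not part of any package): applies the manual
# clean-up steps to a character vector of raw lines.
clean_artist_lines <- function(lines) {
  # Make line endings uniform by dropping stray carriage returns (control-M).
  lines <- gsub("\r", "", lines, fixed = TRUE)
  # Remove lines that do not start with a number followed by a tab.
  lines <- lines[grepl("^[0-9]+\t", lines)]
  # Everything before the first tab is the ID, everything after is the name.
  id   <- sub("\t.*$", "", lines)
  name <- sub("^[^\t]*\t", "", lines)
  # Replace any extra tabs inside the name with a single space.
  name <- gsub("\t+", " ", name)
  paste(id, name, sep = "\t")
}

# Made-up sample lines that imitate the problems listed above.
sample_lines <- c("1\tArtist A\r",
                  "2\tArtist\tB",
                  "rest of a wrapped name",
                  "3\tArtist C")
cleaned <- clean_artist_lines(sample_lines)
print(cleaned)
```

For the real file, the input could come from readLines("artist_data.txt", encoding="UTF-8") and the result could be written back out with writeLines() before calling read.delim(). Note that this drops the wrapped name fragments instead of re-joining them, just like the regular-expression step in the editor.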

After that, your best bet might be loading the data, finding problem rows (R should tell you when it encounters an error), fixing them in your text editor, reloading the data, finding problem rows... until you get no errors, using:

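# quote="" stops quote characters inside artist names from being treated as string delimiters
artist.data <- read.delim("artist_data.txt", header=FALSE, sep="\t", quote="", encoding="UTF-8")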
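To locate the problem rows up front instead of waiting for a load error, base R's count.fields() can report how many tab-separated fields each line has. The sketch below runs on a small temporary file with made-up content, so the file and its lines are assumptions for illustration only.

```r
# Write a few made-up lines to a temporary file to stand in for artist_data.txt.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\tArtist A",
             "2\tArtist\tB",            # extra tab inside the name
             "rest of a wrapped name",  # fragment of a name split over two lines
             "3\tArtist C"), tmp)

# Count tab-separated fields per line; quoting and comment handling are
# switched off so that characters such as " or # in names are taken literally.
# blank.lines.skip=FALSE keeps the positions aligned with the file's line numbers.
n_fields <- count.fields(tmp, sep = "\t", quote = "", comment.char = "",
                         blank.lines.skip = FALSE)

# A well-formed record has exactly two fields: ID and name.
bad_rows <- which(is.na(n_fields) | n_fields != 2)
print(bad_rows)
```

Pointing the same call at "artist_data.txt" should give the line numbers to inspect in the text editor.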

I was actually able to open my semi-cleaned text file in Gnumeric, where I was able to spot a few more problems after I sorted the lines in ascending order, but I don't think that's required.

Even after doing all of this, your dataset will still be a mess, if only because not all the artist names were recorded correctly in the Audioscrobbler database due to poor tag management. Thus, you will likely have artists such as "02Nine ihch Nalis-Heard like".

If anyone can suggest an efficient way of cleaning this data, I'd love to learn it! It seems like it would be useful to know.

This should (might) work:

ad <- readLines(pipe("sed artist_data.txt -e 's!\\x0D!!g'", open="rb"))
library("gsubfn")
addf <- strapply(ad, "^([^\\t]*)\\t(.*)$", c, simplify=rbind)

The first part does take care of the embedded control-M's, and the second tries to split on just the first tab (but not any subsequent ones).

It is not fast at all.
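If speed matters, the same "split on the first tab only" idea can probably be done with vectorised base R functions instead of gsubfn. This is an untimed sketch, not a drop-in replacement: the function name `split_first_tab` is hypothetical and the sample lines are made up.

```r
# Hypothetical helper: splits each line at its first tab only.
split_first_tab <- function(lines) {
  # Drop a trailing carriage return (control-M), if present.
  lines <- sub("\r$", "", lines)
  # Position of the first tab in each line, or -1 when there is none.
  pos <- as.integer(regexpr("\t", lines, fixed = TRUE))
  has_tab <- pos > 0
  # Left part is the ID; lines without a tab are kept whole as the ID.
  id <- ifelse(has_tab, substr(lines, 1, pos - 1), lines)
  # Right part is the name, including any later tabs; NA when there is no tab.
  name <- ifelse(has_tab, substr(lines, pos + 1, nchar(lines)), NA_character_)
  cbind(id = id, name = name)
}

# Made-up sample lines.
parts <- split_first_tab(c("1\tArtist A\r", "2\tArtist\tB", "3\tArtist C"))
print(parts)
```

The result is a two-column character matrix, similar in shape to what strapply(..., simplify=rbind) returns. Later tabs stay inside the name, so they still need cleaning afterwards.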
