I'm having problems reading a medium-sized data set into R.
The data set is a database published by Audioscrobbler (now merged with Last.fm) about which music users listened to. The data set is available here, and it consists of three files: the main (and largest) one, with user ID, artist ID, and how many times the user listened to a given artist. The second one has two columns: artist ID and the name of the artist. That's the file I'm having problems with.
The file seems to be ill-formatted and I don't know what to do to read it.
I tried this:
test <- scan("artist_data.txt", what=list("numeric", "character"), fill=T)
However, it returns a list in which the data are not properly separated, and it says "Read 18996 records", when I suspect there are more records (though I'm not sure, since I can't read the data!).
Any ideas?
Sorry for not giving an easy reproducible example, but since I can't read the data, I don't know how to give a reproducible example (and I know this will make it difficult for you guys to give an answer. But you can download the data set, though it may take some time. Sorry again).
This dataset is a total mess!
A few of the problems (for anyone more ambitious or knowledgeable who is able to answer this question) are: extra tabs inside some lines, lines that don't start with an artist ID, inconsistent line-endings, and non-ASCII characters (so use encoding="UTF-8" when you read the file in).
My suggestion is to first do a lot of cleaning up with a good text editor (I used SciTE without any problems). Some of the basic cleaning up that I had to do to get the entire file to load included removing the extra tabs (there should be just one tab separating the artist ID and the artist name), using some regular expressions to remove lines that did not start with a number, and making sure that all the line-endings were the same (the source file has different line-endings in certain places).
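If you'd rather not do the cleanup by hand, the same steps can be sketched in R itself (file names are my own assumptions; adjust to taste):

```r
# Rough R equivalent of the text-editor cleanup described above:
# normalize line endings, drop lines that don't start with a numeric ID,
# and collapse any run of tabs after the ID into a single tab.
raw <- readLines("artist_data.txt", encoding = "UTF-8", warn = FALSE)
raw <- gsub("\r", "", raw)                # unify line-endings
raw <- raw[grepl("^[0-9]+\t", raw)]       # keep only "ID<tab>..." lines
raw <- sub("^([0-9]+)\t+", "\\1\t", raw)  # one tab between ID and name
writeLines(raw, "artist_data_clean.txt", useBytes = TRUE)
```

This won't catch every problem (e.g. stray tabs inside artist names), but it handles the mechanical cases in one pass.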
After that, your best bet might be loading the data, finding problem rows (R should tell you when it encounters an error), fixing them in your text editor, reloading the data, finding problem rows... until you get no errors, using:
artist.data = read.delim("artist_data.txt", header=F, sep="\t", encoding="UTF-8")
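To speed up the find-fix-reload loop, you can locate suspect rows up front rather than waiting for read.delim to choke. A minimal sketch (assuming the file is in your working directory):

```r
# Flag lines that don't have exactly one tab, or don't start with a numeric
# artist ID -- these are the rows to fix in your text editor.
lines  <- readLines("artist_data.txt", encoding = "UTF-8", warn = FALSE)
n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines)))
bad    <- which(n_tabs != 1 | !grepl("^[0-9]+\t", lines))
head(bad)  # line numbers of the first few problem rows
```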
I was actually able to open my semi-cleaned text file in Gnumeric, where I was able to spot a few more problems after I sorted the lines in ascending order, but I don't think that's required.
Even after doing all of this, your dataset will still be a mess, if only because not all the artist names were recorded correctly in the Audioscrobbler database due to poor tag management. Thus, you will likely have artists such as "02Nine ihch Nalis-Heard like".
If anyone can suggest an efficient way of cleaning this data, I'd love to learn it! It seems like it would be useful to know.
This should (might) work:
ad <- readLines(pipe("sed -e 's!\\x0D!!g' artist_data.txt", open="rb"))
library("gsubfn")
addf <- strapply(ad, "^([^\\t]*)\\t(.*)$", c, simplify=rbind)
The first part takes care of the embedded control-M's (carriage returns), and the second tries to split on just the first tab (but not any subsequent ones).
It is not fast at all.
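If the gsubfn approach is too slow, a base-R sketch that also splits only on the first tab (assuming `ad` is the cleaned character vector from above; the column names are my own):

```r
# Split each "ID<tab>name" line at the first tab only; any later tabs
# stay inside the name. Lines without a tab pass through unchanged.
id   <- sub("^([^\t]*)\t.*$", "\\1", ad)
name <- sub("^[^\t]*\t", "", ad)
addf <- data.frame(artist_id = id, artist_name = name,
                   stringsAsFactors = FALSE)
```

Two vectorized sub() calls over the whole vector are typically much faster than a per-element apply.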