
Transforming Dataset into value matrix

Sorry about the hopeless title..

I have a dataset that looks like:

|userId|movieId|rating|genre1|genre2|
|1     |13     |3.5   |1     |0     |
|1     |412    |2.5   |1     |1     |
|2     |4      |3.0   |0     |1     |
|3     |412    |2.5   |1     |1     |
|4     |13     |4.5   |1     |0     |
|4     |412    |5     |1     |1     |

And so on...

Not every user has rated every movie.

I want to transform this into a matrix that looks like:

|   |1  |2  |3  |4  |
|4  |   |3  |   |   |
|13 |3.5|   |   |4.5|
|412|2.5|   |2.5|5  |

So I have userId as the columns and movieId as the rows with the associated value being the rating given.

What's the best way of doing this?

Edit: The id's are non-sequential. There are 140k users and 28k movies.

If you have many users and many movies, you can easily run out of memory building a dense matrix. For instance, with 1000 users and 1000 different movies you end up with a matrix of 1M entries, most of them missing (since not every user rated every movie).
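To put numbers on this for the sizes mentioned in the question (140k users, 28k movies), here is a rough back-of-the-envelope sketch. The number of ratings (`n_ratings`) is not given in the question, so the value below is purely hypothetical; the sparse estimate assumes the compressed column layout that the Matrix package uses by default (one 8-byte value and one 4-byte row index per stored entry, plus one 4-byte pointer per column), with movies as columns.

```r
n_users  <- 140000
n_movies <- 28000

# Dense numeric matrix: 8 bytes per cell, whether rated or not
dense_bytes <- n_users * n_movies * 8
dense_gb    <- dense_bytes / 1e9

# Sparse estimate: only stored entries cost memory.
# n_ratings is a hypothetical figure, not taken from the question.
n_ratings    <- 2e7
sparse_bytes <- n_ratings * (8 + 4) + (n_movies + 1) * 4
sparse_gb    <- sparse_bytes / 1e9

c(dense_gb = dense_gb, sparse_gb = sparse_gb)
```

Under these assumptions the dense matrix needs tens of gigabytes while the sparse one stays well under one.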

If your dataset is big, a sparseMatrix from the Matrix package is the way to go. If both the user and movie ids are sequential (i.e. they run from 1 to the number of distinct values), building it is straightforward. Using @StevenBeaupré's data:

library(Matrix)
# rows = users, columns = movies; t(mat) gives movies in rows as in the question
mat <- sparseMatrix(i = df$userId, j = df$movieId, x = df$rating)

If the ids are not sequential:

# factor() maps each id to a position 1..n; levels() recovers the original ids
mat <- sparseMatrix(i = as.integer(factor(df$userId)),
                    j = as.integer(factor(df$movieId)),
                    x = df$rating)
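One drawback of converting the ids with `factor()` is that the matrix no longer shows which row or column belongs to which id. A minimal sketch of one way to keep them, assuming you want movies in rows and users in columns as in the question: store the factors first and pass their levels as `dimnames`. The names `movie_f` and `user_f` are just illustrative. Note that entries that are not stored read back as 0, not NA, so this assumes 0 is never a real rating.

```r
library(Matrix)

# Sample data from the answer below (genre columns omitted)
df <- data.frame(
  userId  = c(1L, 1L, 2L, 3L, 4L, 4L),
  movieId = c(13L, 412L, 4L, 412L, 13L, 412L),
  rating  = c(3.5, 2.5, 3, 2.5, 4.5, 5)
)

# Keep the factors so their levels (the original ids) can label the matrix
movie_f <- factor(df$movieId)
user_f  <- factor(df$userId)

mat <- sparseMatrix(
  i = as.integer(movie_f),   # rows: movies
  j = as.integer(user_f),    # columns: users
  x = df$rating,
  dims = c(nlevels(movie_f), nlevels(user_f)),
  dimnames = list(levels(movie_f), levels(user_f))
)

# Look up a rating by the original ids (as character names)
mat["412", "4"]
```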

Most matrix operations (arithmetic, matrix products, row and column sums, subsetting) work on a sparseMatrix too.
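As a small illustration, the sketch below rebuilds the matrix from the sample data (users in rows, movies in columns, as in the code above) and computes the mean rating per movie without ever creating a dense matrix. It assumes that a stored value of 0 never occurs, so that `mat != 0` marks exactly the rated cells; `mean_rating` is just an illustrative name.

```r
library(Matrix)

df <- data.frame(
  userId  = c(1L, 1L, 2L, 3L, 4L, 4L),
  movieId = c(13L, 412L, 4L, 412L, 13L, 412L),
  rating  = c(3.5, 2.5, 3, 2.5, 4.5, 5)
)

# rows = users, columns = movies (movie columns follow sorted ids: 4, 13, 412)
mat <- sparseMatrix(i = as.integer(factor(df$userId)),
                    j = as.integer(factor(df$movieId)),
                    x = df$rating)

# Sum of ratings per movie divided by the number of ratings per movie
mean_rating <- colSums(mat) / colSums(mat != 0)

# Transpose to get movies in rows and users in columns, as in the question
mat_t <- t(mat)
```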

Try

library(dplyr)
library(tidyr)

df %>%
  select(-(genre1:genre2)) %>%
  spread(userId, rating, fill = "")

Which gives:

#  movieId   1 2   3   4
#1       4     3        
#2      13 3.5       4.5
#3     412 2.5   2.5   5
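In current versions of tidyr, `spread()` is superseded by `pivot_wider()`. A minimal sketch of the same reshaping with it is shown below; it leaves missing ratings as `NA` instead of filling them with `""`, so the rating columns stay numeric (filling with an empty string coerces them to text). `names_sort = TRUE` assumes tidyr 1.1.0 or later. As with `spread()`, this builds a dense table, so it only suits data small enough to fit in memory.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  userId  = c(1L, 1L, 2L, 3L, 4L, 4L),
  movieId = c(13L, 412L, 4L, 412L, 13L, 412L),
  rating  = c(3.5, 2.5, 3, 2.5, 4.5, 5)
)

wide <- df %>%
  select(movieId, userId, rating) %>%
  pivot_wider(names_from = userId,    # one column per user
              values_from = rating,   # cell value is the rating
              names_sort = TRUE) %>%  # order user columns by id
  arrange(movieId)                    # order movie rows by id

wide
```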

Data

df <- structure(list(userId = c(1L, 1L, 2L, 3L, 4L, 4L), movieId = c(13L, 
412L, 4L, 412L, 13L, 412L), rating = c(3.5, 2.5, 3, 2.5, 4.5, 
5), genre1 = c(1L, 1L, 0L, 1L, 1L, 1L), genre2 = c(0L, 1L, 1L, 
1L, 0L, 1L)), .Names = c("userId", "movieId", "rating", "genre1", 
"genre2"), class = "data.frame", row.names = c(NA, -6L))
