I have a data.table of events recording, say, user ID, country of residence, and event. E.g.:
library(data.table)
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 country=c(rep(1,4),rep(2,6)),
                 event=1:10, key="user")
As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled - it does not matter to me here). So when I try to summarize the data:
dt[, country[.N] , by=user]
user V1
1: 3 2
2: 4 2
I get the wrong country for user 3. Ideally, I would like to get the most common country for a user and the percentage of time he spent there:
user country support
1: 3 1 0.8
2: 4 2 1.0
How do I do that?
The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table and not data.frame, after all).
Another way:
Edited: table(.) was the culprit, so I changed it to pure data.table syntax.
dt.out<- dt[, .N, by=list(user,country)][, list(country[which.max(N)],
max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), c("country", "support"))
# user country support
# 1: 3 1 0.8
# 2: 4 2 1.0
Using plyr's count function:
library(plyr)  # provides count()
dt[, count(country), by = user][order(-freq),
                                list(country = x[1],
                                     support = freq[1]/sum(freq)),
                                by = user]
# user country support
#1: 4 2 1.0
#2: 3 1 0.8
The idea is to count the countries per user, order by descending frequency, and then take each user's top country together with its share of that user's events.
A smarter answer, thanks to @mnel, that doesn't use extra functions:
dt[, list(freq = .N),
by = list(user, country)][order(-freq),
list(country = country[1],
support = freq[1]/sum(freq)),
by = user]
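For reference, here is a self-contained sketch of the same count-then-rank approach, run step by step on the example data from the question. The intermediate names counts and res are my own, and the final setkey call is an optional addition that only restores ordering by user; it is not part of the answer above.

```r
library(data.table)

# Example data from the question: event 5 puts user 3 in country 2.
dt <- data.table(user    = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event   = 1:10,
                 key     = "user")

# Step 1: one row per (user, country) pair with its event count.
# "counts" is a hypothetical name used only for this illustration.
counts <- dt[, list(freq = .N), by = list(user, country)]

# Step 2: sort by descending count, then keep each user's top country
# and that country's share of the user's events.
res <- counts[order(-freq),
              list(country = country[1],
                   support = freq[1] / sum(freq)),
              by = user]

# Optional (my addition): grouping after order(-freq) returns users in
# order of first appearance in the sorted counts, so re-sort by user.
setkey(res, user)
print(res)
```

Splitting the chain into two steps makes it easy to inspect counts on real data before ranking; on large inputs the chained one-liner avoids keeping the intermediate table around under a name.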