Summarize a data.table with unreliable data

Question

I have a data.table of events recording, say, user ID, country of residence, and event. E.g.:

library(data.table)
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 country=c(rep(1,4),rep(2,6)),
                 event=1:10, key="user")

As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled; it does not matter to me here). So when I try to summarize the data by taking each user's last recorded country:

dt[, country[.N] , by=user]
   user V1
1:    3  2
2:    4  2

I get the wrong country for user 3. Ideally, I would like to get the most common country for each user and the fraction of that user's events recorded there:

   user country support
1:    3       1     0.8
2:    4       2     1.0

How do I do that?

The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table and not data.frame after all).

Answer 1

Another way:

Edited: table(.) was the culprit, so I changed it to pure data.table syntax.

# Count events per (user, country), then keep each user's most frequent
# country and its share of that user's events.
dt.out <- dt[, .N, by=list(user, country)][,
             list(country = country[which.max(N)],
                  support = max(N)/sum(N)),
             by=user]
#    user country support
# 1:    3       1     0.8
# 2:    4       2     1.0
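
One thing to be aware of with this approach: which.max returns the position of the first maximum, so when a user has two countries tied for most frequent, the country that appears first in the grouped counts is the one reported. A minimal sketch of that behavior follows; the table dt_tie and the names counts and res_tie are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical data: user 5 has two events in country 1 and two in country 2.
dt_tie <- data.table(user = rep(5, 4),
                     country = c(1, 1, 2, 2),
                     event = 1:4)

# Count events per (user, country); groups keep order of first appearance.
counts <- dt_tie[, .N, by = list(user, country)]

# which.max picks the first of the tied maxima, i.e. country 1 here.
res_tie <- counts[, list(country = country[which.max(N)],
                         support = max(N) / sum(N)),
                  by = user]
print(res_tie)
```

If ties matter for your data, you would need to decide on an explicit tie-breaking rule rather than rely on row order.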

Answer 2

Using plyr's count function:

library(plyr)  # provides count(), which returns columns x and freq
dt[, count(country), by = user][order(-freq),
                                list(country = x[1],
                                     support = freq[1]/sum(freq)),
                                by = user]
#   user country support
#1:    4       2     1.0
#2:    3       1     0.8

The idea is to count the countries per user, order by descending frequency, and then take the top country and its share of the events for each user.

A smarter answer, thanks to @mnel, that doesn't use extra functions:

dt[, list(freq = .N),
     by = list(user, country)][order(-freq),
                               list(country = country[1],
                                    support = freq[1]/sum(freq)),
                               by = user]
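
To make the two steps of this chain easier to inspect, the same computation can be split into intermediate tables. This is only a sketch of the idea described above, run on the question's sample data; the names counts and result are illustrative, and the final setkey call is an optional addition that sorts the output by user.

```r
library(data.table)

dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")

# Step 1: count events per (user, country) pair.
counts <- dt[, list(freq = .N), by = list(user, country)]

# Step 2: sort so the most frequent rows come first, then take the first
# country per user and its share of that user's events.
result <- counts[order(-freq),
                 list(country = country[1],
                      support = freq[1] / sum(freq)),
                 by = user]

# Optional: sort the summary by user instead of by frequency.
setkey(result, user)
print(result)
```

Printing counts before the second step is a quick way to check how noisy each user's country field is.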

solution1
7 ACCEPTED 2013-04-24 20:00:44

solution2
4 2013-04-24 19:53:44
