Count Order of Factor Level Occurrence

My transactional dataset contains several variables, including an ID number, a date, and a transaction area (a factor):

    id<-as.integer(rep(c(1,2,3,4,5),times=20))
    date<-rep(seq(as.Date("2011-07-01"),by="day", length.out=100))
    category<-rep(as.factor(letters[seq( from = 1, to = 4 )]),times=25)

    transactions<-data.frame(id, date, category)

    head(transactions)
      id       date category
       1 2011-07-01        a
       2 2011-07-02        b
       3 2011-07-03        c
       4 2011-07-04        d
       5 2011-07-05        a
       1 2011-07-06        b

What I would like to do, on a per-ID basis, is determine the order in which each factor level first appears, without recounting a level that has already appeared.

    library(dplyr)
    # Pseudocode: `(solution)` marks the missing per-ID ordering logic
    solution <- transactions %>%
                group_by(id, date) %>%
                mutate(category_order = (solution))

so that I get:

     head(transactions)
      id       date category category_order
       1 2011-07-01        a     1
       1 2011-07-06        b     2
       1 2011-07-11        c     3
       1 2011-07-16        d     4
       1 2011-07-21        a     1
       1 2011-07-26        b     2

For each ID, if a category repeats, it must keep the same order value. In the example above, a is always 1st, b is always 2nd, and so on.

What I ultimately want is to count how many times each category is 1st, 2nd, 3rd, etc., giving a frequency distribution of how often a is 1st, b is 1st, a is 2nd, and so on:

    head(transactions)
      category category_order category_order_count
             a              1                    5
             a              2                    3
             a              3                    5
             a              4                    4
             b              1                    5
             b              2                    2

It's probably not complicated, but I'm having a mental block: it essentially involves ranking categories per ID without recounting a repeated factor level, then summarizing per ID, and finally summarizing per category.

Within each id, you can set the levels of the factor to the order in which they appear in that group, then convert the factor to its numeric codes with as.numeric to form the new variable. This relies on the row order of the dataset, so if the rows aren't already sorted, arrange by id and date first.

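    library(dplyr)

    transactions %>%
        arrange(id, date) %>%   # make "first appearance" well defined
        group_by(id) %>%
        # levels follow first appearance within each id, so the codes are the rank
        mutate(category_order = as.numeric(factor(category, levels = unique(category))))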
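
To see why this works, here is a minimal base-R sketch on a made-up vector (`x` is illustrative data, not taken from the question). `unique()` returns values in order of first appearance, so using them as the levels makes the factor's integer codes equal to the first-appearance rank. `match(x, unique(x))` should give the same result without building a factor.

```r
# Hypothetical category sequence for a single id (illustrative data only)
x <- c("b", "a", "b", "c", "a")

# unique() keeps values in order of first appearance: "b" "a" "c"
lv <- unique(x)

# The integer codes of the factor follow the order of its levels
via_factor <- as.integer(factor(x, levels = lv))

# Equivalent shortcut: position of each value among the unique values
via_match <- match(x, lv)

print(via_factor)  # [1] 1 2 1 3 2
```

Repeated values keep the rank they received on first appearance, which is the behaviour the question asks for.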

This can also be done with data.table. We convert the 'data.frame' to a 'data.table' with setDT(transactions), order by 'id' and 'date' in the 'i' argument, group by 'id', convert 'category' to a factor whose levels are the unique elements of 'category' in order of appearance, coerce it to 'integer', and assign (:=) the result to 'category_order'.

    library(data.table)
    setDT(transactions)[order(id, date), category_order := as.integer(factor(category,
                levels = unique(category))) , by = id]

    head(transactions)
    #   id       date category category_order
    #1:  1 2011-07-01        a              1
    #2:  2 2011-07-02        b              1
    #3:  3 2011-07-03        c              1
    #4:  4 2011-07-04        d              1
    #5:  5 2011-07-05        a              1
    #6:  1 2011-07-06        b              2
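
Neither answer covers the final step the question asks for, the frequency of each category at each position. The following is one possible sketch in base R, used here instead of dplyr or data.table so that it runs without extra packages. The names `first_seen` and `freq` are illustrative, and the column name `category_order_count` is taken from the question's desired output.

```r
# Rebuild the example data from the question
id <- as.integer(rep(1:5, times = 20))
date <- seq(as.Date("2011-07-01"), by = "day", length.out = 100)
category <- rep(factor(letters[1:4]), times = 25)
transactions <- data.frame(id, date, category)

# Sort so that "first appearance" is well defined within each id
transactions <- transactions[order(transactions$id, transactions$date), ]

# Per-id first-appearance rank, computed on the factor's integer codes
transactions$category_order <- ave(
  as.integer(transactions$category), transactions$id,
  FUN = function(x) match(x, unique(x))
)

# Keep one row per (id, category) so repeats are not counted twice
first_seen <- unique(transactions[, c("id", "category", "category_order")])

# Count how many ids have each category at each position;
# table() also lists combinations whose count is 0
freq <- as.data.frame(
  table(category = first_seen$category,
        category_order = first_seen$category_order),
  responseName = "category_order_count"
)

print(freq)
```

With this sample data every id sees all four categories, so the counts sum to 20 (5 ids × 4 categories) and will not match the illustrative figures in the question. In a dplyr pipeline, `distinct()` followed by `count()` would likely play the same roles as `unique()` and `table()` here.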
