How do I count the frequency of a value in a data frame based on a factor level in R?

I have a legal dataset in which every column is stored as a factor:

> str(df)
'data.frame':   2101 obs. of  4 variables:
 $ specialty: Factor w/ 5 levels "Real Estate",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ col1     : Factor w/ 161 levels "10060","11404",..: 95 40 72 52 72 72 72 161 161 161 ...
 $ col2     : Factor w/ 138 levels "0277T","11602",..: 63 18 76 29 138 50 138 138 138 138 ...
 $ col3     : Factor w/ 106 levels "10061","10160",..: 44 58 106 51 106 58 106 106 106 106 ...

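Columns col1 to col3 contain 5-digit codes that correspond to specific legal procedures. A code can repeat within a column or across columns. The codes are stored as factors. I am interested in the frequencies of a set of 7 codes: [49585, 44310, 44320, 38564, 44125, 44150, 49419]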

> head(df)
   specialty   col1  col2  col3
1 Bankruptcy  49585 49000 44950
2 Tort        44140 38564 49255
3 Real Estate 49000 49419  NULL
4 Bankruptcy  44310 44120 49000
5 Real Estate 49000  NULL  NULL
6 Tort        49000 44950 49255

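However, I only want the code frequencies associated with two particular levels of the specialty column: "Tort" and "Real Estate". This is tricky because the columns are factors. How can I get the count of each code in the set only when it occurs in the same row as one of those two levels?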

Expected output:

**Counts**        49585  44310  44320  38564  44125  44150  49419
Tort                 12    230    232      1     21      2     23
Real Estate         280     50     40     92    121     12    726

You may need:

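library(reshape2)

# Keep only the rows for the two specialties of interest
df1 <- subset(df, specialty %in% c('Real Estate', 'Tort'))
# Reshape to long format and drop the 'variable' column (col1/col2/col3)
dM <- melt(df1, id.var='specialty')[,-2]
# Rebuild the factors so that unused levels (e.g. Bankruptcy) are dropped
dM[] <- lapply(dM, factor)
table(dM)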
#             value
#specialty     38564 44140 44950 49000 49255 49419 NULL
# Real Estate     0     0     0     2     0     1    3
# Tort            1     1     1     1     2     0    0

Or

# Melt and cast in one step, counting occurrences of each code per specialty
res <- recast(df1, specialty ~ value, id.var = 'specialty', fun.aggregate = length)
res
#    specialty 38564 44140 44950 49000 49255 49419 NULL
#1 Real Estate     0     0     0     2     0     1    3
#2        Tort     1     1     1     1     2     0    0

# Keep only the codes that have a nonzero count in every specialty
res[c(TRUE,!colSums(!res[-1]))]
#    specialty 49000
#1 Real Estate     2
#2        Tort     1
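
The `!colSums(!res[-1])` idiom is compact but easy to misread. The sketch below rebuilds the `res` table by hand (so it runs without reshape2) and compares the idiom with a more explicit test; the names `keep_idiom` and `keep_explicit` are illustrative and not part of the original answer.

```r
# Hand-built copy of the `res` table printed above.
counts <- rbind(c(0, 0, 0, 2, 0, 1, 3),
                c(1, 1, 1, 1, 2, 0, 0))
colnames(counts) <- c("38564", "44140", "44950", "49000", "49255", "49419", "NULL")
res <- data.frame(specialty = c("Real Estate", "Tort"), counts,
                  check.names = FALSE)

# !res[-1] is TRUE wherever a count is zero; colSums() counts the zeros in
# each column; the outer ! keeps the columns that contain no zeros at all.
keep_idiom <- !colSums(!res[-1])

# The same test spelled out: the code occurs at least once in every specialty.
keep_explicit <- vapply(res[-1], function(x) all(x > 0), logical(1))

# The leading TRUE always keeps the `specialty` column itself.
res[c(TRUE, keep_explicit)]
```

Both versions select the same columns; the explicit form may be easier to adapt, for example to `any(x > 0)` if codes seen in at least one specialty are wanted.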

Data

df <- structure(list(specialty = structure(c(1L, 3L, 2L, 1L, 2L, 3L
), .Label = c("Bankruptcy", "Real Estate", "Tort"), class = "factor"), 
col1 = structure(c(4L, 1L, 3L, 2L, 3L, 3L), .Label = c("44140", 
"44310", "49000", "49585"), class = "factor"), col2 = structure(c(4L, 
1L, 5L, 2L, 6L, 3L), .Label = c("38564", "44120", "44950", 
"49000", "49419", "NULL"), class = "factor"), col3 = structure(c(1L, 
3L, 4L, 2L, 4L, 3L), .Label = c("44950", "49000", "49255", 
"NULL"), class = "factor")), .Names = c("specialty", "col1", 
"col2", "col3"), row.names = c("1", "2", "3", "4", "5", "6"),
class = "data.frame")
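
The tables above count every code that occurs, whereas the question asks only about a fixed set of seven codes. A minimal base R sketch of one way to restrict the columns to that set follows. It assumes the six sample rows from `head(df)`; the object names `toy`, `codes`, `groups`, `stacked` and `counts` are illustrative. Passing explicit `levels` to `factor()` also keeps codes with zero occurrences as columns, which matches the layout of the expected output.

```r
# Toy data mirroring head(df) from the question; every column is a factor.
toy <- data.frame(
  specialty = c("Bankruptcy", "Tort", "Real Estate",
                "Bankruptcy", "Real Estate", "Tort"),
  col1 = c("49585", "44140", "49000", "44310", "49000", "49000"),
  col2 = c("49000", "38564", "49419", "44120", "NULL", "44950"),
  col3 = c("44950", "49255", "NULL", "49000", "NULL", "49255"),
  stringsAsFactors = TRUE
)

codes  <- c("49585", "44310", "44320", "38564", "44125", "44150", "49419")
groups <- c("Tort", "Real Estate")

# Keep only the rows for the two specialties of interest.
sub <- toy[toy$specialty %in% groups, ]

# Stack the code columns as character so that labels, not the factors'
# internal integer indices, are compared.
code_cols <- c("col1", "col2", "col3")
stacked <- data.frame(
  specialty = rep(as.character(sub$specialty), times = length(code_cols)),
  code      = unlist(lapply(sub[code_cols], as.character), use.names = FALSE)
)

# Codes outside `codes` become NA and are dropped by table();
# codes in the set that never occur stay as zero-count columns.
counts <- table(
  specialty = factor(stacked$specialty, levels = groups),
  code      = factor(stacked$code, levels = codes)
)
counts
```

On these six rows only 38564 (Tort) and 49419 (Real Estate) have nonzero counts; on the full dataset the same code should give one row per specialty and one column per code of interest.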
