
How to summarize the data by factor levels in R

I have the following data, which I want to summarize (min / max / mean / median / mode / sd) by factor level. The factor here is cluster.kmeans.

head(MS.DATA.IMPVAR.KMEANS,10)
     subscribers   arpu     handset3g    mou     rechargesum  cluster.kmeans
 1       105822 197704.10     19040 2854801.0      235430              5
 2        18210  34799.21      2856  419109.0       39820              6
 3        71351 133842.38     13056 2021183.0      157099              3
 4        44975 104681.58      9439 1303220.6      121697              2
 5        75860 133190.55     12605 1714640.8      144262              5
 6        63740 119389.91     11067 1651303.2      143333              1
 7        59368 117792.03     11747 1690910.7      136902              5
 8        40064  80427.09      7217  886214.5       89226              2
 9        51966  99385.52      9972 1407985.7      117353              5
 10       70811 141131.66     12362 1373104.7      158206              4

I tried using dplyr, with the following result:

s_kmeans <- MS.DATA.IMPVAR.KMEANS %>% group_by(cluster.kmeans) %>% summarise_all(c("mean", "median", "min", "max", "sd"))
s_kmeans <- gather(s_kmeans, key, value, -cluster.kmeans)   
s_kmeans$variable <- sapply(strsplit(s_kmeans$key, "_"), `[`,1)    
s_kmeans$stat <- sapply(strsplit(s_kmeans$key, "_"), `[`, 2)    
MS.DATA.STATS.KMEANS <- select(s_kmeans, -key) %>% spread(key = stat, value = value)

head(MS.DATA.STATS.KMEANS)
# A tibble: 6 × 7
   cluster.kmeans    variable       max      mean    median       min
           <fctr>       <chr>     <dbl>     <dbl>     <dbl>     <dbl>
 1              1        arpu  250153.5 164652.99 163718.33  88306.53
 2              1   handset3g   21809.0  13736.38  13598.00   6936.00
 3              1         mou 1143639.1 338834.54 313010.20 116523.59
 4              1 rechargesum  270169.0 173397.03 171897.00  89080.00
 5              1 subscribers   41428.0  26515.01  26321.00  13794.00
 6              2        arpu  163566.9  84552.09  82402.23  29477.03

I would like to do this some other way that needs fewer lines of code and does not use dplyr, for example with base R functions such as by, aggregate, etc.

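It is not clear whether fewer lines of code or base R is the priority. However, with the current Hadleyverse idiom, we can keep everything inside a single %>% chain and use separate instead of the two sapply steps to make the code more compact: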

library(dplyr)
library(tidyr)
MS.DATA.IMPVAR.KMEANS %>%
    group_by(cluster.kmeans) %>%
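    # funs() is deprecated since dplyr 0.8.0; a named list yields the same <variable>_<stat> column names
    summarise_all(list(mean = mean, median = median, min = min, max = max, sd = sd)) %>%
    gather(key, value, -cluster.kmeans) %>%
    # split e.g. "arpu_mean" into variable = "arpu" and stats = "mean"
    separate(key, into = c("variable", "stats"), sep = "_") %>%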
    spread(stats, value)
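
If base R is the priority instead, a minimal sketch with aggregate might look like the following. It is only an illustration under stated assumptions: `df` is a stand-in built from the ten rows printed in the question, and `stats` is a hypothetical helper name, not something from the original post. The result is in wide format (one column per variable/statistic pair) rather than the long format produced by the dplyr version, and mode is left out because base R has no built-in function for the statistical mode.

```r
# Stand-in for MS.DATA.IMPVAR.KMEANS: the ten rows shown in the question
df <- data.frame(
  subscribers = c(105822, 18210, 71351, 44975, 75860, 63740, 59368, 40064, 51966, 70811),
  arpu        = c(197704.10, 34799.21, 133842.38, 104681.58, 133190.55,
                  119389.91, 117792.03, 80427.09, 99385.52, 141131.66),
  handset3g   = c(19040, 2856, 13056, 9439, 12605, 11067, 11747, 7217, 9972, 12362),
  mou         = c(2854801.0, 419109.0, 2021183.0, 1303220.6, 1714640.8,
                  1651303.2, 1690910.7, 886214.5, 1407985.7, 1373104.7),
  rechargesum = c(235430, 39820, 157099, 121697, 144262, 143333, 136902, 89226, 117353, 158206),
  cluster.kmeans = factor(c(5, 6, 3, 2, 5, 1, 5, 2, 5, 4))
)

# Hypothetical helper: all statistics for one numeric vector, returned as a named vector
stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
}

# Apply stats() to every other column within each level of cluster.kmeans.
# Each summarised column comes back as a matrix with one column per statistic.
res <- aggregate(. ~ cluster.kmeans, data = df, FUN = stats)

# Flatten the matrix columns into ordinary columns named <variable>.<stat>
wide <- do.call(data.frame, res)
print(wide)
```

Note that sd is NA for any cluster with a single row, which happens in this ten-row sample but should not in the full data set.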
