
Median from multiple rows and columns in a data table with grouping

I have a data table with over 90,000 observations and 1,201 variables. All columns except the last store numeric values; the last column holds the names of the source files (over 100). Here is a small sample of the data table:

library(data.table)
DT <- data.table(V1=sample(0:100,20,replace=TRUE), 
V2=sample(0:100,20,replace=TRUE), V3=sample(0:100,20,replace=TRUE), 
V4=sample(0:100,20,replace=TRUE), V5=sample(0:100,20,replace=TRUE), 
V6=sample(0:100,20,replace=TRUE), V7=sample(0:100,20,replace=TRUE), 
file=rep(c("A","B","C","D"), each = 5))

What I want to do is calculate the median of ALL values in each group (file). So, e.g., for group A the median would be calculated from every value in rows 1, 2, 3, 4 and 5 at once (5 rows by 7 columns, so 35 values). In the next step, I would like to assign the medians to each of the rows according to its group (expected output below).

The question seems simple, and I have googled many similar questions about median/mean calculation by group (aggregate is one of the most popular solutions). However, in all of them only one column is taken into account for the median calculation. Here there are 7 columns (1200 in my original data), and median does not accept that: I have to provide a numeric vector. I have therefore experimented with unlist, aggregate, the dplyr package and tapply, without any luck...

Due to the amount of data and the number of groups (i.e. file), the code should be fairly automatic and efficient... I would really appreciate your help!

Just a small example of the code, which obviously failed:

DT_median <- setDT(DT)[, DT_med := median(DT[,1:7]), by = file]

The expected result should look like this:

V1  V2  V3  V4  V5  V6  V7  file DT_med
42  78  9   0   60  46  65  A    37.5
36  36  46  45  5   96  64  A    37.5
83  31  92  100 15  2   9   A    37.5
36  16  49  82  32  4   46  A    37.5
29  17  39  6   62  52  97  A    37.5
37  70  17  90  8   10  93  B    47
72  62  68  83  96  77  20  B    47
10  47  29  2   93  16  30  B    47
69  87  7   47  96  17  8   B    47
23  70  72  27  10  86  49  B    47
78  51  13  33  56  6   39  C    51
28  92  100 5   75  33  17  C    51
71  82  9   20  34  83  22  C    51
62  40  84  87  37  45  34  C    51
55  80  55  94  66  96  12  C    51
93  1   99  97  7   77  6   D    41
53  55  71  12  19  25  28  D    41
27  25  28  89  41  22  60  D    41
91  25  25  57  21  98  27  D    41
2   63  17  53  99  65  95  D    41

Since we want the median of all the values grouped by 'file', unlist the Subset of Data.table (.SD), take the median, and assign (:=) the result to create the new column 'DT_med':

library(data.table)
DT[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file]
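
As a quick sanity check, the same idea can be tried on a table small enough to verify by hand. This is only a sketch: the table toy and the helper vector num_cols are made up for illustration, and it assumes that every column other than 'file' is numeric and should be pooled into the median.

```r
library(data.table)

# Hypothetical toy table: two numeric columns plus the grouping column
toy <- data.table(
  V1   = c(1, 2, 3, 4),
  V2   = c(10, 20, 30, 40),
  file = c("A", "A", "B", "B")
)

# Assumption: every column except 'file' is numeric and should be pooled
num_cols <- setdiff(names(toy), "file")

# Pool all values of each group into one vector, then take its median
toy[, DT_med := median(unlist(.SD), na.rm = TRUE), by = file, .SDcols = num_cols]

# Hand check: group A pools 1, 2, 10, 20, so its median is (2 + 10) / 2 = 6;
# group B pools 3, 4, 30, 40, so its median is (4 + 30) / 2 = 17
print(toy)
```

Listing the columns through .SDcols is optional here, but it keeps DT_med itself out of the pooled values if the line is run a second time, because by default .SD contains every column except the grouping one.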
