Using the split function in R

Question

I am trying to simulate three small datasets, each containing x1, x2, x3, x4, trt and IND.

However, when I try to split the simulated data by IND using split in R, I get a warning message and the output is incorrect. Could someone please give me a hint about what I did wrong in my R code?

# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run

# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed   #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################

p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
             (1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
                exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))

p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
             (1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
                exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))

p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
          exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))

# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)

IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)

Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")

Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)

The warning message:

> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
  data length is not a multiple of split variable

and the output becomes

$`2`
            x1          x2          x3         x4 trt IND
141  1.0894068  0.09765185 -0.46702047  0.4049424   A   3
145 -1.2953113 -1.94291045  0.09926239 -0.5338715   A   3
148  0.0274979  0.72971804  0.47194731 -0.1963896   A   3

$`3`
[1] x1  x2  x3  x4  trt IND
<0 rows> (or 0-length row.names)

I have checked my R code several times, but I can't figure out what I did wrong. Many thanks in advance.

Answer 1

IND is the global variable built for the full data, sim, so it has one entry per row of sim rather than one per row of Aset. You want to use the IND column of the subset itself, e.g.

Anew <- split(Aset, f = Aset$IND)
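As a minimal sketch of the same fix on made-up data (the names toy, Aset_toy and Anew_toy are hypothetical and not from the question), splitting on the subset's own IND column keeps the grouping variable aligned with the rows being split:

```r
# Hypothetical toy data standing in for `sim`; the values are illustrative only.
toy <- data.frame(
  x1  = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  trt = c("A", "B", "A", "A", "B", "A"),
  IND = c(1, 1, 2, 2, 3, 3)
)

# Keep only treatment A: 4 rows remain, with IND = 1, 2, 2, 3.
Aset_toy <- subset(toy, trt == "A")

# Split on the subset's own IND column, so f has exactly one value per row.
Anew_toy <- split(Aset_toy, f = Aset_toy$IND)

# Number of rows that ended up in each group.
group_sizes <- sapply(Anew_toy, nrow)
print(group_sizes)
```

Each element of Anew_toy then contains only rows whose IND matches the element's name, which is not the case in the output shown in the question.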

Answer 2

It's a warning, not an error, which means split executed successfully, but it may not have done what you wanted. From the "Details" section of the help file:

f is recycled as necessary and if the length of x is not a multiple of the length of f a warning is printed. Any missing values in f are dropped together with the corresponding values of x.

Try checking the length of your IND against the number of rows in your data frame.
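A small sketch of that length check, using made-up vectors x and f rather than the question's data: when f is longer than x, split still runs, warns, and groups by only the leading values of f, which can leave later groups empty.

```r
# Illustrative vectors only: x plays the role of the rows being split,
# f plays the role of a grouping variable built for a larger dataset.
x <- 1:4
f <- rep(1:3, each = 2)   # length 6, longer than x

# The mismatch that the help file warns about.
lengths_match <- length(x) == length(f)
print(lengths_match)

# Run split, record whether it warned, and suppress the warning itself.
warned <- FALSE
res <- withCallingHandlers(
  split(x, f),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# Sizes of the resulting groups; the last level of f receives no elements.
print(lengths(res))
```

This mirrors the pattern in the question's output, where one group holds rows from a different IND and the last group has zero rows.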

Answer 3

I'm not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package.

> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
  trt IND      x1mean      x2sum     x3min     x4max
1   A   1 -0.49356448 -1.5650528 -1.016615 2.0027822
2   A   2  0.05908053  5.1680463 -1.514854 0.8184445
3   A   3  0.22898716  1.8584443 -1.934188 1.6326763
4   B   1  0.01531230  1.1005720 -2.002830 2.6674931
5   B   2  0.17875088  0.2526760 -1.546043 1.2021935
6   B   3  0.13398967 -4.8739380 -1.565945 1.7887837
7   C   1 -0.16993037 -0.5445507 -1.954848 0.6222546
8   C   2 -0.04581149 -6.3230167 -1.491114 0.8714535
9   C   3 -0.41610973  0.9085831 -1.797661 2.1174894
>

Here you can replace summarise and the arguments that follow it with any function that returns a data.frame or something that can be coerced to one. If lists are the target, ldply is your friend.
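For comparison, here is a base-R sketch of the same group-by-two-variables idea without plyr. It is an alternative technique, not part of the answer above, and the data frame toy and the names groups and x1_means are hypothetical. Passing a list of columns to split groups by every observed trt/IND combination in one step:

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  x1  = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  trt = c("A", "B", "A", "A", "B", "A"),
  IND = c(1, 1, 2, 2, 3, 3)
)

# Split by treatment and replicate together; both grouping columns come from
# `toy` itself, so their lengths match the number of rows.
# drop = TRUE discards combinations that never occur (here B with IND 2).
groups <- split(toy, list(toy$trt, toy$IND), drop = TRUE)

# One summary value per group, analogous to a single summarise column.
x1_means <- sapply(groups, function(d) mean(d$x1))
print(x1_means)
```

The list elements are named by combining the two grouping values with a dot, for example "A.2" for treatment A in replicate 2.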
