简体   繁体   English

在R中使用split函数

[英]Using split function in R

I am trying to simulate three small datasets, which contains x1 , x2 , x3 , x4 , trt and IND . 我试图模拟三个小数据集,其中包含x1x2x3x4trtIND

However, when I try to split simulated data by IND using "split" in RI get Warning messages and outputs are correct. 但是,当我尝试使用RI中的“拆分”来分割模拟数据时,获取警告消息和输出是正确的。 Could someone please give me a hint what I did wrong in my R code? 有人可以给我一个暗示我在R代码中做错了什么吗?

# Step 2: simulate data
Alpha = 0.05
S = 3 # number of replicates
x = 8 # number of covariates
G = 3 # number of treatment groups
N = 50 # number of subjects per dataset
tot = S*N # total subjects for a simulation run

# True parameters
alpha = c(0.5, 0.8) # intercepts
b1 = c(0.1,0.2,0.3,0.4) # for pi_1 of trt A
b2 = c(0.15,0.25,0.35,0.45) # for pi_2 of trt B
b = c(1.1,1.2,1.3,1.4);
##############################################################################
# Scenario 1: all covariates are independent standard normally distributed   #
##############################################################################
set.seed(12)
x1 = rnorm(n=tot, mean=0, sd=1);x2 = rnorm(n=tot, mean=0, sd=1);
x3 = rnorm(n=tot, mean=0, sd=1);x4 = rnorm(n=tot, mean=0, sd=1);
###############################################################################

p1 = exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4)/
             (1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
                exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))

p2 = exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4)/
             (1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
                exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))

p3 = 1/(1+exp(alpha[1]+b1[1]*x1+b1[2]*x2+b1[3]*x3+b1[4]*x4) +
          exp(alpha[2]+b2[1]*x1+b2[2]*x2+b2[3]*x3+b2[4]*x4))

# To assign subjects to one of treatment groups based on response probabilities
tmp = function(x){sample(c("A","B","C"), 1, prob=x, replace=TRUE)}
trt = apply(cbind(p1,p2,p3),1,tmp)

IND=rep(1:S,each=N) #create an indicator for split simulated data
sim=data.frame(x1,x2,x3,x4,trt, IND)

Aset = subset(sim, trt=="A")
Bset = subset(sim, trt=="B")
Cset = subset(sim, trt=="C")

Anew = split(Aset, f = IND)
Bnew = split(Bset, f = IND)
Cnew = split(Cset, f = IND)

The warning message: 警告信息:

> Anew = split(Aset, f = IND)
Warning message:
In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) :
  data length is not a multiple of split variable

and the output becomes 并且输出变为

$`2`
            x1          x2          x3         x4 trt IND
141  1.0894068  0.09765185 -0.46702047  0.4049424   A   3
145 -1.2953113 -1.94291045  0.09926239 -0.5338715   A   3
148  0.0274979  0.72971804  0.47194731 -0.1963896   A   3

$`3`
[1] x1  x2  x3  x4  trt IND
<0 rows> (or 0-length row.names)

I have checked my R code several times however, I can't figure out what I did wrong. 我已经多次检查了我的R代码,但我无法弄清楚我做错了什么。 Many thanks in advance 提前谢谢了

IND is the global variable for the full data, sim . IND是完整数据的全局变量sim You want to use the specific one for the subset, eg 您希望将特定的一个用于子集,例如

Anew <- split(Aset, f = Aset$IND)

It's a warning, not an error, which means split executed successfully, but may not have done what you wanted to do. 这是一个警告,而不是错误,这意味着成功执行split ,但可能没有完成您想要做的事情。 From the "details" section of the help file: 从帮助文件的“详细信息”部分:

f is recycled as necessary and if the length of x is not a multiple of the length of fa warning is printed. f根据需要进行再循环,如果x的长度不是fa警告长度的倍数,则打印出来。 Any missing values in f are dropped together with the corresponding values of x. f中的任何缺失值都与x的相应值一起删除。

Try checking the length of your IND against the size of your dataframe, maybe. 尝试检查IND的长度与数据帧的大小。

Not sure what your goal is once you have your data split, but this sounds like a good candidate for the plyr package. 一旦你的数据被拆分,不确定你的目标是什么,但这听起来像是plyr包的一个很好的候选者。

> library(plyr)
> ddply(sim, .(trt,IND), summarise, x1mean=mean(x1), x2sum=sum(x2), x3min=min(x3), x4max=max(x4))
  trt IND      x1mean      x2sum     x3min     x4max
1   A   1 -0.49356448 -1.5650528 -1.016615 2.0027822
2   A   2  0.05908053  5.1680463 -1.514854 0.8184445
3   A   3  0.22898716  1.8584443 -1.934188 1.6326763
4   B   1  0.01531230  1.1005720 -2.002830 2.6674931
5   B   2  0.17875088  0.2526760 -1.546043 1.2021935
6   B   3  0.13398967 -4.8739380 -1.565945 1.7887837
7   C   1 -0.16993037 -0.5445507 -1.954848 0.6222546
8   C   2 -0.04581149 -6.3230167 -1.491114 0.8714535
9   C   3 -0.41610973  0.9085831 -1.797661 2.1174894
> 

Where you can substitute summarise and its following arguments for any function that returns a data.frame or something that can be coerced to one. 您可以在其中为任何返回data.frame函数或可以强制转换为一个的函数替换summarise及其后续参数。 If lists are the target, ldply is your friend. 如果列表是目标, ldply就是你的朋友。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM