
Error in group_by function in dplyr

I've looked through the related dplyr questions and the R documentation, and tried to sort out what I believe is a syntax misunderstanding.

Here is sample data that reflects the structure of my data.

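library(dplyr)  # provides %>%, group_by() and summarise()

id <- 1:20  # recycled five times by data.frame() to fill 100 rows
xvar <- seq(from = 2.0, to = 6.0, length.out = 100)
yvar <- 1:100
binary <- sample(x = c(0, 1), size = 100, replace = TRUE)

# Cut yvar into ten bins: (0,11], (11,21], ..., (91,100]
breaks <- c(0, 11, 21, 31, 41, 51, 61, 71, 81, 91, 100)
df <- data.frame(id, xvar, yvar, binary)
df <- transform(df, bin = cut(yvar, breaks))
head(df)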

  id     xvar yvar binary    bin
1  1 2.000000    1      1 (0,11]
2  2 2.040404    2      0 (0,11]
3  3 2.080808    3      0 (0,11]
4  4 2.121212    4      0 (0,11]
5  5 2.161616    5      1 (0,11]
6  6 2.202020    6      0 (0,11]

I'd like to run the following, testing within each bin group whether the xvar means, split by the binary variable, are significantly different.

pval <- df %>% group_by(bin) %>% summarise(p.value=t.test(xvar ~ factor(binary))$p.value)

However, I keep getting the error: "grouping factor must have exactly 2 levels"

I saw a similar post to this, but there the problem was how t.test was being run. I've run this same code using a different group_by object and it worked just fine. The data type was a factor and everything.

Any thoughts? I would also appreciate critiques on how to improve the way this question was posed.

You don't want to use dplyr for this. You want to fit a linear model.

mod <- lm(xvar ~ binary*bin, data=df)
anova(mod)

For further discussion of what the coefficients, P-values and sums of squares mean, consider asking on stats.SE.
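As a rough sketch of how the ANOVA table from that model could be read: the binary:bin row is the interaction term, which is one way of asking whether the difference in xvar between the two binary groups changes from bin to bin. The seed below is an assumption added only to make the example reproducible; it is not part of the original question.

```r
# Rebuild the sample data from the question (seed is an assumption)
set.seed(42)
df <- data.frame(
  id     = rep(1:20, times = 5),
  xvar   = seq(from = 2.0, to = 6.0, length.out = 100),
  yvar   = 1:100,
  binary = sample(x = c(0, 1), size = 100, replace = TRUE)
)
df$bin <- cut(df$yvar, breaks = c(0, 11, 21, 31, 41, 51, 61, 71, 81, 91, 100))

# Fit the model suggested above and keep the ANOVA table
mod <- lm(xvar ~ binary * bin, data = df)
tab <- anova(mod)

# P-value of the interaction row
interaction_p <- tab["binary:bin", "Pr(>F)"]
interaction_p
```

Because xvar in the sample data increases steadily with yvar, the bin term will dominate here; the sketch only shows where to look in the table, not a meaningful result.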

I think I've resolved the issue.

The "grouping factor must have exactly 2 levels" error comes up whenever there is not enough data for the t.test, that is, when a bin does not contain both values of binary. I had just assumed that my original data set, which is large, would have enough data to avoid this issue.
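One way to confirm this is to count how many distinct binary values each bin contains before running the test. The snippet below is a minimal sketch on made-up data (the toy data frame and its values are hypothetical, chosen so that bin "b" holds only zeros).

```r
library(dplyr)

# Hypothetical toy data: bin "b" contains only binary == 0
toy <- data.frame(
  bin    = rep(c("a", "b"), each = 6),
  binary = c(0, 1, 0, 1, 0, 1,  0, 0, 0, 0, 0, 0),
  xvar   = c(2.1, 2.9, 2.4, 3.2, 2.2, 3.1,  4.0, 4.3, 4.1, 4.6, 4.2, 4.4)
)

# Rows and number of distinct binary values per bin
level_check <- toy %>%
  group_by(bin) %>%
  summarise(n = n(), n_levels = n_distinct(binary))

# Bins where t.test(xvar ~ factor(binary)) would stop with the error
bad_bins <- level_check$bin[level_check$n_levels != 2]
bad_bins
```

Any bin listed in bad_bins has a one-level grouping factor, which is exactly the condition t.test complains about.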

When I made the sample data more robust, the error disappeared.
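If sparse bins cannot be ruled out in the real data, another option is to guard the test so that such bins yield NA instead of stopping the whole pipeline. This is only a sketch: safe_p is a hypothetical helper name, the toy data is made up, and the threshold of two observations per group is an assumption about what the default (Welch) t.test needs.

```r
library(dplyr)

# Hypothetical helper: returns NA instead of raising an error
safe_p <- function(x, g) {
  g <- factor(g)  # levels are built from the values present in this group
  # Require exactly two groups with at least two observations each
  if (nlevels(g) != 2 || any(table(g) < 2)) return(NA_real_)
  t.test(x ~ g)$p.value
}

# Hypothetical toy data: bin "b" contains only binary == 0
toy <- data.frame(
  bin    = rep(c("a", "b"), each = 6),
  binary = c(0, 1, 0, 1, 0, 1,  0, 0, 0, 0, 0, 0),
  xvar   = c(2.1, 2.9, 2.4, 3.2, 2.2, 3.1,  4.0, 4.3, 4.1, 4.6, 4.2, 4.4)
)

pval <- toy %>%
  group_by(bin) %>%
  summarise(p.value = safe_p(xvar, binary))
pval
```

Bin "a" gets an ordinary p-value, while bin "b" comes back as NA and can be inspected or dropped afterwards.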

Sorry for the wasted time, and thank you for your help!
