summarise returning -Inf when using na.rm = TRUE
I recently built a simple R script to summarize three different data frames. Since updating to the newest version of R and RStudio, I am getting output I haven't seen before when using the summarise function in dplyr on one of the data frames (the other two are fine). I also receive a series of warnings that are unfamiliar to me. Please note that prior to updating, I ran the script exactly as written with no issues for any of the data frames.
The data frame with the problem is called VO2 and it is set up as follows:
Name Sex VO2
AthleteA M 50
AthleteA M 52
AthleteA M NA
AthleteB M 49
AthleteB M 56
AthleteB M 47
AthleteC M 42
AthleteC M NA
AthleteC M 41
AthleteD M NA
AthleteD M NA
AthleteD M NA
The code I run is:
Test.Summary.VO2 = VO2 %>% group_by(Name, Sex) %>%
summarise(Best.Score = max(VO2, na.rm=TRUE))
This code generates the following summary:
Name Sex Best.Score
AthleteA M 52
AthleteB M 56
AthleteC M 42
AthleteD M -Inf
The -Inf value is completely new in the output. I cannot figure out why it now appears for groups that contain only NAs.
As mentioned above, I have the exact same layout for a second data frame and run the same type of summary. There everything works fine: when I summarize with na.rm=TRUE, it drops the NA cases without replacing them with -Inf.
Where this gets a bit more unusual is that when I view the data frame using:
View(Test.Summary.VO2)
I receive the following series of warning messages:
There were 38 warnings (use warnings() to see them)
warnings()
Warning messages:
1: Unknown or uninitialised column: 'Quad'.
2: Unknown or uninitialised column: 'Quad'.
3: Unknown or uninitialised column: 'Quad'.
4: Unknown or uninitialised column: 'Quad'.
Later on in the script I generate a new variable called "Quad". But the warning above appears even after I clear the environment and restart RStudio. I have even tried renaming the .csv file and importing it under a different data frame name. It's almost as if the column 'Quad' that is generated later in the script is hanging around somewhere in the environment.
I am really at a loss as to what might be happening here.
I hope one of the R experts on Stack can give me an idea of how to remedy this issue.
Thanks for your consideration.
See ?max:
The minimum and maximum of a numeric empty set are +Inf and -Inf (in this order!) which ensures transitivity, e.g., min(x1, min(x2)) == min(x1, x2). For numeric x, max(x) == -Inf and min(x) == +Inf whenever length(x) == 0 (after removing missing values if requested). However, pmax and pmin return NA if all the parallel elements are NA even for na.rm = TRUE.
You don't have any non-NA values for group D (AthleteD), so max returns the value for an empty set.
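The behaviour described above can be reproduced in base R without dplyr. This is a minimal sketch; the vectors x, x1 and x2 are made up for illustration, and suppressWarnings is used only because max also warns ("no non-missing arguments to max; returning -Inf") when it is handed an empty set.

```r
# An all-NA group becomes an empty vector once na.rm = TRUE drops the NAs.
x <- c(NA_real_, NA_real_, NA_real_)

# max() of an empty set is -Inf; base R also emits a warning here,
# which is suppressed so that the result can be inspected quietly.
empty_max <- suppressWarnings(max(x, na.rm = TRUE))
print(empty_max)

# The identity quoted from ?max relies on exactly that convention.
x1 <- c(3, 7)
x2 <- numeric(0)
identity_holds <- suppressWarnings(max(x1, max(x2))) == max(x1, x2)
print(identity_holds)
```

So the -Inf for AthleteD is the documented result of taking the maximum of an empty set, not a value that exists in the data.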
Late to the party, but a solution would be to return NA instead of -Inf when there is no value to maximize. This could be done with the hablar package's s function.
library(dplyr)
library(hablar)
VO2 %>%
group_by(Name, Sex) %>%
summarise(Best.Score = max(s(VO2)))
which gives you:
Name Sex Best.Score
<chr> <chr> <int>
1 AthleteA M 52
2 AthleteB M 56
3 AthleteC M 42
4 AthleteD M NA
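If adding a dependency is not desirable, the same result can probably be reached with a small helper. The sketch below assumes dplyr 1.0.0 or later (for the .groups argument); max_or_na is a hypothetical name introduced here, not a function from dplyr or hablar, and the data frame is rebuilt from the values in the question.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr or hablar): returns NA when every
# value is missing, otherwise the usual maximum with NAs removed.
max_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
}

# Rebuild the data frame from the question.
VO2 <- data.frame(
  Name = rep(c("AthleteA", "AthleteB", "AthleteC", "AthleteD"), each = 3),
  Sex  = "M",
  VO2  = c(50, 52, NA, 49, 56, 47, 42, NA, 41, NA, NA, NA)
)

# Same summary as in the question, with the helper in place of max().
Test.Summary.VO2 <- VO2 %>%
  group_by(Name, Sex) %>%
  summarise(Best.Score = max_or_na(VO2), .groups = "drop")

print(Test.Summary.VO2)
```

Checking all(is.na(x)) before calling max also avoids the warning that max raises for an empty set, since max is never called on the all-NA group.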