Extracting variables in R using frequencies
Suppose I have a data frame:
x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15
I want to create another data frame that includes only the x values that occur at least 3 times (a and b, in this case), and their highest corresponding y values.
So I want the output to be:
x y
a 9
b 13
Here 9 and 13 are the highest y values for a and b, respectively.
I tried using:
sort-(table(x,y))
but it did not work.
The data.table package is great for this. If df is the original data, you can do:
library(data.table)
setDT(df)[, .(y = max(y)[.N >= 3]), by=x]
#    x  y
# 1: a  9
# 2: b 13
.N is an integer that tells us how many rows are in each group (the groups are defined by x here). For a group with fewer than three rows, max(y)[.N >= 3] is an empty vector, so that group contributes no row to the result. So we just subset max(y) such that .N is at least three.
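The same logic can be spelled out in two steps, which may be easier to read: first compute the group size and the maximum for every x, then keep the groups that are large enough. This is only a sketch that rebuilds the example data; the names dt, n and res are illustrative and not part of the original answer.

```r
library(data.table)

# Rebuild the example data from the question
dt <- data.table(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15)
)

# Step 1: one row per x with its row count (n) and its largest y
# Step 2: keep only the groups with at least 3 rows, then drop n
res <- dt[, .(n = .N, y = max(y)), by = x][n >= 3, .(x, y)]
res
```

The chained form keeps the count visible as a column, which can help when checking why a particular x was dropped.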
Here's one way, using subset to omit any x that occurs fewer than 3 times, and then aggregate to find the maximum value by group:
d <- read.table(text='x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15', header=TRUE)
with(subset(d, x %in% names(which(table(d$x) >= 3))),
     aggregate(list(y=y), list(x=x), max))
#   x  y
# 1 a  9
# 2 b 13
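To see what the subset condition is doing, the one-liner can be unpacked into its intermediate pieces. The following is a sketch using base R only; the names counts, keep, d_kept and result are illustrative, and the formula interface to aggregate is used here in place of the list form above.

```r
# Rebuild the example data from the question
d <- data.frame(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15),
  stringsAsFactors = FALSE
)

counts <- table(d$x)                 # how often each x value occurs
keep <- names(which(counts >= 3))    # x values occurring at least 3 times
d_kept <- d[d$x %in% keep, ]         # rows belonging to those x values

# Largest y within each remaining x group
result <- aggregate(y ~ x, data = d_kept, FUN = max)
result
```

Printing counts on its own is a quick way to confirm which x values pass the threshold before aggregating.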
And for good measure, a dplyr approach:
library(dplyr)
d %>%
  group_by(x) %>%
  filter(n() >= 3) %>%
  summarise(max(y))
# Source: local data frame [2 x 2]
#
#   x max(y)
# 1 a      9
# 2 b     13
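Note that the summary column above is labelled max(y) rather than y. If the result should carry the column names x and y as in the requested output, the summary can be given a name. This is a small variation on the answer's code, shown as a sketch with the example data rebuilt inline; the name res is illustrative.

```r
library(dplyr)

# Rebuild the example data from the question
d <- data.frame(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15),
  stringsAsFactors = FALSE
)

res <- d %>%
  group_by(x) %>%
  filter(n() >= 3) %>%      # keep groups with at least 3 rows
  summarise(y = max(y))     # name the summary column y
res
```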