
Extracting variables in R using frequencies

Suppose I have a data frame:

 x  y
 a  1
 b  2
 a  3
 a  4
 b  5
 c  6
 a  7
 d  8
 a  9
 b 10
 e 12
 b 13
 c 15

I want to create another data frame that includes only the x values that occur at least 3 times (a and b, in this case), along with the highest y value for each.

The desired output is:

x   y
a   9
b   13

Here 9 and 13 are the highest y values for a and b, respectively.

I tried using:

sort-(table(x,y)) 

but it did not work.

The data.table package works well for this. If df is the original data, you can do:

library(data.table)
setDT(df)[, .(y = max(y)[.N >= 3]), by=x]
#    x  y
# 1: a  9
# 2: b 13

.N is an integer that tells us how many rows are in each group (the groups are defined by x here). Subsetting max(y) with .N >= 3 keeps the maximum only for groups with at least three rows; for smaller groups the result has length zero, so no row is returned.
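
To see why this subsetting drops the small groups, here is a minimal base R sketch. It assumes nothing beyond base R; the vector names y_a and y_c are illustrative and simply hold the y values of groups a and c from the example data, with length() standing in for .N.

```r
# y values for groups "a" (5 rows) and "c" (2 rows) from the example data
# (y_a, y_c, kept and dropped are illustrative names, not from the answer)
y_a <- c(1, 3, 4, 7, 9)
y_c <- c(6, 15)

# Group "a": the condition is TRUE, so the maximum is kept
kept <- max(y_a)[length(y_a) >= 3]

# Group "c": the condition is FALSE, so the result has length zero
dropped <- max(y_c)[length(y_c) >= 3]

print(kept)             # [1] 9
print(length(dropped))  # [1] 0
```

Indexing a length-one value with FALSE returns an empty vector, which is what lets a single expression both compute the maximum and filter the group.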

Here's one way, using subset to omit any x that occurs fewer than 3 times, and then aggregate to find the maximum value by group:

d <- read.table(text='x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15', header=TRUE)


with(subset(d, x %in% names(which(table(d$x) >= 3))),
     aggregate(list(y=y), list(x=x), max))

#   x  y
# 1 a  9
# 2 b 13
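
The nested call above can be hard to read, so here is a sketch of the same idea broken into intermediate steps. It is one possible decomposition, not the answer's own code: the names counts, frequent, kept and result are illustrative, the data frame is rebuilt with data.frame() instead of read.table(), and aggregate() is called through its formula interface.

```r
# Same example data as above, built directly as a data frame
d <- data.frame(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15)
)

# Illustrative intermediate names (not part of the original answer)
counts   <- table(d$x)                  # occurrences of each x value
frequent <- names(which(counts >= 3))   # x values seen at least 3 times
kept     <- d[d$x %in% frequent, ]      # rows belonging to those x values

# Formula interface: maximum of y within each remaining x
result <- aggregate(y ~ x, data = kept, FUN = max)
print(result)
#   x  y
# 1 a  9
# 2 b 13
```

Inspecting counts and frequent on their own is a quick way to check the threshold before aggregating.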

And for good measure, a dplyr approach:

library(dplyr)
d %>% 
  group_by(x) %>% 
  filter(n() >= 3) %>% 
  summarise(y = max(y))  # name the column y to match the desired output

# Source: local data frame [2 x 2]
# 
#   x  y
# 1 a  9
# 2 b 13
