Extracting variables in R using frequency

Question

Suppose I have a data frame:

I want to create another data frame that includes only the x values that occur at least 3 times (a and b, in this case), together with their highest corresponding y values.

So I want the output as:

x   y
a   9
b   13

Here 9 and 13 are the highest y values for a and b, respectively.

I tried using:

sort-(table(x,y))

but it did not work.

Answer 1

The data.table package is great for this. If df is the original data, you can do

library(data.table)
setDT(df)[, .(y = max(y)[.N >= 3]), by=x]
#    x  y
# 1: a  9
# 2: b 13

.N is an integer that tells us how many rows are in each group (grouped by x here). So we just subset max(y) such that .N is at least three. When .N is less than three, the subset is zero-length, so that group contributes no row to the result.
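For comparison, the same filter can be written with an explicit `if` on .N. This is only a sketch of an alternative spelling, not the form used in the answer; it assumes the data.table package is installed, and the object names `dt` and `res` and the sample rows (copied from the read.table call in Answer 2) are chosen here for illustration.

```r
library(data.table)

# Sample data, copied from the read.table call in Answer 2
dt <- data.table(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15)
)

# Return a row only for groups with at least 3 rows.
# `if` without `else` evaluates to NULL for the smaller groups,
# and data.table drops groups whose result is NULL.
res <- dt[, if (.N >= 3) .(y = max(y)), by = x]
print(res)
```

Both spellings rely on the same idea: a group that produces no rows simply disappears from the output.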

Answer 2

Here's one way, using subset to omit any x that occurs fewer than 3 times, and then aggregate to find the maximum value by group:

d <- read.table(text='x y
a 1
b 2
a 3
a 4
b 5
c 6
a 7
d 8
a 9
b 10
e 12
b 13
c 15', header=TRUE)


with(subset(d, x %in% names(which(table(d$x) >= 3))),
     aggregate(list(y=y), list(x=x), max))

#   x  y
# 1 a  9
# 2 b 13
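The one-liner above packs several steps together. A rough step-by-step sketch in base R, using the same sample data, may make the intermediate values easier to inspect; the names `counts`, `keep`, `d_sub` and `result` are introduced here for illustration only.

```r
# Sample data, same rows as the read.table call above
d <- data.frame(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15)
)

# Step 1: count how often each x value occurs
counts <- table(d$x)

# Step 2: keep the names of the values that occur at least 3 times
keep <- names(which(counts >= 3))

# Step 3: drop the rows belonging to all other x values
d_sub <- d[d$x %in% keep, ]

# Step 4: take the maximum y within each remaining group
result <- aggregate(y ~ x, data = d_sub, FUN = max)
print(result)
```

Printing `counts` and `keep` along the way shows which x values pass the frequency threshold before any maximum is computed.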

And for good measure, a dplyr approach:

library(dplyr)
d %>% 
  group_by(x) %>% 
  filter(n() >= 3) %>% 
  summarise(max(y))


# Source: local data frame [2 x 2]
# 
#    x max(y)
# 1 a      9
# 2 b     13
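The summary column above is labelled max(y) rather than y. If the column should be called y, as in the output requested in the question, one option is to name it inside summarise. A minimal sketch, assuming dplyr is installed; `res` is a name chosen here, and the data frame repeats the sample data from above.

```r
library(dplyr)

# Sample data, same rows as the read.table call above
d <- data.frame(
  x = c("a", "b", "a", "a", "b", "c", "a", "d", "a", "b", "e", "b", "c"),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15)
)

res <- d %>%
  group_by(x) %>%          # one group per distinct x
  filter(n() >= 3) %>%     # keep only groups with at least 3 rows
  summarise(y = max(y))    # name the summary column y explicitly
print(res)
```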

2 answers

Answer 1
7 2015-01-22 01:53:44

Answer 2
6 Accepted 2015-01-22 01:51:32