Highest and lowest 100 observations for sorted multiple columns in R
I came up with a better partial solution. I have sorted my data frame by Sector and Volume.
df <- structure(list(Customer = structure(1:17, .Label = c("A", "B",
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O",
"P", "Q"), class = "factor"), Sector = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Aviation",
"Biotech", "Construction"), class = "factor"), Volume = c(-5000L,
-3000L, 4000L, 6000L, 7000L, 9000L, -4000L, -1500L, 2000L, 3000L,
5000L, 6000L, -7000L, -4000L, 5000L, 7000L, 8000L)),
class = "data.frame", row.names = c(NA,-17L))
EDITED:
## > df
## Customer Sector Volume
## A Aviation - 5000
## B Aviation - 3000
## C Aviation 4000
## D Aviation 6000
## E Aviation 7000
## F Aviation 9000
## G Biotech - 4000
## H Biotech - 1500
## I Biotech 2000
## J Biotech 3000
## K Biotech 5000
## L Biotech 6000
## M Construction - 7000
## N Construction - 4000
## O Construction 5000
## P Construction 7000
## Q Construction 8000
Let's say I would like to keep the highest and lowest 2 customers per sector. My final table should look like this:
## > df
## Customer Sector Volume
## A Aviation - 5000
## B Aviation - 3000
## E Aviation 7000
## F Aviation 9000
## G Biotech - 4000
## H Biotech - 1500
## K Biotech 5000
## L Biotech 6000
## M Construction - 7000
## N Construction - 4000
## P Construction 7000
## Q Construction 8000
The only difference is that, in my case, I would like to see the highest and lowest 100 customers per sector instead of just 2.
Since each column is sorted, you could remove NA values with na.omit and use head and tail to get the top and bottom 100 values.
sapply(df[-1], function(x) {x1 <- na.omit(x);c(head(x1, 100), tail(x1, 100))})
Or similarly, using apply with MARGIN = 2:
apply(df[-1], 2, function(x) {x1 <- na.omit(x);c(head(x1, 100), tail(x1, 100))})
We can also create an index to subset:
# the last 100 values start at position length(x1) - 99
sapply(df[-1], function(x)
{x1 <- na.omit(x);x1[c(1:100,(length(x1) - 99):length(x1))]})
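One caveat with the head/tail idea: when a column has fewer than 200 non-NA values, the two pieces overlap and some values come back twice. Below is a minimal sketch of a guard against that. The helper name head_tail and the toy vector are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): keep the first and last
# n non-NA values of an already sorted vector, without repeating elements
# when the vector holds no more than 2 * n values.
head_tail <- function(x, n = 100) {
  x1 <- x[!is.na(x)]            # drop NA values, keep the sort order
  len <- length(x1)
  if (len <= 2 * n) return(x1)  # head and tail would overlap: return everything
  x1[c(seq_len(n), seq(len - n + 1, len))]
}

# Toy sorted vector with one NA
x <- c(-5, -3, NA, 4, 6, 7, 9)
small <- head_tail(x, n = 2)    # lowest 2 and highest 2
all_x <- head_tail(x, n = 5)    # fewer than 10 values, so all 6 are returned
```

Because columns can end up with different lengths after this guard, lapply(df[-1], head_tail) is probably a safer wrapper than sapply, which only simplifies to a matrix when every result has the same length.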
EDIT
For the updated data, we can use slice from dplyr.
library(dplyr)
df %>% group_by(Sector) %>% slice(c(1:2, (n() -1):n()))
# Customer Sector Volume
# <fct> <fct> <int>
# 1 A Aviation -5000
# 2 B Aviation -3000
# 3 E Aviation 7000
# 4 F Aviation 9000
# 5 G Biotech -4000
# 6 H Biotech -1500
# 7 K Biotech 5000
# 8 L Biotech 6000
# 9 M Construction -7000
#10 N Construction -4000
#11 P Construction 7000
#12 Q Construction 8000
Or another way, using top_n:
bind_rows(df %>% group_by(Sector) %>% top_n(2, Volume),
df %>% group_by(Sector) %>% top_n(-2, Volume)) %>%
arrange(Sector)
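Since dplyr 1.0.0, top_n is superseded by slice_min and slice_max, so the same idea can be sketched with those verbs. The data frame df_demo and the variable n_keep below are stand-ins invented for this example. The distinct() step is an added assumption that covers sectors with fewer than 2 * n_keep customers, where the lowest and highest sets would otherwise share rows.

```r
library(dplyr)

# Small stand-in for the question's data (same columns, fewer rows)
df_demo <- data.frame(
  Customer = c("A", "B", "C", "D", "E", "G", "H", "I", "J", "K"),
  Sector   = rep(c("Aviation", "Biotech"), each = 5),
  Volume   = c(-5000, -3000, 4000, 6000, 7000,
               -4000, -1500, 2000, 3000, 5000),
  stringsAsFactors = FALSE
)

n_keep <- 2  # would be 100 for the real data

extremes <- bind_rows(
  # lowest n_keep volumes per sector
  df_demo %>% group_by(Sector) %>%
    slice_min(Volume, n = n_keep, with_ties = FALSE),
  # highest n_keep volumes per sector
  df_demo %>% group_by(Sector) %>%
    slice_max(Volume, n = n_keep, with_ties = FALSE)
) %>%
  ungroup() %>%
  distinct() %>%              # drop rows picked by both slices in small groups
  arrange(Sector, Volume)     # restore the sector/volume ordering
```

Unlike the slice(c(1:2, ...)) version, this does not rely on the data already being sorted by Volume within each sector.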
With the data.table package, you can get the desired output as follows:
library(data.table)
# convert the data.frame into a data.table
setDT(df)
# sort the data.table by Volume
setkey(df,Volume)
# rbind the highest 2 volumes per sector (tail) with the lowest
# 2 volumes per sector (head), then order by Customer and Sector
rbind(df[,tail(.SD,2),Sector],
df[,head(.SD,2),Sector])[order(Customer,Sector)]
## Sector Customer Volume
## 1: Aviation A -5000
## 2: Aviation B -3000
## 3: Aviation E 7000
## 4: Aviation F 9000
## 5: Biotech G -4000
## 6: Biotech H -1500
## 7: Biotech K 5000
## 8: Biotech L 6000
## 9: Construction M -7000
## 10: Construction N -4000
## 11: Construction P 7000
## 12: Construction Q 8000
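The same data.table idea can be written in a single grouped step with a configurable count. This is a sketch under stated assumptions: dt and n_keep are illustrative names, and the unique() call is an addition that avoids duplicated rows when a sector has fewer than 2 * n_keep customers.

```r
library(data.table)

# Small stand-in for the question's data (same columns, fewer rows)
dt <- data.table(
  Customer = c("A", "B", "C", "D", "E", "G", "H", "I", "J", "K"),
  Sector   = rep(c("Aviation", "Biotech"), each = 5),
  Volume   = c(-5000, -3000, 4000, 6000, 7000,
               -4000, -1500, 2000, 3000, 5000)
)

n_keep <- 2  # would be 100 for the real data

# Sort by sector, then by volume within each sector
setorder(dt, Sector, Volume)

# Per sector, take the first and last n_keep row positions;
# unique() removes positions selected by both head and tail
res <- dt[, .SD[unique(c(head(seq_len(.N), n_keep),
                         tail(seq_len(.N), n_keep)))],
          by = Sector]
```

As in the output above, the grouping column Sector comes first in the result.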