
Highest and lowest 100 observations for sorted multiple columns in R

I came up with a more optimal semi-solution. I have sorted my data frame by Sector and Volume.

df <- structure(list(Customer = structure(1:17, .Label = c("A", "B", 
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", 
"P", "Q"), class = "factor"), Sector = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Aviation", 
"Biotech", "Construction"), class = "factor"), Volume = c(-5000L, 
-3000L, 4000L, 6000L, 7000L, 9000L, -4000L, -1500L, 2000L, 3000L, 
5000L, 6000L, -7000L, -4000L, 5000L, 7000L, 8000L)), 
class = "data.frame", row.names = c(NA,-17L))

EDITED:

## > df
##   Customer  Sector      Volume
##     A      Aviation     - 5000
##     B      Aviation     - 3000
##     C      Aviation       4000
##     D      Aviation       6000
##     E      Aviation       7000
##     F      Aviation       9000
##     G      Biotech      - 4000
##     H      Biotech      - 1500
##     I      Biotech        2000
##     J      Biotech        3000
##     K      Biotech        5000
##     L      Biotech        6000
##     M      Construction - 7000
##     N      Construction - 4000
##     O      Construction   5000
##     P      Construction   7000
##     Q      Construction   8000

Let's say I would like to keep only the 2 highest and 2 lowest customers per sector. My final table should then look like this:

## > df
##   Customer  Sector      Volume
##     A      Aviation     - 5000
##     B      Aviation     - 3000
##     E      Aviation       7000
##     F      Aviation       9000
##     G      Biotech      - 4000
##     H      Biotech      - 1500
##     K      Biotech        5000
##     L      Biotech        6000
##     M      Construction - 7000
##     N      Construction - 4000
##     P      Construction   7000
##     Q      Construction   8000

The only difference is that, in my real case, I would like to keep the highest and lowest 100 customers per sector instead of just 2.

Since each column is sorted, you could remove NA values with na.omit and use head and tail to get the top and bottom 100 values.

sapply(df[-1], function(x) {x1 <- na.omit(x);c(head(x1, 100), tail(x1, 100))})

Or similarly, using apply with MARGIN = 2:

apply(df[-1], 2, function(x) {x1 <- na.omit(x);c(head(x1, 100), tail(x1, 100))})
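To see what the head/tail calls above return, here is a minimal sketch on made-up data. The data frame `toy`, its columns `a` and `b`, and the value `n <- 2` are assumptions for illustration only; with the real data `n` would be 100.

```r
# Hypothetical toy data: each column is sorted, with NA padding at the end
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, NA, NA),
                  b = c(10, 20, 30, 40, 50, 60, 70, NA))
n <- 2  # use 100 for the real data

res <- sapply(toy, function(x) {
  x1 <- na.omit(x)             # drop the NA padding
  c(head(x1, n), tail(x1, n))  # first n values, then last n values
})
res  # a matrix with 2 * n rows and one column per input column
```

Because every column contributes exactly 2 * n values here, sapply can simplify the result to a matrix. If a column had fewer than n non-NA values, the lengths would differ and sapply would return a list instead.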

We can also create an index to subset with:

sapply(df[-1], function(x) {
  x1 <- na.omit(x)
  # the last 100 positions are (length - 99):length; "- 100" would give 101 values
  x1[c(1:100, (length(x1) - 99):length(x1))]
})
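The index-based approach assumes every column has at least 200 non-NA values. With fewer, the two index ranges overlap (or run out of bounds) and values are repeated or come back as NA. A possible guard is sketched below; `head_tail` is a hypothetical helper name, not part of the original answer.

```r
# Hypothetical helper: first n and last n non-NA values, without repeats
head_tail <- function(x, n) {
  x1 <- x[!is.na(x)]
  pos <- seq_along(x1)
  # unique() drops positions that appear in both the head and the tail
  idx <- unique(c(head(pos, n), tail(pos, n)))
  x1[idx]
}

head_tail(1:10, 3)       # enough values: 3 from each end
head_tail(c(1, 2, 3), 2) # too few values: every value is returned once
```

When used inside sapply, a helper like this may return vectors of different lengths, in which case the result is a list rather than a matrix.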

EDIT

For the updated data we can use slice from dplyr. This relies on the rows already being sorted by Volume within each Sector.

library(dplyr)
df %>% group_by(Sector) %>% slice(c(1:2, (n() -1):n()))


#  Customer Sector       Volume
#   <fct>    <fct>         <int>
# 1 A        Aviation      -5000
# 2 B        Aviation      -3000
# 3 E        Aviation       7000
# 4 F        Aviation       9000
# 5 G        Biotech       -4000
# 6 H        Biotech       -1500
# 7 K        Biotech        5000
# 8 L        Biotech        6000
# 9 M        Construction  -7000
#10 N        Construction  -4000
#11 P        Construction   7000
#12 Q        Construction   8000
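The slice call can be generalised from 2 to any number of rows per group. The sketch below rebuilds the question's data with plain character columns, sorts explicitly rather than assuming the input order, and uses a variable named `n_keep` (an assumption for illustration; set it to 100 for the real data). The `unique()` call is an added guard for sectors with fewer than 2 * n_keep customers.

```r
library(dplyr)

# Same values as the question's data, built with character columns
df <- data.frame(
  Customer = LETTERS[1:17],
  Sector = rep(c("Aviation", "Biotech", "Construction"), times = c(6, 6, 5)),
  Volume = c(-5000, -3000, 4000, 6000, 7000, 9000,
             -4000, -1500, 2000, 3000, 5000, 6000,
             -7000, -4000, 5000, 7000, 8000)
)

n_keep <- 2  # hypothetical parameter; use 100 for the real data

res <- df %>%
  arrange(Sector, Volume) %>%   # do not rely on the input being pre-sorted
  group_by(Sector) %>%
  # first n_keep and last n_keep row positions in each group, without repeats
  slice(unique(c(head(seq_len(n()), n_keep),
                 tail(seq_len(n()), n_keep)))) %>%
  ungroup()
res
```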

Or another way, using top_n, which selects by the values of Volume and so does not depend on the row order:

bind_rows(df %>% group_by(Sector) %>% top_n(2, Volume),
          df %>% group_by(Sector) %>% top_n(-2, Volume)) %>%
arrange(Sector)
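In dplyr 1.0.0 and later, top_n is superseded by slice_min and slice_max. A roughly equivalent sketch with those functions follows; the rebuilt `df` and the variable `n_keep` are assumptions for illustration, and `with_ties = FALSE` is a choice made here so that each call returns at most n_keep rows per group.

```r
library(dplyr)

# Same values as the question's data, built with character columns
df <- data.frame(
  Customer = LETTERS[1:17],
  Sector = rep(c("Aviation", "Biotech", "Construction"), times = c(6, 6, 5)),
  Volume = c(-5000, -3000, 4000, 6000, 7000, 9000,
             -4000, -1500, 2000, 3000, 5000, 6000,
             -7000, -4000, 5000, 7000, 8000)
)

n_keep <- 2  # hypothetical parameter; use 100 for the real data

res <- bind_rows(
  df %>% group_by(Sector) %>% slice_min(Volume, n = n_keep, with_ties = FALSE),
  df %>% group_by(Sector) %>% slice_max(Volume, n = n_keep, with_ties = FALSE)
) %>%
  ungroup() %>%
  arrange(Sector, Volume)
res
```

Note that if a sector has fewer than 2 * n_keep customers, the same row can be picked by both calls and will then appear twice in the combined result.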

Using library(data.table), you can achieve the desired output as follows:

library(data.table)
# convert the data.frame into a data.table
setDT(df) 
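# sort the data.table by Volume (ascending)
setkey(df, Volume)
# rbind the highest 2 volumes per sector (tail) with the
# lowest 2 volumes per sector (head), then sort the result
# by Sector and Volume so it does not depend on customer names
rbind(df[, tail(.SD, 2), Sector],
      df[, head(.SD, 2), Sector])[order(Sector, Volume)]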


##           Sector Customer Volume
##  1:     Aviation        A  -5000
##  2:     Aviation        B  -3000
##  3:     Aviation        E   7000
##  4:     Aviation        F   9000
##  5:      Biotech        G  -4000
##  6:      Biotech        H  -1500
##  7:      Biotech        K   5000
##  8:      Biotech        L   6000
##  9: Construction        M  -7000
## 10: Construction        N  -4000
## 11: Construction        P   7000
## 12: Construction        Q   8000
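For readers who prefer to avoid extra packages, the same per-sector head/tail idea can be sketched in base R with ave. The rebuilt `df`, the variable `n_keep`, and the intermediate names `rank_low` and `rank_high` are assumptions for illustration, not part of the answers above.

```r
# Same values as the question's data, built with character columns
df <- data.frame(
  Customer = LETTERS[1:17],
  Sector = rep(c("Aviation", "Biotech", "Construction"), times = c(6, 6, 5)),
  Volume = c(-5000, -3000, 4000, 6000, 7000, 9000,
             -4000, -1500, 2000, 3000, 5000, 6000,
             -7000, -4000, 5000, 7000, 8000)
)

n_keep <- 2  # hypothetical parameter; use 100 for the real data

# Sort by sector, then by volume within each sector
ord <- df[order(df$Sector, df$Volume), ]

# Position of each row counted from the bottom and from the top of its sector
rank_low  <- ave(ord$Volume, ord$Sector, FUN = seq_along)
rank_high <- ave(ord$Volume, ord$Sector, FUN = function(v) rev(seq_along(v)))

# Keep rows that are among the n_keep lowest or the n_keep highest
res <- ord[rank_low <= n_keep | rank_high <= n_keep, ]
res
```

Because each row is tested once against both conditions, a sector with fewer than 2 * n_keep customers simply returns all of its rows, with no duplicates.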
