Compare rows by different columns in a dataframe

I have a data frame that looks like this:

id   value1  value2   value3   value4
A      14       24       22        9
B      51       25       29       33
C       4       16        8       10
D       1        4        2        4       

Now I want to compare each row with every other row, column by column, in order to identify the rows in which every value is higher.

So, for example, for id D this would be A, B and C. For C it would be B, for A it is B, and for B there is no such row.

I tried to do that by looping through the rows and comparing every column, but that takes a lot of time. The original dataset has about 5000 rows and 20 columns to compare. I am sure there is a more efficient way to do this. Thanks for your help!

I think this works just fine:

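# column index of the id column, so it can be excluded from the comparison
ind <- which(names(df) == "id")
# for each row x, return the ids of the rows whose values all exceed those of x
apply(df[,-ind],1,function(x) df$id[!rowSums(!t(x < t(df[,-ind])))] )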
# [[1]]
# [1] "B"
# 
# [[2]]
# character(0)
# 
# [[3]]
# [1] "B"
# 
# [[4]]
# [1] "A" "B" "C"
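
If the apply() call above turns out to be slow on data of the size mentioned in the question, one alternative is to build the comparison one column at a time with outer(). This is only a minimal sketch, not a benchmarked solution: the helper name dominated_by and its id_col argument are made up for illustration, and it assumes every non-id column is numeric. It allocates n x n logical matrices, which for 5000 rows is on the order of 100 MB each.

```r
# Hypothetical helper (not from the answer above): returns one row per
# (id, higher) pair where `higher` exceeds `id` in every value column.
dominated_by <- function(df, id_col = "id") {
  m <- as.matrix(df[, names(df) != id_col, drop = FALSE])
  n <- nrow(m)
  # all_higher[i, j] ends up TRUE when row j is larger than row i in every column
  all_higher <- matrix(TRUE, n, n)
  for (k in seq_len(ncol(m))) {
    all_higher <- all_higher & outer(m[, k], m[, k], "<")
  }
  hits <- which(all_higher, arr.ind = TRUE)
  data.frame(id     = df[[id_col]][hits[, "row"]],
             higher = df[[id_col]][hits[, "col"]],
             stringsAsFactors = FALSE)
}

# Example data copied from the question
df <- data.frame(
  id = c("A", "B", "C", "D"),
  value1 = c(14, 51, 4, 1),
  value2 = c(24, 25, 16, 4),
  value3 = c(22, 29, 8, 2),
  value4 = c(9, 33, 10, 4),
  stringsAsFactors = FALSE)

res <- dominated_by(df)
```

The comparison is strict, so a row never matches itself and tied values do not count as higher.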

I don't know of a single function that does this task. Here is how I would do it.

library(dplyr)

DF <- data.frame(
  id = c("A", "B", "C", "D"),
  value1 = c(14, 51, 4, 1),
  value2 = c(24, 25, 16, 4),
  value3 = c(22, 29, 8, 2),
  value4 = c(9, 33, 10, 4),
  stringsAsFactors = FALSE)

# get the order for each value
tmp <- lapply(select(DF, -id), function(x) DF$id[order(x)]) 

# find a set of "biggers" for each id 
tmp <- lapply(tmp, function(x) data.frame(
    id = rep(x, rev(seq_along(x))-1), 
    bigger = x[lapply(seq_along(x), function(i)
      which(seq_along(x) > i)) %>% unlist()],
    stringsAsFactors = FALSE)) 

# inner_join all, this keeps "biggers" in all columns
out <- NULL
for (v in tmp) {
  if (is.null(out)) {
    out <- v
  } else {
    out <- inner_join(out, v, by = c("id", "bigger"))
  }
}

This gives you:

out
#  id bigger
#1  D      C
#2  D      A
#3  D      B
#4  C      B
#5  A      B

Here's an approach that returns the results as a data frame.

library(tidyr)
library(dplyr)

# reshape data to long format
td <- d %>% gather(key, value, value1:value4)

# create a copy w/ different names for merging
td2 <- td %>% select(id2 = id, key, value2 = value)

# full outer join to produce one row per pair of IDs
dd <- merge(td, td2, by = "key", all = TRUE)

# the result
dd %>%
  filter(id != id2) %>% 
  group_by(id, id2) %>%
  summarise(all_less = !any(value >= value2)) %>%
  filter(all_less)

Results (every value of id is less than the corresponding value of id2):

     id    id2 all_less
  (fctr) (fctr)    (lgl)
1      A      B     TRUE
2      C      B     TRUE
3      D      A     TRUE
4      D      B     TRUE
5      D      C     TRUE

Data:

d <- structure(list(
  id = structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor"), 
  value1 = c(14L, 51L, 4L, 1L), 
  value2 = c(24L, 25L, 16L, 4L), 
  value3 = c(22L, 29L, 8L, 2L), value4 = c(9L, 33L, 10L, 4L)
), 
.Names = c("id", "value1", "value2", "value3", "value4"), 
class = "data.frame", row.names = c(NA, -4L)
)
