Compare rows by different columns in a dataframe
I have a data frame that looks like this:
id value1 value2 value3 value4
A 14 24 22 9
B 51 25 29 33
C 4 16 8 10
D 1 4 2 4
Now I want to compare each row with every other row, column by column, in order to identify the rows in which every value is higher.
For example, for id D these would be A, B and C. For C it would be B, for A it is B, and for B there is no such row.
I tried to do that by looping through the rows and comparing every column, but that takes a lot of time. The original dataset has about 5000 rows and 20 columns to compare. I am sure there is a more efficient way to do this. Thanks for your help!
I think this works just fine:
ind <- which(names(df) == "id")
apply(df[,-ind],1,function(x) df$id[!rowSums(!t(x < t(df[,-ind])))] )
# [[1]]
# [1] "B"
#
# [[2]]
# character(0)
#
# [[3]]
# [1] "B"
#
# [[4]]
# [1] "A" "B" "C"
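The one-liner above is dense, so here is a step-by-step sketch of the same comparison in base R. The helper name `strictly_higher` and the use of `sweep()` are my own choices for illustration, not part of the answer; the data frame is rebuilt from the question so the block runs on its own.

```r
# Example data from the question
df <- data.frame(
  id = c("A", "B", "C", "D"),
  value1 = c(14, 51, 4, 1),
  value2 = c(24, 25, 16, 4),
  value3 = c(22, 29, 8, 2),
  value4 = c(9, 33, 10, 4),
  stringsAsFactors = FALSE
)

# Numeric matrix of the value columns, one row per id
m <- as.matrix(df[, names(df) != "id"])

# Hypothetical helper: ids of all rows that exceed row i in every column
strictly_higher <- function(i) {
  # Subtract row i from every row, column by column
  diffs <- sweep(m, 2, m[i, ], FUN = "-")
  # A row qualifies only when all of its differences are positive
  df$id[rowSums(diffs > 0) == ncol(m)]
}

# One entry per id, named by that id
res <- setNames(lapply(seq_len(nrow(m)), strictly_higher), df$id)
```

Here `res$D` holds the ids that are higher than D in every column, and `res$B` is empty. The result is a named list, which may be easier to index than the unnamed list returned by `apply()`.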
I don't know of a simple function for this task. Here is how I would do it.
library(dplyr)
DF <- data.frame(
id = c("A", "B", "C", "D"),
value1 = c(14, 51, 4, 1),
value2 = c(24, 25, 16, 4),
value3 = c(22, 29, 8, 2),
value4 = c(9, 33, 10, 4),
stringsAsFactors = FALSE)
# get the order for each value
tmp <- lapply(select(DF, -id), function(x) DF$id[order(x)])
# find a set of "biggers" for each id
tmp <- lapply(tmp, function(x) data.frame(
id = rep(x, rev(seq_along(x))-1),
bigger = x[lapply(seq_along(x), function(i)
which(seq_along(x) > i)) %>% unlist()],
stringsAsFactors = FALSE))
# inner_join all, this keeps "biggers" in all columns
out <- NULL
for (v in tmp) {
if (is.null(out)) {
out <- v
} else {
out <- inner_join(out, v, by = c("id", "bigger"))
}
}
This gets you:
out
# id bigger
#1 D C
#2 D A
#3 D B
#4 C B
#5 A B
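The same idea (find the "bigger" pairs per column, then keep only the pairs that appear in every column) can also be sketched in base R with `outer()`. This is an alternative I am suggesting, not the code above: it assumes the value columns are numeric and that an n x n logical matrix fits in memory, and the names `all_bigger` and `pairs` are made up for the example.

```r
DF <- data.frame(
  id = c("A", "B", "C", "D"),
  value1 = c(14, 51, 4, 1),
  value2 = c(24, 25, 16, 4),
  value3 = c(22, 29, 8, 2),
  value4 = c(9, 33, 10, 4),
  stringsAsFactors = FALSE
)

vals <- DF[, names(DF) != "id"]
n <- nrow(DF)

# all_bigger[i, j] ends up TRUE when row j is larger than row i in every column
all_bigger <- matrix(TRUE, n, n)
for (x in vals) {
  # outer(x, x, "<")[i, j] is x[i] < x[j] for the current column
  all_bigger <- all_bigger & outer(x, x, "<")
}

# Turn the TRUE cells into (id, bigger) pairs
idx <- which(all_bigger, arr.ind = TRUE)
pairs <- data.frame(
  id = DF$id[idx[, "row"]],
  bigger = DF$id[idx[, "col"]],
  stringsAsFactors = FALSE
)
```

For the 5000-row case in the question, each logical matrix has 25 million cells (roughly 100 MB in R), which is why the loop keeps a single running matrix instead of storing one per column. Note also that the strict `<` drops any pair that ties in some column, whereas the `order()`-based code above ranks tied values by position, so the two approaches can differ when a column contains duplicates.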
Here's an approach that returns the results as a data frame.
library(tidyr)
library(dplyr)
# reshape data to long format
td <- d %>% gather(key, value, value1:value4)
# create a copy w/ different names for merging
td2 <- td %>% select(id2 = id, key, value2 = value)
# full outer join to produce one row per pair of IDs
dd <- merge(td, td2, by = "key", all = TRUE)
# the result
dd %>%
filter(id != id2) %>%
group_by(id, id2) %>%
summarise(all_less = !any(value >= value2)) %>%
filter(all_less)
Results (id is less than id2):
id id2 all_less
(fctr) (fctr) (lgl)
1 A B TRUE
2 C B TRUE
3 D A TRUE
4 D B TRUE
5 D C TRUE
Data:
d <- structure(list(
id = structure(1:4, .Label = c("A", "B", "C", "D"), class = "factor"),
value1 = c(14L, 51L, 4L, 1L),
value2 = c(24L, 25L, 16L, 4L),
value3 = c(22L, 29L, 8L, 2L), value4 = c(9L, 33L, 10L, 4L)
),
.Names = c("id", "value1", "value2", "value3", "value4"),
class = "data.frame", row.names = c(NA, -4L)
)