How to filter results based on dependencies between two columns in one table and results from another table in R?
I am wondering if there is a way (a function or just a few elegant lines of code) that can help me filter results with this troublesome data frame structure/column dependency.
I have a scenario where I have features that are highly correlated with each other (i.e. table 1).
I also have a separate table that lists another score for each individual feature (i.e. table 2).
Table 1:
feature1, feature2, feature_correlation_score
a, b, 0.7
c, d, 0.5
b, a, 0.7
d, c, 0.5
e, f, 0.8
f, e, 0.8
Table 2:
feature, label_correlation_score
a, 0.20
b, 0.15
c, 0.08
d, 0.04
e, 0.02
f, 0.02
What I want to do is:
(1) Identify each unique feature1 and feature2 pair (i.e. a, b and b, a are the same).
(2) Then look up the label_correlation_score from table 2 for each feature in a pair, and keep only the feature with the highest label_correlation_score within each unique pair.
(3) Store the results in a new table that looks like this:
Final table:
feature, label_correlation_score
a, 0.20
c, 0.08
e, 0.02
Note: either e or f could be selected in the last row because their label_correlation_scores are the same.
Thanks in advance!
Edit: I'm also interested in what the equivalent code using data.table would be.
If you are okay with using the tidyverse, here's one approach:
1. Filter to the rows where feature1 is less than feature2, thus removing duplicates (assumes both versions are always available).
2. Join in the label_correlation_score for both feature1 and feature2 (giving the columns suffixes _1 and _2, respectively).
3. Store the higher score in the label_correlation_score column and the feature corresponding to this in the feature column.
4. Select only the feature and label_correlation_score columns.
library(tidyverse)
df1 <- read_csv("feature1, feature2, feature_correlation_score
a, b, 0.7
c, d, 0.5
b, a, 0.7
d, c, 0.5
e, f, 0.8
f, e, 0.8")
df2 <- read_csv("feature, label_correlation_score
a, 0.20
b, 0.15
c, 0.08
d, 0.04
e, 0.02
f, 0.02")
df1 %>%
  # keep one row per unordered pair
  filter(feature1 < feature2) %>%
  # attach the label score of each feature in the pair
  left_join(df2, by = c("feature1" = "feature")) %>%
  left_join(df2, by = c("feature2" = "feature"), suffix = c("_1", "_2")) %>%
  # keep the higher score and the feature it belongs to (ties go to feature2)
  mutate(label_correlation_score = pmax(label_correlation_score_1, label_correlation_score_2),
         feature = if_else(label_correlation_score_1 > label_correlation_score_2, feature1, feature2)) %>%
  select(feature, label_correlation_score)
which gives
# A tibble: 3 x 2
  feature label_correlation_score
  <chr>                     <dbl>
1 a                          0.2
2 c                          0.08
3 f                          0.02
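For the data.table variant asked about in the question's edit, a minimal sketch might look like the following. It is not part of the answer above, and the names dt1, dt2, pairs, lo, hi, score_lo and score_hi are illustrative assumptions. Instead of filtering on feature1 < feature2, it builds an order-independent key with pmin/pmax, so it should also work when only one direction of a pair is present. Ties are resolved in favour of the alphabetically smaller feature, which is an arbitrary choice.

```r
library(data.table)

# Same toy data as in the question, built directly as data.tables
dt1 <- data.table(
  feature1 = c("a", "c", "b", "d", "e", "f"),
  feature2 = c("b", "d", "a", "c", "f", "e"),
  feature_correlation_score = c(0.7, 0.5, 0.7, 0.5, 0.8, 0.8)
)
dt2 <- data.table(
  feature = c("a", "b", "c", "d", "e", "f"),
  label_correlation_score = c(0.20, 0.15, 0.08, 0.04, 0.02, 0.02)
)

# Order-independent pair key: (a, b) and (b, a) both become lo = "a", hi = "b"
pairs <- unique(dt1[, .(lo = pmin(feature1, feature2),
                        hi = pmax(feature1, feature2))])

# Look up the label score of each member of the pair (update joins against dt2)
pairs[dt2, on = .(lo = feature), score_lo := i.label_correlation_score]
pairs[dt2, on = .(hi = feature), score_hi := i.label_correlation_score]

# Keep the member with the higher score; ties go to lo (an assumption)
result <- pairs[, .(
  feature = fifelse(score_lo >= score_hi, lo, hi),
  label_correlation_score = pmax(score_lo, score_hi)
)]
print(result)
```

With the sample data this keeps a, c and e, matching the final table in the question. If a feature is missing from table 2, its score stays NA and the corresponding row of the result will be NA as well, so that case would need separate handling.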