
How to filter results based on dependencies between two columns in one table and results from another table in R?

I am wondering if there is a way (a function or just a few elegant lines of code) to filter results given this troublesome data frame structure and column dependency.

I have a set of features that are highly correlated with each other (i.e. table 1).
I also have a separate table that lists another score for each individual feature (i.e. table 2).

Table 1:

feature1, feature2, feature_correlation_score  
a, b, 0.7      
c, d, 0.5  
b, a, 0.7   
d, c, 0.5     
e, f, 0.8
f, e, 0.8 

Table 2:

feature, label_correlation_score       
a, 0.20    
b, 0.15    
c, 0.08   
d, 0.04  
e, 0.02   
f, 0.02    

What I want to do is:
(1) Identify each unique feature1 and feature2 pair (i.e. a, b and b, a are the same pair).
(2) Then look up the label_correlation_score in table 2 for each feature in a pair, and keep only the feature with the higher label_correlation_score within each unique pair.
(3) Store the results in a new table that looks like this:

Final table:

feature, label_correlation_score  
a, 0.20  
c, 0.08  
e, 0.02

Note: it could be either e or f selected in the last row because their label_correlation_scores are the same.

Thanks in advance!

Edit: I'm also interested in what the equivalent code using data.table would be.

If you are okay with using the tidyverse, here's one approach.

  • First, we keep only rows for which feature1 is less than feature2 (alphabetically), thus removing duplicates. This assumes both orderings of every pair are always present.
  • Then, we join the label_correlation_score for both feature1 and feature2 (giving the columns suffixes _1 and _2, respectively).
  • Then, we store the largest score in the label_correlation_score column and the feature corresponding to this in the feature column.
  • Finally, we keep only the feature and label_correlation_score columns.

library(tidyverse)

df1 <- read_csv("feature1, feature2, feature_correlation_score
a, b, 0.7
c, d, 0.5
b, a, 0.7
d, c, 0.5
e, f, 0.8
f, e, 0.8")

df2 <- read_csv("feature, label_correlation_score
a, 0.20
b, 0.15
c, 0.08
d, 0.04
e, 0.02
f, 0.02 ")

df1 %>% 
  filter(feature1 < feature2) %>% 
  left_join(df2, by = c("feature1" = "feature")) %>% 
  left_join(df2, by = c("feature2" = "feature"), suffix = c("_1", "_2")) %>% 
  mutate(label_correlation_score = pmax(label_correlation_score_1, label_correlation_score_2),
         feature = if_else(label_correlation_score_1 > label_correlation_score_2, feature1, feature2)) %>% 
  select(feature, label_correlation_score)

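which gives the following (for the tied e/f pair, if_else() falls through to feature2, so f is returned; as noted in the question, either is acceptable):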

# A tibble: 3 x 2
  feature label_correlation_score
  <chr>                     <dbl>
1 a                          0.2 
2 c                          0.08
3 f                          0.02
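
For the data.table version asked about in the edit, here is a minimal sketch rather than a definitive implementation. The names dt1, dt2, pairs, f1, f2, score1 and score2 are illustrative choices, not from the question, and the toy data is built by hand instead of parsed from text. It puts each pair into a canonical order with pmin()/pmax(), so it does not rely on both orderings being present, and it breaks ties in favour of the alphabetically first feature. It assumes a data.table version that provides fifelse().

```r
library(data.table)

# Same toy data as in the question, built directly as data.tables
dt1 <- data.table(
  feature1 = c("a", "c", "b", "d", "e", "f"),
  feature2 = c("b", "d", "a", "c", "f", "e"),
  feature_correlation_score = c(0.7, 0.5, 0.7, 0.5, 0.8, 0.8)
)
dt2 <- data.table(
  feature = c("a", "b", "c", "d", "e", "f"),
  label_correlation_score = c(0.20, 0.15, 0.08, 0.04, 0.02, 0.02)
)

# (1) Put each pair in alphabetical order so (a, b) and (b, a) collapse to one row
pairs <- unique(dt1[, .(f1 = pmin(feature1, feature2),
                        f2 = pmax(feature1, feature2))])

# (2) Look up the label score of each member of the pair (update joins)
pairs[dt2, on = .(f1 = feature), score1 := i.label_correlation_score]
pairs[dt2, on = .(f2 = feature), score2 := i.label_correlation_score]

# (3) Keep the higher-scoring feature; on a tie, f1 is kept
result <- pairs[, .(
  feature = fifelse(score1 >= score2, f1, f2),
  label_correlation_score = pmax(score1, score2)
)]
print(result)
```

With the example data this keeps a, c and e, matching the final table in the question. Changing >= to > would pick f instead of e for the tied pair.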


 