How can I filter results in R based on the dependency between two columns in one table and the scores in another table?

Question

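I would like to know whether there is a way (a function or a few elegant lines of code) to filter results when the data frame has this troublesome structure and column dependency.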

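I have a scenario where my features are highly correlated with one another (see Table 1).
I also have a separate table that lists another score for each individual feature (see Table 2).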

Table 1:

feature1, feature2, feature_correlation_score
a, b, 0.7
c, d, 0.5
b, a, 0.7
d, c, 0.5
e, f, 0.8
f, e, 0.8

Table 2:

feature, label_correlation_score       
a, 0.20    
b, 0.15    
c, 0.08   
d, 0.04  
e, 0.02   
f, 0.02

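What I want to do is:
(1) Identify each unique pair of feature1 and feature2 (i.e. a, b and b, a are the same pair).
(2) Then look up the label_correlation_score in Table 2 for each of the two features in a pair, and keep only the feature with the highest label_correlation_score within each unique pair.
(3) Store the result in a new table, like this: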

Final table:

feature, label_correlation_score  
a, 0.20  
c, 0.08  
e, 0.02

Note: either e or f may be chosen in the last row, because their label_correlation_score values are the same.

Thanks in advance!

Edit: I am also interested in equivalent code using data.table.

Answer 1

If you can use the tidyverse, here is one approach.

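First, we keep only the rows where feature1 is less than feature2, which removes the duplicates (assuming both orderings of each pair are always present).
Then we join the label_correlation_score for feature1 and for feature2, giving the two resulting columns the suffixes _1 and _2 respectively.
Next, we store the larger of the two scores in the label_correlation_score column and the feature it belongs to in the feature column. Because the comparison uses a strict >, a tie goes to feature2.
Finally, we keep only the feature and label_correlation_score columns.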

library(tidyverse)

# Table 1: pairwise feature correlations (each pair appears in both orders)
df1 <- read_csv("feature1, feature2, feature_correlation_score
a, b, 0.7
c, d, 0.5
b, a, 0.7
d, c, 0.5
e, f, 0.8
f, e, 0.8")

# Table 2: correlation of each feature with the label
df2 <- read_csv("feature, label_correlation_score
a, 0.20
b, 0.15
c, 0.08
d, 0.04
e, 0.02
f, 0.02")

df1 %>% 
  # keep one row per unordered pair
  filter(feature1 < feature2) %>% 
  # look up the label score of each member of the pair
  left_join(df2, by = c("feature1" = "feature")) %>% 
  left_join(df2, by = c("feature2" = "feature"), suffix = c("_1", "_2")) %>% 
  # keep the higher score and the feature it belongs to (ties go to feature2)
  mutate(label_correlation_score = pmax(label_correlation_score_1, label_correlation_score_2),
         feature = if_else(label_correlation_score_1 > label_correlation_score_2, feature1, feature2)) %>% 
  select(feature, label_correlation_score)

This gives

# A tibble: 3 x 2
  feature label_correlation_score
  <chr>                     <dbl>
1 a                          0.2 
2 c                          0.08
3 f                          0.02
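
The question's edit also asks for a data.table equivalent. The following is only a sketch of one way to do it, not part of the original answer. The names dt1, dt2, pairs, f1, f2, score_1, score_2 and final are made up for illustration, and the sample data is typed in directly rather than read from CSV. Instead of relying on both orderings of a pair being present, it puts each pair into a canonical order with pmin()/pmax() and de-duplicates. It assumes every feature in Table 1 also appears in Table 2, and that your data.table version provides fifelse().

```r
library(data.table)

# Same sample data as in the question, built directly as data.tables
dt1 <- data.table(
  feature1 = c("a", "c", "b", "d", "e", "f"),
  feature2 = c("b", "d", "a", "c", "f", "e"),
  feature_correlation_score = c(0.7, 0.5, 0.7, 0.5, 0.8, 0.8)
)
dt2 <- data.table(
  feature = c("a", "b", "c", "d", "e", "f"),
  label_correlation_score = c(0.20, 0.15, 0.08, 0.04, 0.02, 0.02)
)

# Step 1: order each pair canonically so that (a, b) and (b, a) collapse to one row
pairs <- unique(dt1[, .(f1 = pmin(feature1, feature2),
                        f2 = pmax(feature1, feature2))])

# Step 2: look up the label score of each member of the pair (update joins)
pairs[dt2, on = .(f1 = feature), score_1 := i.label_correlation_score]
pairs[dt2, on = .(f2 = feature), score_2 := i.label_correlation_score]

# Step 3: keep the member with the higher score (in this sketch, ties go to f1)
final <- pairs[, .(
  feature = fifelse(score_1 >= score_2, f1, f2),
  label_correlation_score = pmax(score_1, score_2)
)]
final
```

On the sample data this should keep a, c and e. The tie between e and f resolves to e here because the comparison uses >=, which differs from the tidyverse version above but is equally acceptable according to the note in the question.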

Solution 1 (accepted, 2020-04-09 07:09:16)
