How to filter results based on dependencies between two columns in one table and results from another table in R?

I am wondering if there is a way (a function or just a few elegant lines of code) to filter results given this troublesome data frame structure and column dependency.

I have a scenario where some features are highly correlated with each other (see Table 1).
I also have a separate table that lists another score for each individual feature (see Table 2).

Table 1:

feature1, feature2, feature_correlation_score  
a, b, 0.7      
c, d, 0.5  
b, a, 0.7   
d, c, 0.5     
e, f, 0.8  
f, e, 0.8 

Table 2:

feature, label_correlation_score       
a, 0.20    
b, 0.15    
c, 0.08   
d, 0.04  
e, 0.02   
f, 0.02    

What I want to do is:
(1) Identify each unique feature1 and feature2 pair (i.e. a, b and b, a are the same pair).
(2) Look up the label_correlation_score in Table 2 for each feature in a pair, and keep only the feature with the highest label_correlation_score within each unique pair.
(3) Store the results in a new table that looks like this:

Final table:

feature, label_correlation_score  
a, 0.20  
c, 0.08  
e, 0.02

Note: either e or f could be selected in the last row, because their label_correlation_score values are the same.

Thanks in advance!

Edit: I'm also interested in what the equivalent code using data.table would be.

If you are okay with using the tidyverse, here's one approach.

  • First, we keep only the rows for which feature1 is less than feature2 (compared alphabetically), thus removing duplicates. This assumes both orderings of each pair are always present.
  • Then, we join the label_correlation_score for both feature1 and feature2 (giving the columns the suffixes _1 and _2, respectively).
  • Then, we store the larger of the two scores in the label_correlation_score column and the corresponding feature in the feature column.
  • Finally, we keep only the feature and label_correlation_score columns.

library(tidyverse)

# Table 1: pairwise feature correlations (each pair appears in both orders)
df1 <- read_csv("feature1, feature2, feature_correlation_score
a, b, 0.7
c, d, 0.5
b, a, 0.7
d, c, 0.5
e, f, 0.8
f, e, 0.8")

# Table 2: correlation of each feature with the label
df2 <- read_csv("feature, label_correlation_score
a, 0.20
b, 0.15
c, 0.08
d, 0.04
e, 0.02
f, 0.02 ")

df1 %>% 
  # keep one ordering of each pair
  filter(feature1 < feature2) %>% 
  # attach the label score of feature1, then of feature2
  left_join(df2, by = c("feature1" = "feature")) %>% 
  left_join(df2, by = c("feature2" = "feature"), suffix = c("_1", "_2")) %>% 
  # keep the larger score and the feature it belongs to (ties go to feature2)
  mutate(label_correlation_score = pmax(label_correlation_score_1, label_correlation_score_2),
         feature = if_else(label_correlation_score_1 > label_correlation_score_2, feature1, feature2)) %>% 
  select(feature, label_correlation_score)

which gives

# A tibble: 3 x 2
  feature label_correlation_score
  <chr>                     <dbl>
1 a                          0.2 
2 c                          0.08
3 f                          0.02
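
Since the question's edit also asks about data.table, here is a minimal sketch of the same steps in that package. It is not part of the original answer. The names dt1, dt2, pairs, score_1, score_2 and result are made up for illustration, and the sample tables are typed in directly rather than read with read_csv. As above, it assumes both orderings of every pair are present.

```r
library(data.table)

# Sample data from the question, entered directly (illustrative names).
dt1 <- data.table(
  feature1 = c("a", "c", "b", "d", "e", "f"),
  feature2 = c("b", "d", "a", "c", "f", "e"),
  feature_correlation_score = c(0.7, 0.5, 0.7, 0.5, 0.8, 0.8)
)
dt2 <- data.table(
  feature = c("a", "b", "c", "d", "e", "f"),
  label_correlation_score = c(0.20, 0.15, 0.08, 0.04, 0.02, 0.02)
)

# Keep one ordering of each pair (assumes both orderings exist in dt1).
pairs <- dt1[feature1 < feature2]

# Update joins: look up the label score for each side of the pair.
pairs[dt2, on = .(feature1 = feature), score_1 := i.label_correlation_score]
pairs[dt2, on = .(feature2 = feature), score_2 := i.label_correlation_score]

# Keep the feature with the larger score; ties go to feature2,
# mirroring the if_else() in the tidyverse version.
result <- pairs[, .(
  feature = fifelse(score_1 > score_2, feature1, feature2),
  label_correlation_score = pmax(score_1, score_2)
)]

print(result)
```

With the sample data this should select a, c and f, matching the tidyverse result. If ties should go to feature1 instead, changing `>` to `>=` would pick e in the last row.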

