How to retrieve a value in one data frame by matching a string within an entire column of another data frame?

Say I have a data frame df1 like the one below:

> df1
          probe                         OMIM
1  1565034_s_at                       601464
2     201000_at 601065 /// 613287 /// 616339
3     204565_at                       615652
4     205355_at            600301 /// 610006
5   205734_s_at                       601464
6   205735_s_at                       601464
7     206527_at            137150 /// 613163
8     209173_at                       606358
9   209459_s_at            137150 /// 613163
10    209460_at            137150 /// 613163
11    215465_at                             
12    223864_at                       610856
13    224742_at            612674 /// 613599

And a second data frame, df2:

> df2
                                         platprobe   symbol
1   1565034_s_at,205734_s_at,242078_at,205735_s_at     AFF3
2                                        201000_at     AARS
3                                        201884_at   DNALI1
4                                      202779_s_at     PLK1
5                                        204565_at   ACOT13
6                              205355_at,226030_at   ACADSB
7      205808_at,207284_s_at,209135_at,210896_s_at   LIMCH1
8      206164_at,206165_s_at,206166_s_at,217528_at   SLC7A8
9                  206527_at,209459_s_at,209460_at     ABAT
10                             209173_at,228969_at     AGR2
11                                       215465_at   ABCA12
12                                     221024_s_at  TMEM144
13                                       223864_at ANKRD30A
14                 224742_at,228123_s_at,228124_at   ABHD12
15                           225421_at,225431_x_at   GALNT7
16                                       226120_at    PSAT1
17                                       228241_at     AGR3

I would like to add a new column to df1, df1$symbol, by matching each df1$probe value against df2$platprobe. The result should be this:

> df1
          probe                         OMIM    symbol
1  1565034_s_at                       601464      AFF3
2     201000_at 601065 /// 613287 /// 616339      AARS
3     204565_at                       615652    ACOT13
4     205355_at            600301 /// 610006    ACADSB
5   205734_s_at                       601464      AFF3
6   205735_s_at                       601464      AFF3
7     206527_at            137150 /// 613163      ABAT
8     209173_at                       606358      AGR2
9   209459_s_at            137150 /// 613163      ABAT
10    209460_at            137150 /// 613163      ABAT
11    215465_at                                 ABCA12
12    223864_at                       610856  ANKRD30A
13    224742_at            612674 /// 613599    ABHD12

The challenging part for me is that df2$platprobe in many cases contains several probe IDs in addition to the one found in df1$probe. So, if I try:

# This only finds exact matches (rows where df2$platprobe holds a single probe ID, such as the one for ABCA12),
# so it cannot fill every row of df1:
df1$symbol <- df2$symbol[df2$platprobe %in% df1$probe]

# And using 'grepl' doesn't work either.
# (I used 'unlist' and 'strsplit' because I thought that breaking all the values
# in df2$platprobe into a single vector might work. But it doesn't.)

df1$symbol <- df2$symbol[grepl(df1$probe, unlist(strsplit(paste(df2$platprobe, sep=",", collapse=","), ",")))]

Any help is much appreciated.

PS: If you have a better idea for a more on-topic title, it is very welcome.

Update: Thank you, @Anoushiravan R., and sorry for not posting reproducible data frames earlier. Here they are:

df1 <- data.frame(probe=c("1565034_s_at", "201000_at", "204565_at", 
"205355_at", "205734_s_at", "205735_s_at", "206527_at", "209173_at", 
"209459_s_at", "209460_at", "215465_at", "223864_at", "224742_at"
), OMIM = c("601464", "601065 /// 613287 /// 616339", "615652", 
"600301 /// 610006", "601464", "601464", "137150 /// 613163", 
"606358", "137150 /// 613163", "137150 /// 613163", "", "610856", 
"612674 /// 613599"))
df2 <- data.frame(platprobe = c("1565034_s_at, 205734_s_at, 205735_s_at, 
227198_at, 242078_at, 243967_at", "201000_at", "201884_at", "202779_s_at",
"204565_at", "205355_at,226030_at", "205808_at, 207284_s_at, 209135_at, 
210896_s_at, 224996_at, 225008_at, 242037_at", "206164_at, 206165_s_at, 
206166_s_at, 217528_at", "206527_at, 209459_s_at,209460_at", "209173_at, 
228969_at", "215465_at", "221024_s_at", "223864_at","224742_at, 228123_s_at, 
228124_at", "225421_at,225431_x_at", "226120_at", "228241_at"), symbol=c("AFF3", 
"AARS", "DNALI1", "PLK1", "ACOT13", "ACADSB", "LIMCH1", "SLC7A8", "ABAT", "AGR2", 
"ABCA12", "TMEM144", "ANKRD30A", "ABHD12", "GALNT7", "PSAT1", "AGR3"))

You can use the following solution:

library(dplyr)
library(stringr)
library(purrr)

df1 %>%
  mutate(symbol = map_chr(probe, ~ df2$symbol[which(str_detect(df2$platprobe, .x))]))


          probe                         OMIM   symbol
1  1565034_s_at                       601464     AFF3
2     201000_at 601065 /// 613287 /// 616339     AARS
3     204565_at                       615652   ACOT13
4     205355_at            600301 /// 610006   ACADSB
5   205734_s_at                       601464     AFF3
6   205735_s_at                       601464     AFF3
7     206527_at            137150 /// 613163     ABAT
8     209173_at                       606358     AGR2
9   209459_s_at            137150 /// 613163     ABAT
10    209460_at            137150 /// 613163     ABAT
11    215465_at                                ABCA12
12    223864_at                       610856 ANKRD30A
13    224742_at            612674 /// 613599   ABHD12
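One caveat with the approach above: map_chr() expects exactly one value per probe, so a probe with no match (or with matches in several rows of df2) raises an error, and str_detect() matches substrings rather than whole probe IDs. The following is a minimal sketch of a more defensive variant, not part of the original answer. The toy data and the helper name lookup_symbol are made up for illustration.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data modelled on the question's df1 and df2
df1 <- data.frame(probe = c("205734_s_at", "209460_at", "999999_at"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(platprobe = c("1565034_s_at, 205734_s_at", "206527_at,209460_at"),
                  symbol = c("AFF3", "ABAT"),
                  stringsAsFactors = FALSE)

# Split every platprobe entry into a vector of trimmed probe IDs, once
probe_sets <- lapply(strsplit(df2$platprobe, ","), trimws)

# lookup_symbol() is a hypothetical helper: it compares whole IDs,
# returns the first matching symbol, and returns NA when nothing matches
lookup_symbol <- function(p) {
  hit <- which(map_lgl(probe_sets, ~ p %in% .x))
  if (length(hit) == 0) NA_character_ else df2$symbol[hit[1]]
}

res <- df1 %>% mutate(symbol = map_chr(probe, lookup_symbol))
```

Taking only the first hit is an assumption; if a probe can legitimately belong to several symbols, you would need to decide how to combine them.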

The answer above serves the purpose, but here is a way to do it without purrr:

library(dplyr)
library(tidyr)
library(stringr)

df1 %>% left_join(df2 %>% separate_rows(platprobe, sep = ',') %>%
                    mutate(platprobe = str_trim(platprobe)), by = c('probe' = 'platprobe'))

          probe                         OMIM   symbol
1  1565034_s_at                       601464     AFF3
2     201000_at 601065 /// 613287 /// 616339     AARS
3     204565_at                       615652   ACOT13
4     205355_at            600301 /// 610006   ACADSB
5   205734_s_at                       601464     AFF3
6   205735_s_at                       601464     AFF3
7     206527_at            137150 /// 613163     ABAT
8     209173_at                       606358     AGR2
9   209459_s_at            137150 /// 613163     ABAT
10    209460_at            137150 /// 613163     ABAT
11    215465_at                                ABCA12
12    223864_at                       610856 ANKRD30A
13    224742_at            612674 /// 613599   ABHD12

Another way to approach your problem builds on your observation that the matching key is "collapsed" in the second data frame.

{tidyr} has a great function for splitting nested values into new rows, namely tidyr::separate_rows(). This will turn your second data frame into long format.

Note: separate_rows() can split several columns at once if needed, but here we only use your key, platprobe.

library(dplyr)   # data crunching 
library(tidyr)   # data manipulation for generating tidy df

# separate the nested column values into rows;
# the pattern also drops any whitespace that follows a comma
df2 %>% separate_rows(platprobe, sep = ",\\s*")

Checking the row spread:

# A tibble: 33 x 2
   platprobe    symbol
   <chr>        <chr> 
 1 1565034_s_at AFF3  
 2 205734_s_at  AFF3  
 3 242078_at    AFF3  
 4 205735_s_at  AFF3  
 5 201000_at    AARS  
...

The matching keys are now properly aligned, so you can use left_join() to merge both data frames.

# merging the "long" lookup df2 with df1
df1 %>% left_join(
     df2 %>% separate_rows(platprobe, sep = ",\\s*")   # also drops whitespace after commas
   , by = c("probe" = "platprobe")    # define matching keys in df1 and df2
)

This delivers:

          probe   symbol
1  1565034_s_at     AFF3
2     201000_at     AARS
3     204565_at   ACOT13
4     205355_at   ACADSB
...
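Before relying on the join, it may be worth checking two things: that no probe ID appears under more than one symbol (which would duplicate rows of df1 in the result), and which probes have no partner at all. The sketch below uses hypothetical toy data and made-up object names (lookup, dup_keys, unmatched); it is one way to run those checks, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data modelled on the question's df1 and df2
df1 <- data.frame(probe = c("205734_s_at", "209460_at", "999999_at"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(platprobe = c("1565034_s_at, 205734_s_at", "206527_at,209460_at"),
                  symbol = c("AFF3", "ABAT"),
                  stringsAsFactors = FALSE)

# Long-format lookup table: one row per probe ID
lookup <- df2 %>% separate_rows(platprobe, sep = ",\\s*")

# Probe IDs listed more than once would duplicate rows in a left_join()
dup_keys <- lookup %>% count(platprobe) %>% filter(n > 1)

# Rows of df1 whose probe has no entry in the lookup table
unmatched <- df1 %>% anti_join(lookup, by = c("probe" = "platprobe"))
```

With this toy data dup_keys is empty and unmatched holds the single made-up probe 999999_at; in the joined result such rows simply get NA in symbol.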

In case you want to use grep for matching, you can do this via sapply or lapply.

df1$symbol <- df2$symbol[sapply(df1$probe, grep, df2$platprobe)]

df1
#          probe                         OMIM   symbol
#1  1565034_s_at                       601464     AFF3
#2     201000_at 601065 /// 613287 /// 616339     AARS
#3     204565_at                       615652   ACOT13
#4     205355_at            600301 /// 610006   ACADSB
#5   205734_s_at                       601464     AFF3
#6   205735_s_at                       601464     AFF3
#7     206527_at            137150 /// 613163     ABAT
#8     209173_at                       606358     AGR2
#9   209459_s_at            137150 /// 613163     ABAT
#10    209460_at            137150 /// 613163     ABAT
#11    215465_at                                ABCA12
#12    223864_at                       610856 ANKRD30A
#13    224742_at            612674 /// 613599   ABHD12
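The one-liner above relies on every probe matching exactly one row of df2; if a probe has no match, sapply() returns a list and the indexing fails. As a hedged alternative in base R, you could build a long lookup table with strsplit() and use match(), which compares whole IDs and yields NA for missing probes. The toy data and the names ids and lookup below are made up for illustration.

```r
# Hypothetical toy data modelled on the question's df1 and df2
df1 <- data.frame(probe = c("205734_s_at", "209460_at", "999999_at"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(platprobe = c("1565034_s_at, 205734_s_at", "206527_at,209460_at"),
                  symbol = c("AFF3", "ABAT"),
                  stringsAsFactors = FALSE)

# One vector of trimmed probe IDs per row of df2
ids <- lapply(strsplit(df2$platprobe, ","), trimws)

# Long lookup table: repeat each symbol once per probe ID in its row
lookup <- data.frame(probe = unlist(ids),
                     symbol = rep(df2$symbol, lengths(ids)),
                     stringsAsFactors = FALSE)

# match() returns the position of the first exact match, or NA if there is none
df1$symbol <- lookup$symbol[match(df1$probe, lookup$probe)]
```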
