Comparing multiple data frames based on unique values in one column and finding overlapping values in second column in multiple data frames in R

I wanted to ask for advice on a problem I am having in trying to identify intersecting values in multiple data frames. In my mind this is a bit complex, and I can't figure out how to do it using the normal intersect function.

I have several data frames (up to 12) with multiple columns that show gene changes over time (for example, 5 time points) and how other genes correlate with this change (i.e., other genes that also go down or up in a manner that correlates with other genes in the data). The analysis takes each gene one at a time, uses that gene as a reference and tests every other gene against it to see if their pattern of change over time correlates with the reference gene. This is repeated for every single gene. So taking one data frame as an example, the results would appear as follows.

Column 1 contains the genes that serve as the reference gene; a value can occur multiple times if several other genes correlate with changes over time in this gene. For example, if genes b, c and d correlate with gene a, the first two columns show as follows:

a b
a c
a d

The same goes for gene b, and so on and so forth, 20,000 times (the number of genes)! Hope this makes sense?

b a
b c
b d

The analysis above is carried out on multiple different samples, so I will get up to 12 data frames, one per sample, each with results as detailed above.

Objective (and apologies in advance that I do not have code, as I am not entirely sure where to start!). I am thinking this might best be served by creating a function: for gene 'x' in column 1, in every single data frame, I would like to see if column 2 has overlapping values.

Taking the example above, multiple data frames may look like this:

df1
a b
a c
a d
df2
a d
a c
a e
df3
a d
a e
a f

So comparing the data frames, the function would identify that for gene a there is one column 2 value shared between all data frames: gene d, as it is common to all data frames for gene a.

Similarly, the function would carry out this overlap analysis for every single gene: gene a, b, c, etc.

The output would be, for every single gene in column 1, the column 2 values that occur for that gene across all the data frames.
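The per-gene comparison described above can be sketched with plain set intersection. This is only a minimal illustration in Python, not the asker's code: the helper name `shared_partners` is hypothetical, and each sample is assumed to be a list of (reference gene, correlated gene) pairs taken from columns 1 and 2.

```python
from collections import defaultdict


def shared_partners(samples):
    """Map each reference gene to the partners it has in every sample.

    `samples` is a non-empty list of samples; each sample is a list of
    (reference_gene, correlated_gene) pairs (columns 1 and 2).
    """
    per_sample = []
    for pairs in samples:
        partners = defaultdict(set)
        for ref, other in pairs:
            partners[ref].add(other)
        per_sample.append(partners)

    # Only genes that appear as a reference in every sample can overlap.
    common_refs = set.intersection(*(set(p) for p in per_sample))
    return {
        ref: set.intersection(*(p[ref] for p in per_sample))
        for ref in sorted(common_refs)
    }


# Toy data copied from the df1/df2/df3 example above.
df1 = [("a", "b"), ("a", "c"), ("a", "d")]
df2 = [("a", "d"), ("a", "c"), ("a", "e")]
df3 = [("a", "d"), ("a", "e"), ("a", "f")]

overlap = shared_partners([df1, df2, df3])
print(overlap)  # {'a': {'d'}}
```

A reference gene that is missing from any one sample is left out of the result, on the assumption that "overlap" means present in every data frame.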

I am pasting head(analysis):

            Feature1           Feature2 delay      pBefore       pAfter  corBefore
1 ENSMUSG00000001525 ENSMUSG00000026211     0 0.1093914984 0.1093914984  0.7161907
2 ENSMUSG00000001525 ENSMUSG00000055653    -1 0.0916478944 0.1047749696  0.7414240
3 ENSMUSG00000001525 ENSMUSG00000003038     0 0.0006810160 0.0006810160  0.9786161

plus many, many more genes in Feature1, each with the Feature2 genes associated with it.

This data frame would be one sample, and I would have a separate result for each of the other samples.

I would really appreciate any hints as to how to create code to achieve this goal. In addition, it would be nice to be able to specify that I would also like to see the overlap of genes restricted to rows with, for example, pBefore >= 0.8, or similarly for the delay column, etc.
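One way to picture the optional filter is to drop rows before the per-gene overlap is computed. The sketch below assumes each sample is a list of row dictionaries keyed by the column names from head(analysis); the function name `filtered_overlap`, the gene labels `g1` to `g4` and all numeric values are invented purely for illustration.

```python
def filtered_overlap(samples, keep):
    """Per Feature1 gene, intersect Feature2 values across samples,
    using only the rows for which keep(row) is true."""
    result = None
    for rows in samples:
        partners = {}
        for row in rows:
            if keep(row):
                partners.setdefault(row["Feature1"], set()).add(row["Feature2"])
        if result is None:
            result = partners
        else:
            # Keep only reference genes seen in every sample so far.
            result = {
                ref: result[ref] & partners[ref]
                for ref in result.keys() & partners.keys()
            }
    return result or {}


# Invented toy rows (not real results).
sample1 = [
    {"Feature1": "g1", "Feature2": "g2", "delay": 0, "pBefore": 0.90},
    {"Feature1": "g1", "Feature2": "g3", "delay": -1, "pBefore": 0.85},
    {"Feature1": "g1", "Feature2": "g4", "delay": 0, "pBefore": 0.10},
]
sample2 = [
    {"Feature1": "g1", "Feature2": "g2", "delay": 0, "pBefore": 0.95},
    {"Feature1": "g1", "Feature2": "g3", "delay": 0, "pBefore": 0.20},
    {"Feature1": "g1", "Feature2": "g4", "delay": 0, "pBefore": 0.99},
]

high_p = filtered_overlap([sample1, sample2], lambda r: r["pBefore"] >= 0.8)
zero_delay = filtered_overlap([sample1, sample2], lambda r: r["delay"] == 0)
print(high_p)  # {'g1': {'g2'}}
```

Passing the condition as a function keeps the overlap logic unchanged whether the filter is on pBefore, delay, or any other column.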

Many thanks for taking the time to read this!

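If I understand correctly, you can stack all 12 data frames into one. The code here uses pandas, with `A` standing for the reference-gene column and `B` for the correlated-gene column:

   import pandas as pd

   # list all 12 data frames here
   df_final = pd.concat([df1,df2.....df12])

Then count how many times each pair of genes occurs across the 12 data frames:

   df_n = df_final.groupby(['A', 'B']).size().reset_index(name='count')

As there are 12 data frames, and assuming each pair appears at most once within a single data frame,

   df_n[df_n['count'] == 12]

will give you the pairs of genes that are present in all 12 data frames.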
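The concat, group and filter steps can be combined into one runnable sketch. It uses toy versions of df1, df2 and df3 from the question instead of 12 real data frames, and the `sample_id` column is an addition not in the answer: it lets the count be the number of distinct samples per pair rather than the number of rows, so a pair repeated inside one data frame is not counted twice.

```python
import pandas as pd

# Toy versions of df1, df2, df3 from the question; the column names
# "A" and "B" follow the answer and are placeholders.
df1 = pd.DataFrame({"A": ["a", "a", "a"], "B": ["b", "c", "d"]})
df2 = pd.DataFrame({"A": ["a", "a", "a"], "B": ["d", "c", "e"]})
df3 = pd.DataFrame({"A": ["a", "a", "a"], "B": ["d", "e", "f"]})
frames = [df1, df2, df3]

# Tag each row with the sample it came from, then stack everything.
df_final = pd.concat(
    [df.assign(sample_id=i) for i, df in enumerate(frames)],
    ignore_index=True,
)

# Count distinct samples per (A, B) pair.
df_n = (
    df_final.groupby(["A", "B"])["sample_id"]
    .nunique()
    .reset_index(name="count")
)

# Keep the pairs seen in every data frame.
in_all = df_n[df_n["count"] == len(frames)]
print(in_all[["A", "B"]].values.tolist())  # [['a', 'd']]
```

Comparing against `len(frames)` instead of a hard-coded 12 means the same code works for any number of samples.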
