
Comparing two columns in a data frame for matches and from this creating a new data frame that contains the matches

Please can you help me again?

I have a data frame that contains 4 columns, each holding either a gene symbol or the rank I have assigned to that gene symbol, like this:

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  1        BIRCA    1        MYCN
[2]  2        MYCN     2        MOB4
[3]  3        ATXN1    3        ABHD17C
[4]  4        ABHD17C  4        AEBP2
... and so on, for up to 6000 rows in some data sets.
The ts columns are usually a lot longer than the mb columns.

I want to arrange the data so that non-duplicates are removed, leaving only genes that appear in both gene columns of the data frame, e.g.:

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  2        MYCN     1        MYCN
[2]  4        ABHD17C  3        ABHD17C

In this example of the desired outcome, the non-duplicated genes have been removed, leaving only the genes that appeared in both lists to begin with.

I have tried many things, like:

`df[df$mb_gene %in% df$ts_gene,]` 

but it doesn't work and seems to hit and miss some genes. I also attempted to write an IF function, but my skills are too limited.

I hope I have described this well enough, but if I can clarify anything please ask. I'm really stuck. Thanks in advance!

In a data.frame, a row is typically a complete observation, meaning that all data in it relates (somehow) to the rest of that row. In a survey, one row is either one person (all questions) or one question for one person. In your data here, though, the BIRCA and MYCN in your first row are completely separate, meaning you want to remove one without removing the other. From a "data-science-y" point of view, this suggests to me that your data is improperly shaped.

In order to do what you want, we need to split them into separate frames.
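A small sketch of why the row-wise filter from the question seems to "hit and miss", using the four example rows. The variable name `kept` is only for illustration. The filter keeps or drops whole rows based on mb_gene alone, so whatever ts_gene happens to sit on the same row comes along with it:

```r
df <- data.frame(
  mb_rank = 1:4,
  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
  ts_rank = 1:4,
  ts_gene = c("MYCN", "MOB4", "ABHD17C", "AEBP2"),
  stringsAsFactors = FALSE
)

# Keep whole rows whose mb_gene occurs anywhere in ts_gene
kept <- df[df$mb_gene %in% df$ts_gene, ]
kept
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       2    MOB4
# 4       4 ABHD17C       4   AEBP2
```

The mb side is correct here, but the ts side still contains MOB4 and AEBP2, which are not in the mb list, and the ts rows for MYCN and ABHD17C have been dropped.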

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

df1 <- df[,1:2]
df2 <- df[,3:4]
df1
#   mb_rank mb_gene
# 1       1   BIRCA
# 2       2    MYCN
# 3       3   ATXN1
# 4       4 ABHD17C
df2
#   ts_rank ts_gene
# 1       1    MYCN
# 2       2    MOB4
# 3       3 ABHD17C
# 4       4   AEBP2

From here, we can use intersect to find the genes in common:

incommon <- intersect(df1$mb_gene, df2$ts_gene)
df1[df1$mb_gene %in% incommon,]
#   mb_rank mb_gene
# 2       2    MYCN
# 4       4 ABHD17C
df2[df2$ts_gene %in% incommon,]
#   ts_rank ts_gene
# 1       1    MYCN
# 3       3 ABHD17C

If you are 100% certain that you will always have the same number of rows in each, then you can simply cbind these together:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[df2$ts_gene %in% incommon,]
)
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       1    MYCN
# 4       4 ABHD17C       3 ABHD17C
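Note that cbind pairs rows purely by position, so it also relies on the common genes appearing in the same order in both lists. A minimal sketch of one way to line the rows up by gene instead, using `match`. The reordered ts list below is a hypothetical variant of the example data, and the names `common1` and `common2` are only for illustration. It also assumes each gene occurs at most once per list, since `match` returns only the first hit:

```r
df1 <- data.frame(
  mb_rank = 1:4,
  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
  stringsAsFactors = FALSE
)
# Hypothetical ts list in which the shared genes appear in a different order
df2 <- data.frame(
  ts_rank = 1:4,
  ts_gene = c("ABHD17C", "MOB4", "MYCN", "AEBP2"),
  stringsAsFactors = FALSE
)

incommon <- intersect(df1$mb_gene, df2$ts_gene)
common1 <- df1[df1$mb_gene %in% incommon, ]
# For each gene kept from df1, look up the row holding that gene in df2
common2 <- df2[match(common1$mb_gene, df2$ts_gene), ]
aligned <- cbind(common1, common2)
aligned
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       3    MYCN
# 4       4 ABHD17C       1 ABHD17C
```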

However, if there is a chance that the two will have different numbers of rows, then you will run into problems. If the number of rows in one is a multiple of the other, you will get "recycling" of data and a warning, but you will still get data (which I think is a mistake):

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2
)
# Warning in data.frame(..., check.names = FALSE) :
#   row names were found from a short variable and have been discarded
#   mb_rank mb_gene ts_rank ts_gene
# 1       2    MYCN       1    MYCN
# 2       4 ABHD17C       2    MOB4
# 3       2    MYCN       3 ABHD17C
# 4       4 ABHD17C       4   AEBP2

If not a multiple, though, you'll just get an error:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[1:3,]
)
# Error in data.frame(..., check.names = FALSE) : 
#   arguments imply differing number of rows: 2, 3

I suggest that you rethink this storage structure, as I believe it violates assumptions that some tools make about the rows of a frame.

Use the following; df_new is your new data frame.
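One possible alternative shape, as a sketch only: a "long" frame with one row per gene per list, so that every row is a single observation. The column names `source`, `rank` and `gene`, and the variables `long`, `in_both` and `shared`, are assumptions made for illustration and do not come from the question:

```r
df <- data.frame(
  mb_rank = 1:4,
  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
  ts_rank = 1:4,
  ts_gene = c("MYCN", "MOB4", "ABHD17C", "AEBP2"),
  stringsAsFactors = FALSE
)

# Stack the two lists: one row per (source, rank, gene)
long <- rbind(
  data.frame(source = "mb", rank = df$mb_rank, gene = df$mb_gene,
             stringsAsFactors = FALSE),
  data.frame(source = "ts", rank = df$ts_rank, gene = df$ts_gene,
             stringsAsFactors = FALSE)
)

# Genes that occur in both sources
in_both <- intersect(long$gene[long$source == "mb"],
                     long$gene[long$source == "ts"])
shared <- long[long$gene %in% in_both, ]
shared
#   source rank    gene
# 2     mb    2    MYCN
# 4     mb    4 ABHD17C
# 5     ts    1    MYCN
# 7     ts    3 ABHD17C
```

In this shape the two lists can have different lengths without any padding or recycling issues.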

# keeps only rows where both genes match on the same row
df_new <- df[df$mb_gene == df$ts_gene, ]

Without more details, it's hard to know about edge cases. In any case, it sounds like a relational table join. Have you tried:

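library(dplyr)  # select() comes from dplyr
d1 <- select(df, c(mb_rank, mb_gene))
d2 <- select(df, c(ts_rank, ts_gene))
# inner join: one row per gene found in both lists
merge(d1, d2, by.x = "mb_gene", by.y = "ts_gene")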
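A base-R sketch of the same join on the example data from the question, with the columns picked by name so that dplyr is not needed. The variable name `joined` is only for illustration:

```r
df <- data.frame(
  mb_rank = 1:4,
  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
  ts_rank = 1:4,
  ts_gene = c("MYCN", "MOB4", "ABHD17C", "AEBP2"),
  stringsAsFactors = FALSE
)

d1 <- df[, c("mb_rank", "mb_gene")]
d2 <- df[, c("ts_rank", "ts_gene")]

# Inner join on the gene symbol; the key column keeps the name from d1
joined <- merge(d1, d2, by.x = "mb_gene", by.y = "ts_gene")
joined
#   mb_gene mb_rank ts_rank
# 1 ABHD17C       4       3
# 2    MYCN       2       1
```

Because merge matches on the gene symbol and not on row position, this should also work when the two lists have different lengths. By default the result is sorted by the key column and the gene appears only once per row.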
