Compare two columns of a data frame for matches and create a new data frame containing the matches

Question

Please can you help me again?

I have a data frame that contains 4 columns, each of which is either a gene symbol or a rank that I have assigned to the gene symbol, like this:

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  1        BIRCA    1        MYCN
[2]  2        MYCN     2        MOB4
[3]  3        ATXN1    3        ABHD17C
[4]  4        ABHD17C  4        AEBP2
and so on, for up to 6000 rows in some data sets.
The ts columns are usually a lot longer than the mb columns.

I want to arrange the data so that non-duplicates are removed, leaving only the genes that appear in both gene columns of the data frame, e.g.:

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  2        MYCN     1        MYCN
[2]  4        ABHD17C  3        ABHD17C

In this example of the desired outcome, the non-duplicated genes have been removed, leaving only the genes that appeared in both lists to begin with.

I have tried many things, like:

`df[df$mb_gene %in% df$ts_gene,]`

but it doesn't work and seems to be hit-and-miss for some genes. I also attempted to write an IF function, but my skills are too limited.

I hope I have described this well enough, but if I can clarify anything please ask. I'm really stuck. Thanks in advance!

Answer 1

In a data.frame, a row is typically a complete observation, meaning that all the data in it relates (somehow) to the rest of the row. In a survey, one row is either one person (all questions) or one question for one person. In your data here, though, BIRCA and MYCN in your first row are completely separate, meaning you want to remove one without removing the other. In a "data-science-y" view, this suggests to me that your data is improperly shaped.

In order to do what you want, we need to split the columns into separate frames.
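To make the point concrete, here is a minimal sketch of what the filter from the question does on the four sample rows. The name `attempt` is only illustrative. Because subsetting keeps whole rows, the ts half of each kept row is whatever happened to sit beside the matching mb gene:

```r
# Sample data copied from the question
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

# The filter from the question: keeps whole rows whose mb_gene
# occurs anywhere in ts_gene
attempt <- df[df$mb_gene %in% df$ts_gene, ]
print(attempt)

# The mb half is correct, but the ts genes carried along in those rows
# are not shared with the mb list at all
setdiff(attempt$ts_gene, df$mb_gene)
# "MOB4" "AEBP2"
```

This is probably why the approach looks hit-and-miss: the mb columns are filtered correctly, while the ts columns are not filtered at all.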

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

df1 <- df[,1:2]
df2 <- df[,3:4]
df1
#   mb_rank mb_gene
# 1       1   BIRCA
# 2       2    MYCN
# 3       3   ATXN1
# 4       4 ABHD17C
df2
#   ts_rank ts_gene
# 1       1    MYCN
# 2       2    MOB4
# 3       3 ABHD17C
# 4       4   AEBP2

From here, we can use intersect to find the genes in common:

incommon <- intersect(df1$mb_gene, df2$ts_gene)
df1[df1$mb_gene %in% incommon,]
#   mb_rank mb_gene
# 2       2    MYCN
# 4       4 ABHD17C
df2[df2$ts_gene %in% incommon,]
#   ts_rank ts_gene
# 1       1    MYCN
# 3       3 ABHD17C

If you are 100% certain that you will always have the same number of rows in each, then you can simply cbind these together:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[df2$ts_gene %in% incommon,]
)
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       1    MYCN
# 4       4 ABHD17C       3 ABHD17C

However, if there is a chance that the two will have different numbers of rows, then you will run into problems. If the number of rows in one is a multiple of the number in the other, you will get "recycling" of data and a warning, but you will still get data (which I think is a mistake):

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2
)
# Warning in data.frame(..., check.names = FALSE) :
#   row names were found from a short variable and have been discarded
#   mb_rank mb_gene ts_rank ts_gene
# 1       2    MYCN       1    MYCN
# 2       4 ABHD17C       2    MOB4
# 3       2    MYCN       3 ABHD17C
# 4       4 ABHD17C       4   AEBP2

If it is not a multiple, though, you'll just get an error:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[1:3,]
)
# Error in data.frame(..., check.names = FALSE) : 
#   arguments imply differing number of rows: 2, 3
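One possible way to avoid depending on row counts and row order is to look up each common gene by name in both halves before binding. The sketch below is not part of the approach above. It assumes each gene appears at most once per list, because match() returns only the first position, and the name `aligned` is illustrative:

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")
df1 <- df[, 1:2]
df2 <- df[, 3:4]

# Genes present in both lists (assumed unique within each list)
incommon <- intersect(df1$mb_gene, df2$ts_gene)

# match() gives the row position of each common gene in each frame, so
# both halves have the same length and the same gene order
aligned <- cbind(
  df1[match(incommon, df1$mb_gene), ],
  df2[match(incommon, df2$ts_gene), ]
)
rownames(aligned) <- NULL  # drop the leftover row names from df1
aligned
```

Both pieces are indexed by the same vector of genes, so each output row pairs a gene with itself regardless of where it sat in the two original lists.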

I suggest that you rethink this storage structure, as I believe it breaks assumptions that some tools make about the rows of a frame.
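As one illustration of a different shape, the two lists could be stacked into a single "long" frame with one row per gene per list. This is only a sketch of one option: the column names `source`, `rank` and `gene` and the variable names are assumptions, not something taken from the question.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

# One row per (list, gene) observation; column names are illustrative
long <- rbind(
  data.frame(source = "mb", rank = df$mb_rank, gene = df$mb_gene,
             stringsAsFactors = FALSE),
  data.frame(source = "ts", rank = df$ts_rank, gene = df$ts_gene,
             stringsAsFactors = FALSE)
)

# Count how many distinct lists each gene appears in
n_lists <- tapply(long$source, long$gene, function(x) length(unique(x)))
shared <- names(n_lists)[n_lists > 1]

# Keep only the rows for genes seen in both lists
shared_rows <- long[long$gene %in% shared, ]
shared_rows
```

In this shape every row is a complete observation (which list, which rank, which gene), so filtering rows never drags unrelated data along.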

Answer 2

Use the following, where df_new is your new data frame:

df_new = df[df['mb_gene'] == df['ts_gene']]

Answer 3

Without more details, it's hard to know about edge cases. In any case, it sounds like a relational table join. Have you tried:

library(dplyr)  # provides select()
d1 = select(df, c(mb_rank, mb_gene))
d2 = select(df, c(ts_rank, ts_gene))
# Inner join: keeps only the genes present in both halves
merge(d1, d2, by.x="mb_gene", by.y="ts_gene")
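For reference, the same join can be sketched in base R only, using the sample data from the question (select() comes from dplyr, so plain column indexing is swapped in here; the name `joined` is illustrative). By default merge() performs an inner join and sorts the result by the key column:

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

# Split into the two halves using base-R column indexing
d1 <- df[, c("mb_rank", "mb_gene")]
d2 <- df[, c("ts_rank", "ts_gene")]

# Inner join on the gene symbol: only genes found in both halves survive
joined <- merge(d1, d2, by.x = "mb_gene", by.y = "ts_gene")
joined
#   mb_gene mb_rank ts_rank
# 1 ABHD17C       4       3
# 2    MYCN       2       1
```

Note that the result has a single gene column (named after by.x) rather than the two identical gene columns shown in the question's desired output, and that the rows are ordered by gene symbol rather than by rank.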

3 answers

Answer 1: score 1, accepted, 2020-04-16 15:47:42
Answer 2: score 0, 2020-04-16 15:50:33
Answer 3: score 0, 2020-04-16 15:59:43
