
Join two data frames by searching & matching strings

I have two data frames:

df1

+-------+---------+  
|   Id  |  Title  |
+-------+---------+  
|   1   |   AAA   |
|   2   |   BBB   |
|   3   |   CCC   |
+-------+---------+

df2

+-------+---------------+------------------------------------+
|   Id  |      Sub      |               Body                 |
+-------+---------------+------------------------------------+  
|   1   |   some sub1   | some mail body AAA some text here  |
|   2   |   some sub2   | some text here BBB continues here  |
|   3   |   some sub3   | some text AAA present here         |
|   4   |   some sub4   | AAA string is present here also    |
|   5   |   some sub5   | CCC string is present here         |
+-------+---------------+------------------------------------+

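I want to match the Title column in df1 against the Body column in df2.
If the title string is present in the Body column, the two rows should be joined, and the output data frame should look like this: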

df3

+----------+---------------+------------------------------------+
|   Title  |      Sub      |               Body                 |
+----------+---------------+------------------------------------+  
|   AAA    |   some sub1   | some mail body AAA some text here  |
|   BBB    |   some sub2   | some text here BBB continues here  |
|   AAA    |   some sub3   | some text AAA present here         |
|   AAA    |   some sub4   | AAA string is present here also    |
|   CCC    |   some sub5   | CCC string is present here         |
+----------+---------------+------------------------------------+

One solution might look like this, though experienced R users may come up with a better answer:

# set up test data
df1 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:3,
                  title = c('AAA', 'BBB', 'CCC'))
df2 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:5,
                  sub = c('some sub1', 'some sub2', 'some sub3', 'some sub4', 'some sub5'),
                  body = c('some mail body AAA some text here',
                           'some text here BBB continues here',
                           'some text AAA present here',
                           'AAA string is present here also',
                           'CCC string is present here'))

# join data frames: for each title, take the rows of df2 whose body
# contains it and prepend the title as a new column
df.list <- lapply(seq_len(nrow(df1)), function (idx) cbind(title=df1$title[idx], df2[grepl(df1$title[idx], df2$body), 2:3]))
do.call('rbind', df.list)

This produces the following output:

  title       sub                              body
1   AAA some sub1 some mail body AAA some text here
3   AAA some sub3        some text AAA present here
4   AAA some sub4   AAA string is present here also
2   BBB some sub2 some text here BBB continues here
5   CCC some sub5        CCC string is present here
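One thing to keep in mind: grepl treats its pattern as a regular expression by default, so a title containing characters such as "." could match more than intended. The sketch below uses made-up titles and bodies (they are not part of the question's data) to show the difference that fixed = TRUE makes, assuming titles are meant to be matched literally:

```r
# Hypothetical data, for illustration only
title  <- "A.A"
bodies <- c("text AxA here", "text A.A here", "nothing relevant")

# Default: "." is a regex wildcard, so "AxA" matches as well
regex_hits <- grepl(title, bodies)

# fixed = TRUE: the title is matched as a literal string
literal_hits <- grepl(title, bodies, fixed = TRUE)

print(regex_hits)
print(literal_hits)
```

For plain alphanumeric titles like 'AAA' both calls behave the same, so this only matters if real titles may contain punctuation.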

Update in response to a comment:

If we cannot rely on every title matching at least one row in df2, then you may want to do something like this:

# set up test data ('AAA BB' does not occur in any body)
df1 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:4,
                  title = c('AAA', 'AAA BB', 'BBB', 'CCC'))
df2 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:5,
                  sub = c('some sub1', 'some sub2', 'some sub3', 'some sub4', 'some sub5'),
                  body = c('some mail body AAA some text here',
                           'some text here BBB continues here',
                           'some text AAA present here',
                           'AAA string is present here also',
                           'CCC string is present here'))

# return the matching rows of df2 for one title, or NULL if there are none
MergeByTitle <- function(title.idx) {
  df2.hits <- df2[grepl(df1$title[title.idx], df2$body), 2:3]
  if (nrow(df2.hits) > 0) {
    cbind(title=df1$title[title.idx], df2.hits)
  } else {
    NULL  # skipped by rbind below
  }
}

# join data frames
df.list <- lapply(seq_len(nrow(df1)), MergeByTitle)
do.call('rbind', df.list)
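As an alternative to the per-title loop, the same join can be sketched as a cross join followed by a filter. This is a different technique from the answer above, not a restatement of it: merge with by = NULL builds every title/body pair, and only the pairs where the title occurs in the body are kept. Titles without any match drop out naturally. The names pairs, hit and df3 are chosen here for illustration, and fixed = TRUE assumes literal matching is wanted:

```r
df1 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:4,
                  title = c('AAA', 'AAA BB', 'BBB', 'CCC'))
df2 <- data.frame(stringsAsFactors = FALSE,
                  id = 1:5,
                  sub = c('some sub1', 'some sub2', 'some sub3', 'some sub4', 'some sub5'),
                  body = c('some mail body AAA some text here',
                           'some text here BBB continues here',
                           'some text AAA present here',
                           'AAA string is present here also',
                           'CCC string is present here'))

# Cross join: by = NULL yields the Cartesian product of the two inputs
pairs <- merge(df1['title'], df2[c('sub', 'body')], by = NULL)

# One grepl call per pair: is this title contained in this body?
hit <- mapply(grepl, pairs$title, pairs$body,
              MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# Keep only the matching pairs and reset the row names
df3 <- pairs[hit, ]
rownames(df3) <- NULL
print(df3)
```

The cross join creates nrow(df1) * nrow(df2) rows before filtering, so this is convenient for small inputs but may use too much memory for large ones, where the lapply version is likely the safer choice.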
