How to merge multiple dataframes with different column names

I have two data frames, 'df1' and 'df2'. df1 has the following column:

Date

and df2 has the following columns:

Date.1, USD.Price, Date.2, EUR.Price, Date.3, JPY.Price, Date.4, INR.Price

where Date, Date.1, Date.2, Date.3, Date.4, ... are in date format.

Now I want to merge Date.1 and USD.Price with df1, matching df1$Date against df2$Date.1:

df3 = merge(df1, df2[,1:2],  by.x = "Date", by.y = "Date.1", all = TRUE)

Then,

df4 = merge(df3, df2[,3:4],  by.x = "Date", by.y = "Date.2", all = TRUE)

Then again,

df5 = merge(df4, df2[,5:6],  by.x = "Date", by.y = "Date.3", all = TRUE)

Furthermore,

df6 = merge(df5, df2[,7:8],  by.x = "Date", by.y = "Date.4", all = TRUE)

and so on for all 1000 such columns.

For example, let's say I have the following data frames:

df1:

Date
2009-10-13
2009-10-14
2009-10-16
2009-10-18
2009-10-19
2009-10-20
2009-10-21
2009-10-22

and df2:

Date.1      USD.Price   Date.2      EUR.Price   Date.3      JPY.Price   Date.4      INR.Price
2009-10-13  21.6        NA          NA          NA          NA          NA          NA
2009-10-14  21.9        2009-10-14  78.2        NA          NA          NA          NA
2009-10-16  22.0        2009-10-16  78.5        NA          NA          2009-10-16  12.2
NA          NA          2009-10-18  78.9        2009-10-18  32.1        2009-10-18  12.4
NA          NA          NA          NA          2009-10-19  32.6        2009-10-19  12.2

Then the output needs to be:

Date           USD.Price    EUR.Price    JPY.Price    INR.Price
2009-10-13     21.6         NA           NA           NA
2009-10-14     21.9         78.2         NA           NA
2009-10-16     22.0         78.5         NA           12.2
2009-10-18     NA           78.9         32.1         12.4
2009-10-19     NA           NA           32.6         12.2

I found a related question: How can I merge multiple dataframes with the same column names?

But in my case the column names differ: Date.1, Date.2, Date.3, etc.

Can anyone please help me do this for around 1000 columns? Doing it as above is not scalable for that many columns.

Thanks

You can try a recursive function (a function that calls itself).

It takes two data.frames and a column index. It merges the data.frames on the first column of df1 and the column of df2 at position idx, using only the pair of df2 columns idx and idx + 1. It then calls itself with the merged data.frame dfx and df2, advancing idx by 2, for as long as idx is less than the number of columns in df2 minus 1.

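merge_df <- function(df1, df2, idx) {

  # Merge the accumulated result with one (key, value) pair of df2 columns.
  # merge() defaults to an inner join, so only rows with matching keys are kept.
  dfx <- merge(df1, df2[, idx:(idx + 1)], by.x = names(df1)[1], 
               by.y = names(df2)[idx])

  # Recurse while another pair of columns remains.
  if (idx < ncol(df2) - 1) {
    return(merge_df(dfx, df2, idx + 2))
  } else {
    return(dfx)
  }
}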

You can use it like this:

df1 <- data.frame(id = 1:10)
df2 <- data.frame(id1 = 1:10,
                  test1 = letters[1:10],
                  id2 = 1:10,
                  test2 = LETTERS[1:10])


df <- merge_df(df1, df2, 1)

This gives the following result:

head(df, 10)
   id test1 test2
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D
5   5     e     E
6   6     f     F
7   7     g     G
8   8     h     H
9   9     i     I
10 10     j     J
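
The same pairwise merging can also be written as a fold with base R's Reduce, which avoids deep recursion when there are many column pairs. The following is only a sketch on the question's example data, not the answer author's code: merge_pairs is a hypothetical helper name, and it assumes that date and value columns strictly alternate in df2, that rows with a missing date should be skipped for that pair, and that every date in df1 should be kept (all.x = TRUE). Dates are stored as character strings to keep the example short.

```r
# Example data from the question (dates as character strings for brevity)
df1 <- data.frame(
  Date = c("2009-10-13", "2009-10-14", "2009-10-16", "2009-10-18",
           "2009-10-19", "2009-10-20", "2009-10-21", "2009-10-22"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Date.1    = c("2009-10-13", "2009-10-14", "2009-10-16", NA, NA),
  USD.Price = c(21.6, 21.9, 22.0, NA, NA),
  Date.2    = c(NA, "2009-10-14", "2009-10-16", "2009-10-18", NA),
  EUR.Price = c(NA, 78.2, 78.5, 78.9, NA),
  Date.3    = c(NA, NA, NA, "2009-10-18", "2009-10-19"),
  JPY.Price = c(NA, NA, NA, 32.1, 32.6),
  Date.4    = c(NA, NA, "2009-10-16", "2009-10-18", "2009-10-19"),
  INR.Price = c(NA, NA, 12.2, 12.4, 12.2),
  stringsAsFactors = FALSE
)

# merge_pairs is a hypothetical helper, not part of the original answer.
merge_pairs <- function(left, wide) {
  # Assumption: columns alternate date, value, date, value, ...
  date_idx <- seq(1, ncol(wide), by = 2)
  Reduce(function(acc, i) {
    # Keep only rows that actually have a date for this pair; NA keys
    # would otherwise match each other in merge().
    pair <- wide[!is.na(wide[[i]]), c(i, i + 1)]
    merge(acc, pair, by.x = names(acc)[1], by.y = names(wide)[i],
          all.x = TRUE)
  }, date_idx, left)
}

merged <- merge_pairs(df1, df2)
```

Because all.x = TRUE is used, dates that appear only in df1 (such as 2009-10-20) stay in the result with NA prices; dropping that argument would keep only dates present in every pair.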

You could do this...

datecols <- grep("Date", names(df)) #get date columns

dfDates <- apply(df[,datecols], 1, function(x) x[!is.na(x)][1]) #vector of dates

df2 <- cbind(Date=dfDates, df[,-datecols]) #bind dates to non-date columns

df2
        Date USD.Price EUR.Price JPY.Price INR.Price
1 2009-10-13      21.6        NA        NA        NA
2 2009-10-14      21.9      78.2        NA        NA
3 2009-10-16      22.0      78.5        NA      12.2
4 2009-10-18        NA      78.9      32.1      12.4
5 2009-10-19        NA        NA      32.6      12.2
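
Taking the first non-missing date per row only works if all the dates within a row agree, as they do in the question's example. A small sketch of how that assumption might be checked before relying on the shortcut is shown here; the data frame wide and the variable names are illustrative and not from the original answer.

```r
# Small example in the same layout as the question's df2
# (dates as character strings; 'wide' is an illustrative name)
wide <- data.frame(
  Date.1    = c("2009-10-13", "2009-10-14", NA),
  USD.Price = c(21.6, 21.9, NA),
  Date.2    = c(NA, "2009-10-14", "2009-10-16"),
  EUR.Price = c(NA, 78.2, 78.5),
  stringsAsFactors = FALSE
)

date_cols <- grep("^Date", names(wide))
dates <- as.matrix(wide[, date_cols])  # character matrix, one row per record

# TRUE for a row when all of its non-missing dates are identical
row_is_consistent <- apply(dates, 1, function(x) {
  length(unique(x[!is.na(x)])) <= 1
})
stopifnot(all(row_is_consistent))  # stop early if the shortcut is not safe

# First non-missing date in each row, bound to the non-date columns
first_date <- apply(dates, 1, function(x) x[!is.na(x)][1])
result <- cbind(Date = first_date, wide[, -date_cols])
```

If the check fails, the rows mix different dates and a per-pair merge is the safer route.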

Maybe this loop could help you out:

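# Accumulate the merges in one object instead of creating df3, df4, ...
result <- df1
for (n in seq_len(ncol(df2) / 2)) {
  cols <- (2 * n - 1):(2 * n)                # note: n:n+1 parses as (n:n) + 1
  pair <- df2[!is.na(df2[[cols[1]]]), cols]  # drop rows with no date for this pair
  result <- merge(result, pair,
                  by.x = 'Date', by.y = paste('Date', n, sep = '.'),
                  all = TRUE)
}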

I think sqldf offers an efficient way of doing this.

# Changing column names in df2 for convenience
names(df2) <- c("Date1", "USD_Price", "Date2", "EUR_Price", "Date3", "JPY_Price", "Date4", "INR_Price")

library(sqldf) 
sqldf({"
    SELECT D1.Date, D2.USD_Price, D2.EUR_Price, D2.JPY_Price, D2.INR_Price FROM df1 AS D1
    INNER JOIN df2 AS D2
    ON D1.Date IN (D2.Date1, D2.Date2, D2.Date3, D2.Date4)
"})

#        Date USD_Price EUR_Price JPY_Price INR_Price
#1 2009-10-13      21.6        NA        NA        NA
#2 2009-10-14      21.9      78.2        NA        NA
#3 2009-10-16      22.0      78.5        NA      12.2
#4 2009-10-18        NA      78.9      32.1      12.4
#5 2009-10-19        NA        NA      32.6      12.2
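
With around 1000 columns, writing the SELECT list and the IN list by hand is not practical, so the query text could be assembled from the column names instead. This is a sketch only: it assumes the renamed columns alternate as Date<n>, <CUR>_Price in the way shown above, and the variable names are illustrative.

```r
# Renamed column names of df2, alternating date and price columns
cols <- c("Date1", "USD_Price", "Date2", "EUR_Price",
          "Date3", "JPY_Price", "Date4", "INR_Price")

date_cols  <- cols[seq(1, length(cols), by = 2)]  # Date1, Date2, ...
price_cols <- cols[seq(2, length(cols), by = 2)]  # USD_Price, EUR_Price, ...

# Build the same query as above from the column names
query <- sprintf(
  "SELECT D1.Date, %s FROM df1 AS D1 INNER JOIN df2 AS D2 ON D1.Date IN (%s)",
  paste0("D2.", price_cols, collapse = ", "),
  paste0("D2.", date_cols, collapse = ", ")
)

cat(query, "\n")
```

The resulting string can then be passed to sqldf(query) in place of the handwritten query.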

Here's a tidyverse way using your example df1 and df2, with the date columns processed with lubridate:

library(tidyr)
library(dplyr)
library(lubridate)

# reformat df2
df2bis <- 
  df2 %>%
  gather(key = "tmp_key",
         value = "Date",
         starts_with("Date"),
         na.rm = TRUE) %>%
  select(-tmp_key) %>%
  distinct()

# and merge with df1
df <- inner_join(df1, df2bis, by = "Date")
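
In current tidyr releases gather() is superseded by pivot_longer(). A sketch of the same reshaping with pivot_longer() follows, using a reduced two-currency example with dates stored as character strings; the data and the names wide, long and res are illustrative rather than taken from the answer.

```r
library(tidyr)
library(dplyr)

# Reduced example in the question's layout (illustrative data)
df1 <- data.frame(
  Date = c("2009-10-13", "2009-10-14", "2009-10-16", "2009-10-20"),
  stringsAsFactors = FALSE
)
wide <- data.frame(
  Date.1    = c("2009-10-13", "2009-10-14", NA),
  USD.Price = c(21.6, 21.9, NA),
  Date.2    = c(NA, "2009-10-14", "2009-10-16"),
  EUR.Price = c(NA, 78.2, 78.5),
  stringsAsFactors = FALSE
)

# Stack all Date.* columns into a single Date column, dropping missing
# dates, then remove the helper key and collapse duplicated rows
long <- wide %>%
  pivot_longer(cols = starts_with("Date"),
               names_to = "tmp_key",
               values_to = "Date",
               values_drop_na = TRUE) %>%
  select(-tmp_key) %>%
  distinct()

# Keep only the dates that also appear in df1
res <- inner_join(df1, long, by = "Date")
```

As with the gather() version, this relies on each row of the wide table carrying a single date across its Date.* columns.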
