[英]How to merge multiple dataframes with different column names
I have a two data frame, say 'df1' and 'df2'. 我有两个数据框,分别是“ df1”和“ df2”。 df1 has the following column: df1包含以下列:
Date
and df2 has the following columns: df2包含以下列:
Date.1, USD.Price, Date.2, EUR.Price, Date.3, JPY.Price, Date.4, INR.Price
where Date, Date.1, Date.2, Date.3, Date.4 ... is in date format. 其中Date,Date.1,Date.2,Date.3,Date.4 ...为日期格式。
Now I want to merge Date.1, USD.Price with df1 based on df1$Date and df2$Date.2 as: 现在我想将基于df1 $ Date和df2 $ Date.2的Date.1,USD.Price与df1合并为:
df3 = merge(df1, df2[,1:2], by.x = "Date", by.y = "Date.1", all = TRUE)
Then, 然后,
df4 = merge(df3, df2[,3:4], by.x = "Date", by.y = "Date.2", all = TRUE)
Then again, 再说一次
df5 = merge(df4, df2[,5:6], by.x = "Date", by.y = "Date.3", all = TRUE)
Furthermore, 此外,
df6 = merge(df5, df2[,7:8], by.x = "Date", by.y = "Date.4", all = TRUE)
and so on for all 1000 such columns. 以此类推,对于所有1000个此类列。
For example, lets say, I have a following dataframe: 例如,假设我有一个以下数据框:
df1: df1:
Date
2009-10-13
2009-10-14
2009-10-16
2009-10-18
2009-10-19
2009-10-20
2009-10-21
2009-10-22
and df2: 和df2:
Date.1 USD.Price Date.2 EUR.Price Date.3 JPY.Price Date.4 INR.Price
2009-10-13 21.6 NA NA NA NA NA NA
2009-10-14 21.9 2009-10-14 78.2 NA NA NA NA
2009-10-16 22.0 2009-10-16 78.5 NA NA 2009-10-16 12.2
NA NA 2009-10-18 78.9 2009-10-18 32.1 2009-10-18 12.4
NA NA NA NA 2009-10-19 32.6 2009-10-19 12.2
Then the output needs to be: 然后输出需要是:
Date USD.Price EUR.Price JPY.Price INR.Price
2009-10-13 21.6 NA NA NA
2009-10-14 21.9 78.2 NA NA
2009-10-16 22.0 78.5 NA NA
2009-10-18 NA 78.9 32.1 12.4
2009-10-19 NA NA 32.6 12.2
I have got some reference: How can I merge multiple dataframes with the same column names? 我有一些参考: 如何合并具有相同列名的多个数据框?
But in my case column names are different as Date.1, Date.2, Date.3 etc... 但是在我的情况下,列名与Date.1,Date.2,Date.3等不同。
Can anyone please help me out how to do this for around 1000 columns aa doing as above is not scalable for many columns? 任何人都可以帮我解决大约1000列的问题。如上所述,对于许多列而言,它是不可扩展的?
Thanks 谢谢
You can try a recursive function (a function that calls itself). 您可以尝试递归函数(一个调用自身的函数)。
It takes two data.frames and a column index. 它需要两个data.frames和一个列索引。 It merges the data.frames
based on the first column of df1
and the first column of df2
that is subsetted using the idx
. 它基于df1
的第一列和使用idx
的df2
的第一列合并data.frames
。 Then it calls itself using the new data.frame dfx
and df2
while idx is less then the number of columns in df2
- 1. 然后,它使用新的data.frame自称dfx
和df2
而IDX是小于在列数df2
- 1。
merge_df <- function(df1, df2, idx) {
dfx <- merge(df1, df2[, idx:(idx + 1)], by.x = names(df1)[1],
by.y = names(df2)[idx])
if (idx < ncol(df2) - 1) {
return(merge_df(dfx, df2, idx + 2))
} else {
return(dfx)
}
}
You can use it like this: 您可以像这样使用它:
df1 <- data.frame(id = 1:10)
df2 <- data.frame(id1 = 1:10,
test1 = letters[1:10],
id2 = 1:10,
test2 = LETTERS[1:10])
df <- merge_df(df1, df2, 1)
This would result in this: 这将导致:
head(df, 10)
id test1 test2
1 1 a A
2 2 b B
3 3 c C
4 4 d D
5 5 e E
6 6 f F
7 7 g G
8 8 h H
9 9 i I
10 10 j J
You could do this... 你可以做...
datecols <- grep("Date", names(df)) #get date columns
dfDates <- apply(df[,datecols], 1, function(x) x[!is.na(x)][1]) #vector of dates
df2 <- cbind(Date=dfDates, df[,-datecols]) #bind dates to non-date columns
df2
Date USD.Price EUR.Price JPY.Price INR.Price
1 2009-10-13 21.6 NA NA NA
2 2009-10-14 21.9 78.2 NA NA
3 2009-10-16 22.0 78.5 NA 12.2
4 2009-10-18 NA 78.9 32.1 12.4
5 2009-10-19 NA NA 32.6 12.2
Maybe this loop could help you out: 也许这个循环可以帮助您:
for(n in 1:999){
assign(paste('df',n+2,sep = ''),
merge(get(paste('df',n,sep = '')), get(paste('df',n+1,sep = ''))[,n:n+1],
by.x = 'Date', by.y = paste('Date',n,sep = '.'), all = TRUE),
envir = .GlobalEnv)
}
An efficient way of doing this using sqldf
I think. 我认为使用sqldf
做到这一点的有效方法。
# Changing column names in df2 for convenience
names(df2) <- c("Date1", "USD_Price", "Date2", "EUR_Price", "Date3", "JPY_Price", "Date4", "INR_Price")
library(sqldf)
sqldf({"
SELECT D1.Date, D2.USD_Price, D2.EUR_Price, D2.JPY_Price, D2.INR_Price FROM df1 AS D1
INNER JOIN df2 AS D2
ON D1.Date IN (D2.Date1, D2.Date2, D2.Date3, D2.Date4)
"})
# Date USD_Price EUR_Price JPY_Price INR_Price
#1 2009-10-13 21.6 NA NA NA
#2 2009-10-14 21.9 78.2 NA NA
#3 2009-10-16 22.0 78.5 NA 12.2
#4 2009-10-18 NA 78.9 32.1 12.4
#5 2009-10-19 NA NA 32.6 12.2
Here's a tidyverse way using your example df1
and df2
with the date columns processed with lubridate
: 这是使用示例df1
和df2
以及用lubridate
处理的日期列的一种整理方法:
library(tidyr)
library(dplyr)
library(lubridate)
# reformat df2
df2bis <-
df2 %>%
gather(key = "tmp_key",
value = "Date",
starts_with("Date"),
na.rm = TRUE) %>%
select(-tmp_key) %>%
distinct()
# and merge with df1
df <- inner_join(df1, df2bis)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.