Merge two dataframes with repeated columns

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file contains the same fields for every customer, such as:

names(data.jan)

ID     AGE      CITY      GENDER

names(data.feb)

ID     AGE      CITY      GENDER

To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.

Considering a "customer X", I have three possible scenarios:

1- Customer X is listed in the January database, but he left and is no longer listed in February

2- Customer X is listed in both the January and February databases

3- Customer X entered the database in February, so he is not listed in January

I am stuck on the following problem: I need to create a single database with all customers listed in either dataframe, together with their respective information. However, for a customer who is listed in both dataframes, I want to pick his information from his first entry, that is, January.

When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html

[image: merge options]

data <- merge(data.jan,data.feb, by="ID", all=TRUE)

Regardless of which of all, all.x or all.y I choose, I get the same undesired output, called data:

data[1,]

ID     AGE.x      CITY.x      GENDER.x       AGE.y      CITY.y      GENDER.y
123      25         NY           M            25          NY            M
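For context, here is a small sketch of why the suffixed columns appear. The toy values below are made up for illustration and are not the asker's data. Because only ID is used as the key, merge() keeps the remaining columns from both inputs and disambiguates them with its default suffixes `.x` and `.y`. The all, all.x and all.y arguments only control which rows are kept, not which columns.

```r
# Toy stand-ins for data.jan and data.feb (values are made up for illustration)
data.jan <- data.frame(ID = c(123, 456), AGE = c(25, 40),
                       CITY = c("NY", "LA"), GENDER = c("M", "F"))
data.feb <- data.frame(ID = c(123, 789), AGE = c(25, 31),
                       CITY = c("NY", "SF"), GENDER = c("M", "F"))

# Only ID is a key, so AGE, CITY and GENDER from each side are both kept
# and renamed with the default suffixes ".x" (first frame) and ".y" (second frame)
merged <- merge(data.jan, data.feb, by = "ID", all = TRUE)
names(merged)
```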

I think that what would work here is to merge both databases with this type of join:

[image: diagram of the intended join type, not reproduced]

Then, merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.

Thanks,

Bernardo

 d1 <- data.frame(x=1:9,y=1:9,z=1:9)
 d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
 d3 <- merge(d1,d2, by="x", all=TRUE) #merge


# keep the original columns from January (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from February (i.e. y.y, z.y)
missing.jan <- is.na(d3[, 2])   # rows with no January entry
d3[missing.jan, 2:3] <- d3[missing.jan, 4:5]
#>  d3[, 1:3]
#    x y.x z.x
#1   1   1   1
#2   2   2   2
#3   3   3   3
#4   4   4   4
#5   5   5   5
#6   6   6   6
#7   7   7   7
#8   8   8   8
#9   9   9   9
#10 10  20  30

This may be tiresome for more than 2 months though; perhaps you should consider @flodel's comments. Also note there are demons when your original Jan data has NAs (and you still want the first month's data retained, NA or not), although you never mentioned them in your question.
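One possible way around both concerns is sketched below. It is a different technique from the merge above, not the answerer's method: stack the months in chronological order and keep the first row seen for each ID. It assumes every monthly frame has exactly the same columns. The frames and values here are hypothetical.

```r
# Hypothetical monthly frames with identical columns (made-up values)
data.jan <- data.frame(ID = c(1, 2), AGE = c(25, NA), CITY = c("NY", "LA"))
data.feb <- data.frame(ID = c(2, 3), AGE = c(41, 30), CITY = c("LA", "SF"))
data.mar <- data.frame(ID = c(1, 4), AGE = c(26, 52), CITY = c("NY", "DC"))

# Stack the months in chronological order
stacked <- do.call(rbind, list(data.jan, data.feb, data.mar))

# duplicated() flags the second and later occurrences of each ID,
# so negating it keeps each customer's earliest row, NA values included
first.seen <- stacked[!duplicated(stacked$ID), ]
first.seen
```

Because rows are selected by position rather than by testing for NA, a genuine NA in the January data (customer 2 here) is retained instead of being overwritten by February's value.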

Try:

data <- merge(data.jan,data.frame(ID=data.feb$ID), by="ID")

I haven't tested it, since no data was provided, but if you join only the ID column from Feb, it should filter out anything that isn't in both frames.
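A quick check of that suggestion on toy data (made-up values, not the asker's files) might look like the sketch below. With merge()'s default all = FALSE this is an inner join, so the result holds only customers present in both months, carrying their January columns. Customers who appear in just one month are dropped.

```r
# Toy data: IDs 2 and 3 appear in both months (values are made up)
data.jan <- data.frame(ID = c(1, 2, 3), AGE = c(25, 40, 33))
data.feb <- data.frame(ID = c(2, 3, 4), AGE = c(40, 33, 58))

# Joining against a one-column frame of February IDs adds no new columns;
# with the default all = FALSE, merge() keeps only IDs present on both sides
both <- merge(data.jan, data.frame(ID = data.feb$ID), by = "ID")
both
```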

@user1317221_G's solution is excellent. If your tables are large (lots of customers), data tables might be faster:

library(data.table)
#  some sample data
jan <- data.table(id=1:10,  age=round(runif(10,25,55)), city=c("NY","LA","BOS","CHI","DC"), gender=rep(c("M","F"),each=5))
new <- data.table(id=11:16, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
feb <- rbind(jan[6:10,],new)
new <- data.table(id=17:22, age=round(runif(6,25,55)), city=c("NY","LA","BOS","CHI","DC","SF"), gender=c("M","F"))
mar <- rbind(jan[1:5,],new)

setkey(jan,id)
setkey(feb,id)

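# full outer join; rows found only in feb have NA in the jan columns (age.x, ...)
join <- data.table(merge(jan, feb, by="id", all=TRUE))
# for those rows, fill the jan columns (2:4) from the feb columns (5:7)
join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with=FALSE]]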

Edit: This adds processing for multiple months.

f <- function(x,y) {
  setkey(x,id)
  setkey(y,id)
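  join <- data.table(merge(x, y, by="id", all=TRUE))   # full outer join
  # rows found only in y: fill the x columns (2:4) from the y columns (5:7)
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with=FALSE]]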
  join[,names(join)[5:7]:=NULL]                # get rid of extra columns
  setnames(join,2:4,c("age","city","gender"))  # rename columns that remain
  return(join)
}

Reduce("f",list(jan,feb,mar))

Reduce(...) applies the function f(...) to the elements of the list in turn: first to jan and feb, then to that result and mar, and so on.
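To make the order of application concrete, here is a minimal base R sketch. The `combine` function is a hypothetical stand-in for f that only records how its arguments are nested. It does not merge anything.

```r
# A stand-in for f(): it just records the order in which arguments are combined
combine <- function(x, y) paste0("f(", x, ",", y, ")")

# Reduce folds from the left: combine(combine("jan", "feb"), "mar")
result <- Reduce(combine, list("jan", "feb", "mar"))
result
# [1] "f(f(jan,feb),mar)"
```

The same left-to-right nesting is what lets the earliest month's values take priority in the merge above, since each step fills in only the rows that are still missing.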
