I have 3 large CSV files (OCA1: 3649 observations, 521 variables; OCA2: 3772 observations, 2513 variables; OCA3: 878 observations, 2513 variables). I want to combine them into one CSV file in R. My only concern is that they have different columns; however, the first 10 columns of each file are the same. Here is an example:
As you can see, up until "Format" the column names are the same. What I would like is for the desired output to look like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA06984 NA006985 HG00096 HG00097
11 891... rs.. A G 100 PASS .. GT 0|0 0|0
11 891... rs.. A G 100 PASS .. GT 0|0 0|0
That is, the columns after "Format" from OCA2 should be added alongside those of OCA1, and the rows from OCA2 should be appended after the last OCA1 observation (3649).
I initially tried rbind but I was struggling due to the columns.
bind_rows from dplyr can bind data sets with different numbers of columns. Here is an example:
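library(dplyr)

# tibble() replaces the deprecated data_frame()
OCA1 <- tibble(
  x = 1:3
)
OCA2 <- tibble(
  x = 1:5,
  y = letters[1:5]
)
OCA3 <- tibble(
  x = 1:10,
  y = letters[1:10],
  z = LETTERS[1:10]
)

# Columns are matched by name; columns missing from an input are filled with NA
df <- bind_rows(
  OCA1,
  OCA2,
  OCA3
)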
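To apply the same idea to the CSV files from the question, something like the following might work. This is only a sketch: the file names and the tiny tables are made-up stand-ins written to a temporary directory, and reading every column as character is an assumption made here so that column types agree across files (bind_rows refuses to combine, for example, a character column with an integer one).

```r
library(dplyr)

# Hypothetical stand-ins for the real files: two tiny CSVs in a temp directory.
tmp <- tempdir()
oca1_path <- file.path(tmp, "OCA1.csv")
oca2_path <- file.path(tmp, "OCA2.csv")

write.csv(data.frame(POS = c(100, 200), FORMAT = "GT", NA06984 = c("0|0", "0|1")),
          oca1_path, row.names = FALSE)
write.csv(data.frame(POS = 300, FORMAT = "GT", HG00096 = "1|1"),
          oca2_path, row.names = FALSE)

# check.names = FALSE keeps headers such as "#CHROM" unchanged;
# colClasses = "character" reads every column as text (an assumption, see above).
files <- c(oca1_path, oca2_path)
tables <- lapply(files, read.csv, check.names = FALSE, colClasses = "character")

# bind_rows() also accepts a list of data frames.
combined <- bind_rows(tables)

# Write the combined table back out as a single CSV.
write.csv(combined, file.path(tmp, "combined.csv"), row.names = FALSE)
```

Columns that need to be numeric afterwards (POS, QUAL and so on) can be converted with as.numeric() once the files are combined.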
Maybe you could create NA columns for the missing ones and then use rbind:
df_l <- list(
  df1 = data.frame(A = rep("f1", 10), B = runif(10), C = 1:10),
  df2 = data.frame(A = rep("f2", 20), B = runif(20), D = paste0("X", 1:20)),
  df3 = data.frame(A = rep("f3", 30), C = 1:30, D = paste0("Y", 1:30))
)

# Union of the column names across all data frames
all_names <- unique(c(colnames(df_l[["df1"]]),
                      colnames(df_l[["df2"]]),
                      colnames(df_l[["df3"]])))

# Add every column a data frame lacks, filled with NA
for (i in names(df_l)) {
  abs_col <- all_names[!all_names %in% names(df_l[[i]])]
  if (length(abs_col) > 0) df_l[[i]][, abs_col] <- NA
}
rm(i)

# rbind matches data frame columns by name, so column order may differ
do.call("rbind", df_l)
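If you need this for an arbitrary number of data frames, the same steps could be wrapped in a small helper. This is a rough base R sketch; rbind_fill is a made-up name used here for illustration, not an existing base R function, and the two small data frames are invented test inputs.

```r
# rbind_fill is a hypothetical helper name, not part of base R.
rbind_fill <- function(dfs) {
  # Union of column names, in order of first appearance
  all_names <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    # Add the columns this data frame lacks, filled with NA
    missing_cols <- setdiff(all_names, names(d))
    if (length(missing_cols) > 0) d[, missing_cols] <- NA
    # Put the columns in a common order
    d[, all_names, drop = FALSE]
  })
  out <- do.call(rbind, padded)
  rownames(out) <- NULL  # drop row names inherited from the inputs
  out
}

# Invented inputs with partly different columns
a <- data.frame(x = 1:2, y = c("a", "b"))
b <- data.frame(x = 3L, z = TRUE)
res <- rbind_fill(list(a, b))
```

Resetting the row names is a choice made in this sketch; without it, rbind on a named list produces row names such as df1.1, df1.2.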