Combine some csv files into one - different number of columns

Question

I have already loaded 20 CSV files with:

tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))

or

list_of_data = lapply(tbl, read.csv)

This is what it looks like:

> head(tbl)
[1] "F1.csv"          "F10_noS3.csv"    "F11.csv"         "F12.csv"         "F12_noS7_S8.csv"
[6] "F13.csv"

I need to combine all of these files into one master file, but as a first step I would like to build a single table of all the names. Every CSV file has a column called "Accession", and I would like a table of all the "names" (accessions) from all of the files. Many accessions are repeated across files, and I would like to keep all of the data corresponding to each accession.

Some problems:

Some of the "names" are identical, and I don't want to duplicate them.
Some of the "names" are ALMOST the same: the name is followed by a dot and a number.
The number of columns can differ between the CSV files.

Here is a screenshot showing what the data look like: http://imageshack.com/a/img811/7103/29hg.jpg

Here is an example of the accessions:

AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--

<-- = Same sample, different names. These should be treated as one, so the dot and the number after it should be ignored.

Is this possible?

I couldn't include a dput(head) because even that is too large.

I tried the following code:

all_data = do.call(rbind, list_of_data)
Error in rbind(deparse.level, ...) : 
The number of columns is not correct.


library(stringr)  # provides str_extract()
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))

I have been trying to do this for almost 2 weeks without success, so any help would be appreciated.

Answer 1

Your question seems to contain multiple subquestions. I encourage you to separate them.

The first thing you apparently need is to combine data frames with different columns. You can use rbind.fill from the plyr package:

library(plyr)
all_data = do.call(rbind.fill, list_of_data)
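To make the idea concrete, here is a minimal, self-contained sketch of the whole flow on toy data. It is not the plyr implementation: bind_fill is a hypothetical base R helper that only mimics the "fill missing columns with NA" behaviour, and the Score and Coverage columns are invented for illustration. The last two steps assume that the part to ignore is always a trailing dot followed by digits, as in the accessions shown in the question.

```r
# Toy stand-ins for two of the CSV files; the columns differ on purpose.
f1 <- data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"),
                 Score = c(10, 20),
                 stringsAsFactors = FALSE)
f2 <- data.frame(Accession = c("AT3G26450.2", "AT4G24770.1"),
                 Coverage = c(0.5, 0.7),
                 stringsAsFactors = FALSE)
list_of_data <- list(f1, f2)

# Hypothetical helper: add every missing column as NA, then rbind.
bind_fill <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) d[[col]] <- NA
    d[all_cols]  # same column order in every data frame
  })
  do.call(rbind, padded)
}

all_data <- bind_fill(list_of_data)

# Drop the trailing ".<digits>" so AT3G26450.1 and AT3G26450.2 share one key.
all_data$CleanedAccession <- sub("\\.[0-9]+$", "", all_data$Accession)

# Keep only the first row seen for each cleaned accession.
unique_data <- all_data[!duplicated(all_data$CleanedAccession), ]
```

Note that the last step discards the later rows for a repeated accession. If all of the data for each accession should be kept, it may be better to stop at all_data and use CleanedAccession as a grouping key instead.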

Answer 2

Here's an example using some tidyverse functions and a custom function that can combine multiple csv files with missing columns into one data frame:

library(tidyverse)

# specify the target directory
dir_path <- '~/test_dir/' 

# specify the naming pattern of the files.
# here: csv files that begin with 'test' followed by a single digit, but it could be as simple as 'csv'
re_file <- '^test[0-9]\\.csv'

# create sample data with some missing columns 
df_mtcars <- mtcars %>% rownames_to_column('car_name')
write.csv(df_mtcars %>% select(-am), paste0(dir_path, 'test1.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-wt, -gear), paste0(dir_path, 'test2.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-cyl), paste0(dir_path, 'test3.csv'), row.names = FALSE)

# custom function that takes the target directory and file name pattern as arguments
read_dir <- function(dir_path, file_name){
  x <- read_csv(paste0(dir_path, file_name)) %>% 
    mutate(file_name = file_name) %>% # add the file name as a column              
    select(file_name, everything())   # reorder the columns so file name is first
  return(x)
}

# read the files from the target directory that match the naming format and combine into one data frame
df_panel <-
  list.files(dir_path, pattern = re_file) %>% 
  map_df(~ read_dir(dir_path, .))

# files with missing columns are filled with NAs.
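If the data frames are already sitting in a list (like list_of_data in the question), dplyr's bind_rows() may be a shorter route to the same kind of result. The following is only a sketch on toy data: the Score and Coverage columns are invented for illustration, and the regular expression assumes the version suffix is always a trailing dot followed by digits.

```r
library(dplyr)

# Toy stand-ins for data frames already read from CSV files,
# named after the file each one came from.
list_of_data <- list(
  "F1.csv"  = tibble(Accession = c("AT3G26450.1", "AT5G44520.2"), Score = c(10, 20)),
  "F11.csv" = tibble(Accession = "AT3G26450.2", Coverage = 0.5)
)

# bind_rows() matches columns by name and fills the gaps with NA;
# .id stores each list element's name in a new column.
all_data <- bind_rows(list_of_data, .id = "file_name")

# Strip the trailing ".<digits>" and record how many rows share each accession.
all_data <- all_data %>%
  mutate(CleanedAccession = sub("\\.[0-9]+$", "", Accession)) %>%
  add_count(CleanedAccession, name = "n_rows")
```

This keeps every row, so no data is lost for accessions that appear in several files; CleanedAccession can then serve as the key for any later grouping or merging.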

2 answers

solution1
2 ACCEPTED 2014-02-06 16:15:18

solution2
0 2018-07-15 12:54:05
