
How to combine several similar .csv files into one dataframe with given structure

I have many .csv files which are similar in structure:

1.csv

Type n
A   1
B   20
C   34
D   5
...

2.csv

Type n
A   2
B   15
C   16
D   5
...

I want to combine them into something like:

Type  n1   n2
  A   1    2
  B   20   15
  C   34   16
  D   5    5
  ...

When I use lapply I get

 Type n  Type   n
  A   1    A    2
  B   20   B    15
  C   34   C    16
  D   5    D    5
  ...

Is there any simple way to combine them properly?

I'm open to solutions in either R or Python.

Interpretation 1: Identical data structure for each CSV

Here are two options to consider if the structure is identical, but first some sample data:

cat("Type n", "A  1", "B  20", "C  34", "D  5", sep = "\n", file = "myfile1.txt")
cat("Type n", "A  2", "B  15", "C  16", "D  5", sep = "\n", file = "myfile2.txt")

Option 1: Drop the first column when you're reading the data in by using "NULL" (with quotes) as the colClasses for the column that needs to be dropped. Use cbind to put the files together.

x <- read.table("myfile1.txt", header=TRUE)
y <- read.table("myfile2.txt", header=TRUE, colClasses=c("NULL", "numeric"))
cbind(x, y)
#   Type  n  n
# 1    A  1  2
# 2    B 20 15
# 3    C 34 16
# 4    D  5  5

## For more files:
## do.call(cbind, list(x, y, ...))

Option 2: Read the files in normally, then subset with a c(FALSE, TRUE) vector, put everything in a list and cbind together with the first column from any of the objects.

x1 <- read.table("myfile1.txt", header = TRUE)
y1 <- read.table("myfile2.txt", header = TRUE)

fileList <- list(x1, y1)
cbind(x1[1], do.call(cbind, fileList)[c(FALSE, TRUE)])
#   Type  n n.1
# 1    A  1   2
# 2    B 20  15
# 3    C 34  16
# 4    D  5   5

Of course, the above are just minimal examples. I'm presuming that you actually have more than 2 columns in each file. For the second option, use a vector of TRUEs and FALSEs that matches the columns you want to keep and drop (respectively). For the first option, use "NULL" for each column to drop and the appropriate class name for each column to keep.
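For readers working in Python, the idea behind Option 1 (skip the repeated "Type" column at read time, then bind the columns side by side) could be sketched with pandas roughly as follows. The in-memory sample data and the names n1 and n2 are assumptions made for illustration. As with cbind, rows are matched by position, so this only holds if every file lists the same types in the same order.

```python
import io
import pandas as pd

# Sample data mirroring myfile1.txt and myfile2.txt, held in memory
# so the sketch is self-contained (an assumption for illustration).
file1 = io.StringIO("Type n\nA  1\nB  20\nC  34\nD  5\n")
file2 = io.StringIO("Type n\nA  2\nB  15\nC  16\nD  5\n")

# Read the first file in full, but only the "n" column of the second.
x = pd.read_csv(file1, sep=r"\s+").rename(columns={"n": "n1"})
y = pd.read_csv(file2, sep=r"\s+", usecols=["n"]).rename(columns={"n": "n2"})

# Column-wise bind: rows are aligned on the default 0..3 integer index,
# i.e. by position, just like cbind in R.
combined = pd.concat([x, y], axis=1)
print(combined)
```

If the row order might differ between files, a merge on "Type" is the safer choice.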


Interpretation 2: Similar data structure for each CSV

If the data structures are similar but not identical, you might need to use merge instead. Consider the following sample data. The first three files have the same structure, but the fourth one, "myfile4.txt", has "A", "B", "D", and "E" as the "Type" values, while the other three have "A", "B", "C", and "D".

cat("Type n", "A  1", "B  20", "C  34", "D  5", sep = "\n", file = "myfile1.txt")
cat("Type n", "A  2", "B  15", "C  16", "D  5", sep = "\n", file = "myfile2.txt")
cat("Type n", "A  1", "B   5", "C   6", "D  7", sep = "\n", file = "myfile3.txt")
cat("Type n", "A  8", "B   9", "D  11", "E  0", sep = "\n", file = "myfile4.txt")

Here's how we can tackle this.

  1. Bulk read in the files:

     x <- list.files(pattern="myfile")
     y <- lapply(x, read.table, header = TRUE)

  2. Multiple merges will probably result in an error if merge can't make unique names. Help merge out by making unique names for the non-id columns to start.

     library(data.table) ## for `setnames`
     ## setnames will silently assign new names
     ## to the original data in list "y"
     invisible(lapply(seq_along(y), function(z)
       setnames(y[[z]], "n", paste("n", z, sep = "_"))))

  3. Use Reduce to merge the list items together using the "Type" column as the "id".

     Reduce(function(x, y) merge(x, y, by = "Type", all = TRUE), y)
     #   Type n_1 n_2 n_3 n_4
     # 1    A   1   2   1   8
     # 2    B  20  15   5   9
     # 3    C  34  16   6  NA
     # 4    D   5   5   7  11
     # 5    E  NA  NA  NA   0
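The three steps above could be sketched in Python with pandas and functools.reduce along the following lines. This is only an illustration under stated assumptions: the sample data is held in memory instead of being read from disk, and the n_1 ... n_4 column names simply mirror the R output.

```python
import io
from functools import reduce

import pandas as pd

# The four sample files from above, kept in memory (an assumption so the
# sketch runs without touching the file system).
raw = [
    "Type n\nA  1\nB  20\nC  34\nD  5\n",
    "Type n\nA  2\nB  15\nC  16\nD  5\n",
    "Type n\nA  1\nB   5\nC   6\nD  7\n",
    "Type n\nA  8\nB   9\nD  11\nE  0\n",
]

# Steps 1 and 2: read every file and give each "n" column a unique name.
frames = [
    pd.read_csv(io.StringIO(text), sep=r"\s+").rename(columns={"n": f"n_{i}"})
    for i, text in enumerate(raw, start=1)
]

# Step 3: fold the list with an outer merge on "Type", so that types
# missing from some files are kept and filled with NaN.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Type", how="outer"),
    frames,
)
merged = merged.sort_values("Type").reset_index(drop=True)
print(merged)
```

Columns that end up with missing values are converted to floating point, because NaN cannot be stored in a plain integer column.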

In Python, you can use pandas to perform these operations:

import pandas as pd

# Raw strings avoid the invalid-escape warning for '\s' in recent Python versions
df1 = pd.read_csv('1.csv', sep=r'\s+', index_col=0)
df2 = pd.read_csv('2.csv', sep=r'\s+', index_col=0)

pd.concat([df1, df2], axis=1)
Out[16]: 
       n   n
Type        
A      1   2
B     20  15
C     34  16
D      5   5

If you want the columns renamed automatically:

pd.merge(df1, df2, left_index=True, right_index=True, suffixes=['1', '2'])
Out[20]: 
      n1  n2
Type        
A      1   2
B     20  15
C     34  16
D      5   5
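With more than two files, the same approach might be generalised as in the sketch below. The temporary directory, the numeric file names, and the n1, n2 naming scheme are assumptions made so the example is self-contained. In practice you would point the glob at your own folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: write two sample files into a temporary directory.
tmp = Path(tempfile.mkdtemp())
(tmp / "1.csv").write_text("Type n\nA 1\nB 20\nC 34\nD 5\n")
(tmp / "2.csv").write_text("Type n\nA 2\nB 15\nC 16\nD 5\n")

# Sort numerically by file stem so that 10.csv would come after 2.csv.
paths = sorted(tmp.glob("*.csv"), key=lambda p: int(p.stem))

# Read the "n" column of each file, indexed by "Type".
columns = [pd.read_csv(p, sep=r"\s+", index_col=0)["n"] for p in paths]

# Concatenate side by side; `keys` supplies the new column names and
# rows are aligned on the "Type" index rather than by position.
combined = pd.concat(columns, axis=1, keys=[f"n{p.stem}" for p in paths])
print(combined)
```

Because the rows are aligned on the index, a type that is missing from one file shows up as NaN in that file's column instead of shifting the other values.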

Here is another solution, assuming no merging needs to be done. If you have three files, for example, you can read them in like this:

n <- 1:3
x <- lapply(sprintf('%s.csv', n), read.csv)

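You just want to drop the first column in every table, so you can use sapply() with [[.data.frame to remove the unwanted column, and then combine it all into one data frame. Note that [[ with a negative index only works here because each table has exactly two columns, so dropping one leaves a single column.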

data.frame(Type = x[[1]]$Type, sapply(x, '[[', -1))

Or if you really want the names in the form n1, n2, etc.:

data.frame(
  Type = x[[1]]$Type, 
  setNames(lapply(x, '[[', -1), paste0('n', n))
)


 