
How to create a dataframe by importing data from multiple .csv files that are alike in content?

I have been struggling with this issue for hours now and I can't seem to figure it out. I would really appreciate any input that would help.

Background

I am trying to automate data manipulation for my research lab in school through Python. Each experiment produces a .csv file containing 41 rows of data (excluding the header), as seen below.

enter image description here

Sometimes there are multiple runs of the same experiment, which produce .csv files with the same header, and an average of them is needed for accuracy. They look something like this, with the same number of rows and the same headers:

enter image description here

So far I have been able to filter the basenames so that only the .csv files with the same parameters are grouped together and added to a dataframe. However, my issue is that I don't know how to continue from there to get an average.

My Current Code and output

Code:

import pandas as pd
import os

dir = "/Users/luke/Desktop/testfolder"

files = os.listdir(dir)
files_of_interests = {}

# group .csv files whose names differ only in the character before '.csv'
# (i.e. repeated runs of the same experiment share a key)
for filename in files:
    if filename[-4:] == '.csv':
        key = filename[:-5]
        files_of_interests.setdefault(key, [])
        files_of_interests[key].append(filename)

print(files_of_interests)

for key in files_of_interests:
    stack_df = pd.DataFrame()
    print(stack_df)
    for filename in files_of_interests[key]:
        # each file is stacked below the previous one (rows keep growing)
        stack_df = stack_df.append(pd.read_csv(os.path.join(dir, filename)))
    print(stack_df)

Output:

Empty DataFrame
Columns: []
Index: []
    Unnamed: 0  Wavelength       S2c  Wavelength.1        S2
0            0        1100  0.000342          1100  0.000304
1            1        1110  0.000452          1110  0.000410
2            2        1120  0.000468          1120  0.000430
3            3        1130  0.000330          1130  0.000306
4            4        1140  0.000345          1140  0.000323
..         ...         ...       ...           ...       ...
36          36        1460  0.002120          1460  0.001773
37          37        1470  0.002065          1470  0.001693
38          38        1480  0.002514          1480  0.002019
39          39        1490  0.002505          1490  0.001967
40          40        1500  0.002461          1500  0.001891

[164 rows x 5 columns]

Question Here!

So my question is: how do I get it to append towards the right, as separate columns, for each S2c and S2?

Explanation:

With multiple .csv files that have the same header names, when I append them, each one just keeps stacking below the previous .csv file, which led to the [164 rows x 5 columns] shown in the previous section. My original idea was to create a new dataframe and append only S2c and S2 from each of those .csv files, so that instead of stacking on top of one another they keep being appended as new columns towards the right. Afterwards, I can do some form of pandas column manipulation to have them added up and divided by the number of runs (which is just the number of files, i.e. len(files_of_interests[key]) inside the second for loop).
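A minimal sketch of that idea (run1.csv and run2.csv are hypothetical placeholders for the filenames in files_of_interests[key]):

import pandas as pd

runs = ['run1.csv', 'run2.csv']  # placeholder names; the real ones come from files_of_interests[key]

# keep only the S2c and S2 columns of each run and place them side by side
wide = pd.concat([pd.read_csv(f)[['S2c', 'S2']] for f in runs], axis=1)

# add the duplicated columns together and divide by the number of runs
avg_S2c = wide['S2c'].sum(axis=1) / len(runs)
avg_S2 = wide['S2'].sum(axis=1) / len(runs)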

What I have tried

  1. I have tried creating an empty dataframe and adding a column taken from np.arange(1100,1500,10) using pd.DataFrame.from_records(), and then appending S2c and S2 to that dataframe as described in the previous section. The same issue occurred, and in addition it produced a bunch of NaN values which I am not well equipped to deal with, even after searching further (see the short sketch after this list for one way such NaNs can appear).

  2. I have read up on multiple other questions posted here; many suggested using pd.concat, but since the answers are tailored to different situations, I couldn't really replicate them, nor was I able to understand the documentation for it, so I stopped pursuing this path.
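A minimal sketch (with hypothetical values, not the real files) of one common way such NaN values appear: stacking frames whose columns do not overlap leaves every cell with no source value as NaN.

import numpy as np
import pandas as pd

base = pd.DataFrame({'Wavelength': np.arange(1100, 1500, 10)})  # wavelength column only
run = pd.DataFrame({'S2c': [0.1, 0.2], 'S2': [0.3, 0.4]})       # hypothetical S2c/S2 values

# vertical stacking (what DataFrame.append and pd.concat do by default):
# the two frames share no columns, so the result is full of NaN
stacked = pd.concat([base, run])
print(stacked)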

Thank you in advance for your help!

Additional Info

I am using macOS and Atom for the code.

The csv files can be found here!

github: https://github.com/teoyi/PROJECT-Automate-Research-Process

Trying out @zabop's method

Code:

dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)) )
concat = pd.concat(dflist, axis = 1)
concat.to_csv(dir + '/concat.csv')

Output:

enter image description here

Trying @SergeBallesta's method

Code:

df = pd.concat([pd.read_csv(os.path.join(dir, filename))
                for key in files_of_interests for filename in files_of_interests[key]])

df = df.groupby(['Unnamed: 0', 'Wavelength', 'Wavelength.1']).mean().reset_index()
df.to_csv(dir + '/try.csv')
print(df)

Output:

enter image description here

If you have a list of dataframes, for example:

import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]

You can do:

pd.concat(dflist,axis=1)

Which will look like:

enter image description here
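Printed as plain text (roughly, depending on display settings), that concatenated frame is:

   col_1  col_2  col_1  col_2  col_1  col_2  col_1  col_2  col_1  col_2
0      3      3      3      3      3      3      3      3      3      3
1      2      1      2      1      2      1      2      1      2      1
2      1      2      1      2      1      2      1      2      1      2
3      0      0      0      0      0      0      0      0      0      0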

If you want to append to each column name a number indicating which df it came from, do this before concat:

for index, df in enumerate(dflist):
    df.columns = [col+'_'+str(index) for col in df.columns]

Then pd.concat(dflist,axis=1) results in:

enter image description here
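As a sketch of how the renamed columns could then be used (assuming the renaming loop above has been applied to frames read from the experiment files, so their columns become S2_0, S2_1, ... and S2c_0, S2c_1, ...), the per-row average across runs could be computed like this:

concat = pd.concat(dflist, axis=1)

# select the suffixed S2 and S2c columns and average them row-wise
s2_cols = [col for col in concat.columns if col.startswith('S2_')]
s2c_cols = [col for col in concat.columns if col.startswith('S2c_')]
concat['S2_mean'] = concat[s2_cols].mean(axis=1)
concat['S2c_mean'] = concat[s2c_cols].mean(axis=1)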


While I can't reproduce your file system & confirm that this works, to create the dflist above from your files, something like this should work:

dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))

IIUC you have:

  • a bunch of csv files, each containing a result from the same experiment
  • the first relevant column always contains numbers from 0 to 40 (so there are 41 lines per file)
  • the Wavelength and Wavelength.1 columns always contain the same values, from 1100 to 1500 with a 10 increment
  • but additional columns may exist before the first relevant one
  • the first column has no name in the csv file, and up to the first relevant one the column names start with 'Unnamed: '

and you would like to get the average values of the S2 and S2c column for the same Wavelength value.

This can be done simply with groupby and mean, but we first have to filter out all the unnecessary columns. That can be done through the index_col and usecols parameters of read_csv:

...
print(files_of_interests)

# first concat the datasets:
dfs = [pd.read_csv(os.path.join(dir, filename), index_col=1,
                   usecols=lambda x: not x.startswith('Unnamed: '))
       for key in files_of_interests for filename in files_of_interests[key]]
df = pd.concat(dfs).reset_index()

# then take the averages
df = df.groupby(['Wavelength', 'Wavelength.1']).mean().reset_index()

# reorder columns and add 1 to the index to have it to run from 1 to 41
df = df.reindex(columns=['Wavelength', 'S2c', 'Wavelength.1', 'S2'])
df.index += 1

If there are still unwanted columns in the resulting df, this command will help to identify the original files having a weird structure:

import pprint

# read the raw files again (without filtering) to inspect their original columns
raw_dfs = [pd.read_csv(os.path.join(dir, filename))
           for key in files_of_interests for filename in files_of_interests[key]]
pprint.pprint([df.columns for df in raw_dfs])

With the files from the github testfolder, it gives:

[Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
 Index(['Unnamed: 0', 'Unnamed: 0.1', 'Wavelength', 'S2c', 'Wavelength.1',
       'S2'],
      dtype='object'),
 Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object')]

It makes clear that the fifth file has an additional column.

It turns out that both @zabop and @SergeBallesta have provided me with valuable insights on how to work on this issue through pandas.

What I wanted to have:

  1. Merge the respective S2c and S2 columns of each file within the key:value pairs into one .csv file for further manipulation.

  2. Remove redundant columns to only show a single column of Wavelength that ranges from 1100 to 1500 with an increment of 10.

This requires the use of pd.concat which was introduced by @zabop and @SergeBallesta as shown below:

for key in files_of_interests:
    dfs = []
    for filename in files_of_interests[key]:
        dfs.append(pd.read_csv(os.path.join(dir, filename)))
    # concatenate the runs horizontally and drop the redundant columns
    df = pd.concat(dfs, axis=1)
    df = df.drop(['Unnamed: 0', 'Wavelength.1'], axis=1)
    print(df)
    df.to_csv(os.path.join(dir, f"{filename[:-5]}_master.csv"))

I had to use files_of_interests[key] for it to be able to read the filenames and have pd.read_csv read the correct path. Other than that, I added axis=1 to pd.concat, which makes the frames concatenate horizontally, together with for loops to access the filenames correctly. (I have double-checked the values and they do match up with the respective .csv files.)

The output to .csv looks like this:

enter image description here

The only issue now is that groupby as suggested by @SergeBallesta did not work, as it returns ValueError: Grouper for 'Wavelength' not 1-dimensional. I will be creating a new question for this if I make no progress by the end of the day.
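That error typically means the frame contains more than one column literally named 'Wavelength' (which is what horizontal concatenation of identically named files produces), so the grouper is not a single column. A minimal sketch of a row-wise alternative, assuming the master frame df built above and at least two runs in the group:

# df still holds duplicate 'Wavelength', 'S2c' and 'S2' columns, one set per run
wavelength = df['Wavelength'].iloc[:, 0]  # every run shares the same wavelengths
averaged = pd.DataFrame({
    'Wavelength': wavelength,
    'S2c': df['S2c'].mean(axis=1),  # row-wise mean across runs
    'S2': df['S2'].mean(axis=1),
})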

Once again, a big thank you to @zabop and @SergeBallesta for giving this a try even though my explanation wasn't too clear; their answers have definitely provided me with the much-needed insight into how pandas works.
