How to create a dataframe by importing data from multiple .csv files that are alike in contents?
I have been struggling with this issue for hours now and I can't seem to figure it out. I would really appreciate any input that would help.
Background
I am trying to automate data manipulation for my research lab in school through Python. The experiment produces a .csv file containing 41 rows of data, excluding the header, as seen below.
Sometimes, multiple runs of the same experiment exist, producing .csv files with the same header, and taking an average of them is needed for accuracy. Something like this, with the same number of rows and headers:
So far I was able to filter the basenames to only contain the .csv files of the same parameters and add them to a data frame. However, my issue is that I don't know how to continue to get an average.
My Current Code and Output
Code:
import pandas as pd
import os

dir = "/Users/luke/Desktop/testfolder"
files = os.listdir(dir)

files_of_interests = {}
for filename in files:
    if filename[-4:] == '.csv':
        key = filename[:-5]
        files_of_interests.setdefault(key, [])
        files_of_interests[key].append(filename)
print(files_of_interests)

for key in files_of_interests:
    stack_df = pd.DataFrame()
    print(stack_df)
    for filename in files_of_interests[key]:
        stack_df = stack_df.append(pd.read_csv(os.path.join(dir, filename)))
    print(stack_df)
Output:
Empty DataFrame
Columns: []
Index: []
Unnamed: 0 Wavelength S2c Wavelength.1 S2
0 0 1100 0.000342 1100 0.000304
1 1 1110 0.000452 1110 0.000410
2 2 1120 0.000468 1120 0.000430
3 3 1130 0.000330 1130 0.000306
4 4 1140 0.000345 1140 0.000323
.. ... ... ... ... ...
36 36 1460 0.002120 1460 0.001773
37 37 1470 0.002065 1470 0.001693
38 38 1480 0.002514 1480 0.002019
39 39 1490 0.002505 1490 0.001967
40 40 1500 0.002461 1500 0.001891
[164 rows x 5 columns]
Question Here!
So my question is, how do I get it to append towards the right individually for each S2c and S2?
Explanation:
With multiple .csv files with the same header names, when I append them they just keep stacking towards the bottom of the previous .csv file, which led to the [164 rows x 5 columns] from the previous section. My original idea is to create a new data frame and append only S2c and S2 from each of those .csv files, such that instead of stacking on top of one another, it keeps appending them as new columns towards the right. Afterward, I can do some form of pandas column manipulation to have them added and divided by the number of runs (which is just the number of files, so len(files_of_interests[key]) under the second for loop).
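That plan (concatenate the runs side by side, then add the columns and divide by the number of runs) can be sketched on made-up data; the frames and values below are invented for illustration:

```python
import pandas as pd

# two hypothetical runs of the same experiment with identical headers
run1 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.25, 0.5]})
run2 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.75, 1.5]})
runs = [run1, run2]

# element-wise sum of the S2 columns, divided by the number of runs
avg = sum(df['S2'] for df in runs) / len(runs)
print(avg.tolist())  # [0.5, 1.0]
```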
What I have tried
I have tried creating an empty data frame, adding a column taken from np.arange(1100,1500,10) using pd.DataFrame.from_records(), and appending S2c and S2 to the data frame as I described in the previous section. The same issue occurred; in addition, it produces a bunch of NaN values which I am not well equipped to deal with, even after searching further.
I have read up on multiple other questions posted here; many suggested using pd.concat, but since the answers are tailored to a different situation, I couldn't really replicate them, nor was I able to understand the documentation, so I stopped pursuing this path.
Thank you in advance for your help!
Additional Info
I am using macOS and Atom for the code.
The csv files can be found here!
github: https://github.com/teoyi/PROJECT-Automate-Research-Process
Trying out @zabop's method
Code:
dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
concat = pd.concat(dflist, axis=1)
concat.to_csv(dir + '/concat.csv')
Output:
Trying @SergeBallesta's method
Code:
df = pd.concat([pd.read_csv(os.path.join(dir, filename))
                for key in files_of_interests for filename in files_of_interests[key]])
df = df.groupby(['Unnamed: 0', 'Wavelength', 'Wavelength.1']).mean().reset_index()
df.to_csv(dir + '/try.csv')
print(df)
Output:
If you have a list of dataframes, for example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]
You can do:
pd.concat(dflist,axis=1)
Which will look like:
If you want to append each column name with a number indicating which df it came from, before concat, do:
for index, df in enumerate(dflist):
    df.columns = [col + '_' + str(index) for col in df.columns]
Then pd.concat(dflist,axis=1), resulting:
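Put together, the renaming-then-concatenating idea runs end to end like this on the toy data (a sketch independent of the asker's real files):

```python
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]

# tag each frame's columns with its position in the list
for index, df in enumerate(dflist):
    df.columns = [col + '_' + str(index) for col in df.columns]

wide = pd.concat(dflist, axis=1)
print(wide.shape)              # (4, 10)
print(list(wide.columns[:2]))  # ['col_1_0', 'col_2_0']
```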
While I can't reproduce your file system and confirm that this works, to create the dflist above from your files, something like this should work:
dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
IIUC, you have csv files containing index columns whose names start with 'Unnamed: ', and you would like to get the average values of the S2 and S2c columns for the same Wavelength value.
This can be done simply with groupby and mean, but we first have to filter out all the unnecessary columns. That can be done through the index_col and usecols parameters of read_csv:
...
print(files_of_interests)

# first concat the datasets:
dfs = [pd.read_csv(os.path.join(dir, filename), index_col=1,
                   usecols=lambda x: not x.startswith('Unnamed: '))
       for key in files_of_interests for filename in files_of_interests[key]]
df = pd.concat(dfs).reset_index()

# then take the averages
df = df.groupby(['Wavelength', 'Wavelength.1']).mean().reset_index()

# reorder columns and add 1 to the index to have it run from 1 to 41
df = df.reindex(columns=['Wavelength', 'S2c', 'Wavelength.1', 'S2'])
df.index += 1
If there are still unwanted columns in the resulting df, this command will help identify the original files having a weird structure:
import pprint
pprint.pprint([pd.read_csv(os.path.join(dir, filename)).columns
               for key in files_of_interests for filename in files_of_interests[key]])
With the files from the github testfolder, it gives:
[Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Unnamed: 0.1', 'Wavelength', 'S2c', 'Wavelength.1',
'S2'],
dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object')]
It makes clear that the fifth file has an additional column.
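The usecols callable and the groupby/mean averaging can be checked on an in-memory CSV; the data below is invented and only mimics the structure described in the question:

```python
import io
import pandas as pd

csv_text = "Unnamed: 0,Wavelength,S2c\n0,1100,0.2\n1,1110,0.4\n"

# the callable receives each column name and drops those starting with 'Unnamed: '
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=lambda name: not name.startswith('Unnamed: '))
print(list(df.columns))  # ['Wavelength', 'S2c']

# averaging two identical runs with groupby and mean
both = pd.concat([df, df])
avg = both.groupby('Wavelength').mean().reset_index()
print(avg['S2c'].tolist())  # [0.2, 0.4]
```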
It turns out both @zabop and @SergeBallesta have provided me with valuable insights on working through this issue with pandas.
What I wanted to have:我想要的:
The respective S2c and S2 columns of each file within the key:value pairs merged into one .csv file for further manipulation.
Redundant columns removed, showing only a single Wavelength column that ranges from 1100 to 1500 with an increment of 10.
This requires the use of pd.concat, which was introduced by @zabop and @SergeBallesta, as shown below:
for key in files_of_interests:
    frames = []
    for filename in files_of_interests[key]:
        frames.append(pd.read_csv(os.path.join(dir, filename)))
    df = pd.concat(frames, axis=1)
    df = df.drop(['Unnamed: 0', 'Wavelength.1'], axis=1)
    print(df)
    df.to_csv(os.path.join(dir, f"{filename[:-5]}_master.csv"))
I had to use files_of_interests[key] for it to be able to read the filenames and have pd.read_csv read the correct path. Other than that, I added axis = 1 to pd.concat, which allows the frames to be concatenated horizontally, with the for loops accessing the filenames correctly. (I have double-checked the values and they do match up with the respective .csv files.)
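The effect of axis = 1 described above can be seen on two tiny made-up frames: the default stacks rows downward, while axis = 1 places columns side by side.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

tall = pd.concat([a, b])          # stacks rows: shape (4, 1)
wide = pd.concat([a, b], axis=1)  # columns side by side: shape (2, 2)
print(tall.shape, wide.shape)  # (4, 1) (2, 2)
```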
The output to .csv looks like this:
The only issue now is that groupby as suggested by @SergeBallesta did not work, as it returns ValueError: Grouper for 'Wavelength' not 1-dimensional. I will create a new question for this if I make no progress by the end of the day.
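For what it's worth, that ValueError likely comes from the axis=1 concatenation leaving several columns named 'Wavelength', so the grouper selects a 2-D block. One way to average duplicate-named columns is to group the transposed frame by column name; a sketch on made-up runs:

```python
import pandas as pd

# two runs concatenated side by side leave duplicate column names
a = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.25, 0.5]})
b = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.75, 1.5]})
wide = pd.concat([a, b], axis=1)

# wide['Wavelength'] is now 2-D, which is what trips up groupby;
# grouping the transpose by its index averages the duplicates instead
avg = wide.T.groupby(level=0).mean().T
print(avg['S2'].tolist())  # [0.5, 1.0]
```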
Once again, a big thank you to @zabop and @SergeBallesta for giving this a try even though my explanation wasn't too clear; their answers have definitely provided me with much-needed insight into how pandas works.