How to create a dataframe by importing data from multiple .csv files that are alike in contents?
I have been struggling with this issue for hours now and I can't seem to figure it out. I would really appreciate any input that would help.
Background
I am trying to automate data manipulation for my research lab in school through Python. The experiment produces a .csv file containing 41 rows of data, excluding the header, as seen below.
Sometimes, multiple runs of the same experiment exist, producing .csv files with the same header, and taking an average of them is needed for accuracy. Something like this, with the same number of rows and headers:
So far I was able to filter the basenames to only contain the .csv files of the same parameters and add them to a data frame. However, my issue is that I don't know how to continue to get an average.
My Current Code and Output
Code:
import pandas as pd
import os

dir = "/Users/luke/Desktop/testfolder"
files = os.listdir(dir)

files_of_interests = {}
for filename in files:
    if filename[-4:] == '.csv':
        key = filename[:-5]
        files_of_interests.setdefault(key, [])
        files_of_interests[key].append(filename)
print(files_of_interests)

for key in files_of_interests:
    stack_df = pd.DataFrame()
    print(stack_df)
    for filename in files_of_interests[key]:
        stack_df = stack_df.append(pd.read_csv(os.path.join(dir, filename)))
    print(stack_df)
Output:
Empty DataFrame
Columns: []
Index: []
Unnamed: 0 Wavelength S2c Wavelength.1 S2
0 0 1100 0.000342 1100 0.000304
1 1 1110 0.000452 1110 0.000410
2 2 1120 0.000468 1120 0.000430
3 3 1130 0.000330 1130 0.000306
4 4 1140 0.000345 1140 0.000323
.. ... ... ... ... ...
36 36 1460 0.002120 1460 0.001773
37 37 1470 0.002065 1470 0.001693
38 38 1480 0.002514 1480 0.002019
39 39 1490 0.002505 1490 0.001967
40 40 1500 0.002461 1500 0.001891
[164 rows x 5 columns]
Question Here!
So my question is, how do I get it to append towards the right individually for each S2c and S2?
Explanation:
With multiple .csv files with the same header names, when I append them they just keep stacking towards the bottom of the previous .csv file, which led to the [164 rows x 5 columns] from the previous section. My original idea is to create a new data frame and append only S2c and S2 from each of those .csv files, such that instead of stacking on top of one another, it keeps appending them as new columns towards the right. Afterward, I can do some form of pandas column manipulation to have them added and divided by the number of runs (which is just the number of files, so len(files_of_interests[key]) under the second for loop).
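That plan (concatenate the runs side by side, then add the columns and divide by the number of runs) can be sketched on made-up data; the frames and values below are invented for illustration:

```python
import pandas as pd

# two hypothetical runs of the same experiment with identical headers
run1 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.25, 0.5]})
run2 = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.75, 1.5]})
runs = [run1, run2]

# element-wise sum of the S2 columns, divided by the number of runs
avg = sum(df['S2'] for df in runs) / len(runs)
print(avg.tolist())  # [0.5, 1.0]
```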
What I have tried
I have tried creating an empty data frame, adding a column taken from np.arange(1100,1500,10) using pd.DataFrame.from_records(), and appending S2c and S2 to the data frame as I described in the previous section. The same issue occurred; in addition, it produces a bunch of NaN values which I am not well equipped to deal with, even after searching further.
I have read up on multiple other questions posted here; many suggested using pd.concat, but since the answers are tailored to a different situation, I couldn't really replicate them, nor was I able to understand the documentation, so I stopped pursuing this path.
Thank you in advance for your help!
Additional Info
I am using macOS and Atom for the code.
The csv files can be found here!
github: https://github.com/teoyi/PROJECT-Automate-Research-Process
Trying out @zabop's method
Code:
dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
concat = pd.concat(dflist, axis=1)
concat.to_csv(dir + '/concat.csv')
Output:
Trying @SergeBallesta's method
Code:
df = pd.concat([pd.read_csv(os.path.join(dir, filename))
                for key in files_of_interests for filename in files_of_interests[key]])
df = df.groupby(['Unnamed: 0', 'Wavelength', 'Wavelength.1']).mean().reset_index()
df.to_csv(dir + '/try.csv')
print(df)
Output:
If you have a list of dataframes, for example:
import pandas as pd
data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]
You can do:
pd.concat(dflist,axis=1)
Which will look like:
If you want to append each column name with a number indicating which df it came from, before concat, do:
for index, df in enumerate(dflist):
    df.columns = [col + '_' + str(index) for col in df.columns]
Then pd.concat(dflist,axis=1), resulting:
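Put together, the renaming-then-concatenating idea runs end to end like this on the toy data (a sketch independent of the asker's real files):

```python
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': [3, 1, 2, 0]}
dflist = [pd.DataFrame.from_dict(data) for _ in range(5)]

# tag each frame's columns with its position in the list
for index, df in enumerate(dflist):
    df.columns = [col + '_' + str(index) for col in df.columns]

wide = pd.concat(dflist, axis=1)
print(wide.shape)              # (4, 10)
print(list(wide.columns[:2]))  # ['col_1_0', 'col_2_0']
```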
While I can't reproduce your file system and confirm that this works, to create the dflist above from your files, something like this should work:
dflist = []
for key in files_of_interests:
    for filename in files_of_interests[key]:
        dflist.append(pd.read_csv(os.path.join(dir, filename)))
IIUC, you have csv files containing index columns whose names start with 'Unnamed: ', and you would like to get the average values of the S2 and S2c columns for the same Wavelength value.
This can be done simply with groupby and mean, but we first have to filter out all the unnecessary columns. That can be done through the index_col and usecols parameters of read_csv:
...
print(files_of_interests)

# first concat the datasets:
dfs = [pd.read_csv(os.path.join(dir, filename), index_col=1,
                   usecols=lambda x: not x.startswith('Unnamed: '))
       for key in files_of_interests for filename in files_of_interests[key]]
df = pd.concat(dfs).reset_index()

# then take the averages
df = df.groupby(['Wavelength', 'Wavelength.1']).mean().reset_index()

# reorder columns and add 1 to the index to have it run from 1 to 41
df = df.reindex(columns=['Wavelength', 'S2c', 'Wavelength.1', 'S2'])
df.index += 1
If there are still unwanted columns in the resulting df, this command will help identify the original files having a weird structure:
import pprint
pprint.pprint([pd.read_csv(os.path.join(dir, filename)).columns
               for key in files_of_interests for filename in files_of_interests[key]])
With the files from the github testfolder, it gives:
[Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object'),
Index(['Unnamed: 0', 'Unnamed: 0.1', 'Wavelength', 'S2c', 'Wavelength.1',
'S2'],
dtype='object'),
Index(['Unnamed: 0', 'Wavelength', 'S2c', 'Wavelength.1', 'S2'], dtype='object')]
It makes clear that the fifth file has an additional column.
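The usecols callable and the groupby/mean averaging can be checked on an in-memory CSV; the data below is invented and only mimics the structure described in the question:

```python
import io
import pandas as pd

csv_text = "Unnamed: 0,Wavelength,S2c\n0,1100,0.2\n1,1110,0.4\n"

# the callable receives each column name and drops those starting with 'Unnamed: '
df = pd.read_csv(io.StringIO(csv_text),
                 usecols=lambda name: not name.startswith('Unnamed: '))
print(list(df.columns))  # ['Wavelength', 'S2c']

# averaging two identical runs with groupby and mean
both = pd.concat([df, df])
avg = both.groupby('Wavelength').mean().reset_index()
print(avg['S2c'].tolist())  # [0.2, 0.4]
```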
It turns out both @zabop and @SergeBallesta have provided me with valuable insights on working through this issue with pandas.
What I wanted to have:我想要的:
The respective S2c and S2 columns of each file within the key:value pairs merged into one .csv file for further manipulation.
Redundant columns removed, showing only a single Wavelength column that ranges from 1100 to 1500 with an increment of 10.
This requires the use of pd.concat, which was introduced by @zabop and @SergeBallesta, as shown below:
for key in files_of_interests:
    frames = []
    for filename in files_of_interests[key]:
        frames.append(pd.read_csv(os.path.join(dir, filename)))
    df = pd.concat(frames, axis=1)
    df = df.drop(['Unnamed: 0', 'Wavelength.1'], axis=1)
    print(df)
    df.to_csv(os.path.join(dir, f"{filename[:-5]}_master.csv"))
I had to use files_of_interests[key] for it to be able to read the filenames and have pd.read_csv read the correct path. Other than that, I added axis = 1 to pd.concat, which allows the frames to be concatenated horizontally, with the for loops accessing the filenames correctly. (I have double-checked the values and they do match up with the respective .csv files.)
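The effect of axis = 1 described above can be seen on two tiny made-up frames: the default stacks rows downward, while axis = 1 places columns side by side.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

tall = pd.concat([a, b])          # stacks rows: shape (4, 1)
wide = pd.concat([a, b], axis=1)  # columns side by side: shape (2, 2)
print(tall.shape, wide.shape)  # (4, 1) (2, 2)
```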
The output to .csv looks like this:
The only issue now is that groupby as suggested by @SergeBallesta did not work, as it returns ValueError: Grouper for 'Wavelength' not 1-dimensional. I will create a new question for this if I make no progress by the end of the day.
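For what it's worth, that ValueError likely comes from the axis=1 concatenation leaving several columns named 'Wavelength', so the grouper selects a 2-D block. One way to average duplicate-named columns is to group the transposed frame by column name; a sketch on made-up runs:

```python
import pandas as pd

# two runs concatenated side by side leave duplicate column names
a = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.25, 0.5]})
b = pd.DataFrame({'Wavelength': [1100, 1110], 'S2': [0.75, 1.5]})
wide = pd.concat([a, b], axis=1)

# wide['Wavelength'] is now 2-D, which is what trips up groupby;
# grouping the transpose by its index averages the duplicates instead
avg = wide.T.groupby(level=0).mean().T
print(avg['S2'].tolist())  # [0.5, 1.0]
```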
Once again, a big thank you to @zabop and @SergeBallesta for giving this a try even though my explanation wasn't too clear; their answers have definitely provided me with much-needed insight into how pandas works.