
Read multiple csv files and add filename as new column in pandas

I have several csv files in a single folder and I want to open them all in one dataframe and insert a new column with the associated filename. So far I've coded the following:

import pandas as pd
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('path/*.csv'))))
df['filename']= os.path.basename(csv)
df

This gives me the dataframe I want, but the new column 'filename' lists only the last filename in the folder for every row. I'd like each row to be populated with its associated csv file, not just the last file in the folder.

Any assistance for this newbie is much appreciated.

I think you need assign to add the new column inside a loop; the parameter ignore_index=True is also passed to concat to avoid duplicate values in the index:

The test files are a.csv, b.csv and c.csv.

import pandas as pd
import glob, os


files = glob.glob('samples_for_so/*.csv')
print (files)
#['samples_for_so\\a.csv', 'samples_for_so\\b.csv', 'samples_for_so\\c.csv']


df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files])
print (df)
   a  b  c  d    New
0  0  1  2  5  a.csv
1  1  5  8  3  a.csv
0  0  9  6  5  b.csv
1  1  6  4  2  b.csv
0  0  7  1  7  c.csv
1  1  3  2  6  c.csv

#strip the extension and pass ignore_index=True to get a fresh 0..n-1 index
files = glob.glob('samples_for_so/*.csv')
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp).split('.')[0])
                for fp in files], ignore_index=True)
print (df)
   a  b  c  d New
0  0  1  2  5   a
1  1  5  8  3   a
2  0  9  6  5   b
3  1  6  4  2   b
4  0  7  1  7   c
5  1  3  2  6   c
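
For anyone who wants to try this without a folder of real data, here is a minimal self-contained sketch of the same assign-then-concat idea. The helper name read_with_filename, the column name and the sample files are made up for illustration. It uses Path.stem instead of split('.')[0], on the assumption that you want to keep everything before the final extension (split('.')[0] would cut a name like b.v2.csv down to b).

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_with_filename(folder, column="filename"):
    """Read every CSV in `folder` and tag each row with its source file's stem."""
    paths = sorted(Path(folder).glob("*.csv"))  # sorted for a stable row order
    # assign() adds the column to each frame before the frames are combined
    frames = [pd.read_csv(p).assign(**{column: p.stem}) for p in paths]
    # ignore_index=True replaces the per-file indexes with one running index
    return pd.concat(frames, ignore_index=True)


# Build two tiny CSV files in a temporary folder (made-up data for illustration).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("x,y\n1,2\n3,4\n")
    Path(tmp, "b.v2.csv").write_text("x,y\n5,6\n")
    df = read_with_filename(tmp)

print(df)
```

Sorting the paths is optional; glob does not guarantee any particular order, so sorting just keeps the row order reproducible.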

Firstly, you have no csv variable defined.

But anyway, this behaviour makes sense: you are using csv only once, at the end, so every row is set to the last file. Ideally, you can use glob again to get all filenames, then set that as a new column.

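import pandas as pd
import glob, os

#this is a Python list containing the file paths
csvs = glob.glob(os.path.join('path/*.csv'))

#read each file separately so that its row count is known
frames = [pd.read_csv(fp) for fp in csvs]
df = pd.concat(frames, ignore_index=True)

#repeat each filename once per row of its file, so the lengths match
names = pd.Series([os.path.basename(fp) for fp in csvs])
df['file_name'] = names.repeat([len(f) for f in frames]).values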
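
The behaviour described in this answer, where every row received the same name, follows from how pandas assigns a single value to a column. Here is a minimal sketch with made-up data (the frame and file names are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# A single string is broadcast to every row, so all rows get the same value.
df["filename"] = "c.csv"
broadcast = df["filename"].tolist()
print(broadcast)

# A list-like must instead supply one value per row (same length as the frame).
df["filename"] = ["a.csv", "a.csv", "b.csv"]
per_row = df["filename"].tolist()
print(per_row)
```

So the name has to be attached per file, either before concatenating or by repeating each name once per row.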

