Append filename to CSV file in pandas

I'm trying to add the file name of each of my CSV files as a column in those CSV files. I have the basic idea and the code for how to do it, I just can't integrate it into my current code. It's probably very easy.

This is how I'm reading my CSV files and appending them to a DataFrame:

big_frame = pd.concat([pd.read_csv(f, skiprows=0 , header=None , index_col= False ,names=col_Names) for f in glob.glob('filepath' + "/*.csv")],
                      ignore_index=True)

and I know I just need to add these two lines somewhere in the code:

frame['filename'] = os.path.basename(f)
f.append(frame)

Any help?

For example, I have 3 CSV files, each with the same column names as shown below.

Column A Column B Column C 

I want to concatenate them all into one big DataFrame with a new column that holds their original CSV file name, like

Column A Column B Column C filename
                            file 1
                            file 2
                            file 3

You can use df.assign, and you can find the files using Path.glob from the pathlib module.

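from pathlib import Path

import pandas as pd

# Pass the Path object itself to read_csv: file.name is only the bare file
# name and would not resolve unless the working directory is 'filepath'.
big_frame = pd.concat(
    [pd.read_csv(file, skiprows=0, header=None, index_col=False, names=col_Names).assign(filename=file.name)
     for file in Path('filepath').glob('*.csv')],
    ignore_index=True)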
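
As a self-contained sketch of the pathlib approach, the snippet below writes two small CSV files to a temporary folder and reads them back. The helper name read_with_filename, the column names, and the sample rows are made up for illustration; only read_csv, assign, concat, and Path.glob come from the answer. The Path object itself is handed to read_csv so the files resolve regardless of the current working directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical column names, mirroring the question's example.
col_names = ["Column A", "Column B", "Column C"]


def read_with_filename(folder, names):
    """Concatenate every CSV in `folder`, tagging each row with its source file name."""
    frames = [
        pd.read_csv(path, header=None, names=names).assign(filename=path.name)
        for path in sorted(Path(folder).glob("*.csv"))  # sorted for a stable row order
    ]
    return pd.concat(frames, ignore_index=True)


# Made-up sample data written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for stem in ("file1", "file2"):
        Path(tmp, f"{stem}.csv").write_text("1,2,3\n4,5,6\n")
    pathlib_frame = read_with_filename(tmp, col_names)

print(pathlib_frame)
```

Note that pd.concat raises a ValueError when the folder contains no matching files, so an empty glob result may need handling in real code.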

Use DataFrame.assign() after read_csv to add a column as soon as it's read:

big_frame = pd.concat([pd.read_csv(f, ...).assign(filename=os.path.basename(f))
                       for f in glob.glob('filepath' + "/*.csv")],
                      ignore_index=True)

(The ... refers to all the other parameters to read_csv.)
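
To see what assign() does for a single file, here is a minimal sketch. The path and the CSV contents are hypothetical, and io.StringIO stands in for a file on disk so the snippet runs anywhere. A scalar passed to assign() is broadcast to every row, and the original frame is left untouched because assign() returns a new DataFrame.

```python
import io
import os

import pandas as pd

# Hypothetical path and contents; StringIO stands in for a real file.
f = os.path.join("some", "folder", "file 1.csv")
csv_text = "1,2,3\n4,5,6\n"

frame = pd.read_csv(io.StringIO(csv_text), header=None,
                    names=["Column A", "Column B", "Column C"])

# The scalar file name is repeated on every row of the new frame.
tagged = frame.assign(filename=os.path.basename(f))

print(tagged["filename"].tolist())   # ['file 1.csv', 'file 1.csv']
print("filename" in frame.columns)   # False: the original frame is unchanged
```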

Other changes:

  1. pd.concat() accepts a generator, so you don't need to build a list of DataFrames with a list comprehension. The list just uses more memory than needed and, since you're reading from disk, provides no performance improvement. When you pass a generator expression alongside other arguments, it needs its own parentheses. Note the extra indent for readability:

     big_frame = pd.concat((pd.read_csv(f, ...).assign(filename=os.path.basename(f))
                                for f in glob.glob('filepath' + "/*.csv")),
                           ignore_index=True)
  2. For the globbing, use os.path.join (since filepath is a variable name and not the actual path):

     glob.glob(os.path.join(filepath, '*.csv'))

    Or use pathlib.Path and Path.glob as in deadshot's answer.

With all the params:

big_frame = pd.concat((pd.read_csv(f,
                                   skiprows=0,
                                   header=None,
                                   index_col=False,
                                   names=col_Names,
                        ).assign(filename=os.path.basename(f))
                            for f in glob.glob(os.path.join(filepath, '*.csv'))
                       ),
                       ignore_index=True)

Btw, I do this when mass-reading CSVs, except I don't use only the basename, because I want the full path to the file included. This is especially useful when reading same-format CSVs from different sources/directories.
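
A rough sketch of that full-path variant, assuming two source directories that each contain a file with the same name. The helper read_dirs, the directory names, and the sample rows are invented for illustration. Storing os.path.abspath(p) instead of the basename keeps rows from identically named files distinguishable.

```python
import glob
import os
import tempfile

import pandas as pd

col_names = ["Column A", "Column B", "Column C"]  # hypothetical column names


def read_dirs(directories, names):
    """Read every CSV in each directory, storing the absolute path per row."""
    paths = (
        os.path.abspath(p)
        for d in directories
        for p in sorted(glob.glob(os.path.join(d, "*.csv")))
    )
    # A generator expression is enough here; pd.concat consumes it directly.
    return pd.concat(
        (pd.read_csv(p, header=None, names=names).assign(filename=p) for p in paths),
        ignore_index=True,
    )


# Made-up layout: two sources that both contain a file called data.csv.
with tempfile.TemporaryDirectory() as root:
    dirs = []
    for source in ("source_a", "source_b"):
        d = os.path.join(root, source)
        os.makedirs(d)
        with open(os.path.join(d, "data.csv"), "w") as fh:
            fh.write("1,2,3\n")
        dirs.append(d)
    full_path_frame = read_dirs(dirs, col_names)

print(full_path_frame["filename"].tolist())
```

With only the basename, both rows would carry data.csv and the source directory would be lost.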
