Concat CSVs into a dataframe with a filename column

Question

I am trying to concatenate multiple CSVs that live in subfolders of my parent directory into a single data frame, while also adding a new filename column.

/ParentDirectory
│  
│
├───SubFolder 1
│       test1.csv
│
├───SubFolder 2
│       test2.csv
│
├───SubFolder 3
│       test3.csv
│       test4.csv
│
├───SubFolder 4
│       test5.csv

I can do something like this to concatenate all the CSVs into a single data frame:

import pandas as pd
import glob

files = glob.glob('/ParentDirectory/**/*.csv', recursive=True)
df = pd.concat([pd.read_csv(fp) for fp in files], ignore_index=True)

But is there a way to also add the filename of each file as a column in the final data frame, or do I have to loop through each individual file before concatenating?

The output should look like:

   Col1  Col2    file_name
0  AAAA   XYZ    test1.csv
1  BBBB   XYZ    test1.csv
2  CCCC   RST    test1.csv
3  DDDD   XYZ    test2.csv
4  AAAA   WXY    test3.csv
5  CCCC   RST    test4.csv
6  DDDD   XTZ    test4.csv
7  AAAA   TTT    test4.csv
8  CCCC   RRR    test4.csv
9  AAAA   QQQ    test4.csv

Answer 1

You can assign the file names on the fly:

from pathlib import Path

df = pd.concat([pd.read_csv(fp).assign(file_name=Path(fp).name)
                for fp in files], ignore_index=True)

Here pathlib.Path extracts the base name of the file (for example test1.csv) from the full path, and assign adds it as a constant file_name column to each frame before concatenation.
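For reference, the same idea can be written as a self-contained helper. This is only a sketch: the function name concat_with_file_name is made up for illustration, and it uses Path.rglob instead of the glob.glob call from the question, with sorted() added so the row order does not depend on the filesystem.

```python
from pathlib import Path

import pandas as pd


def concat_with_file_name(parent_dir):
    """Read every CSV below parent_dir and tag each row with its source file name.

    Hypothetical helper for illustration; not part of pandas.
    """
    # rglob walks all subfolders; sorted() keeps the result order deterministic.
    paths = sorted(Path(parent_dir).rglob("*.csv"))
    # assign() adds the base name (e.g. "test1.csv") as a constant column per file.
    frames = [pd.read_csv(p).assign(file_name=p.name) for p in paths]
    # Note: pd.concat raises ValueError if no CSV files were found.
    return pd.concat(frames, ignore_index=True)
```

Calling concat_with_file_name('/ParentDirectory') should then give one frame with a file_name column, assuming all files share the same columns.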

Answer 2

A possible solution (you may need to replace / in the code below with the path separator appropriate for your operating system):

df = pd.concat([pd.read_csv(fp).assign(file_name=fp.rsplit('/', 1)[-1])
                for fp in files], ignore_index=True)
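If hard-coding the separator is a concern, one option is to let the standard library split the path. The sketch below assumes os.path.basename is an acceptable substitute for the manual rsplit; the function name concat_csvs is hypothetical.

```python
import glob
import os

import pandas as pd


def concat_csvs(pattern):
    """Concatenate all CSVs matching a glob pattern, adding a file_name column.

    Hypothetical helper for illustration; not part of pandas.
    """
    # recursive=True lets '**' in the pattern match nested subfolders.
    files = sorted(glob.glob(pattern, recursive=True))
    # os.path.basename splits on the separator(s) of the current platform,
    # so no '/' or '\\' has to be written out by hand.
    frames = [pd.read_csv(fp).assign(file_name=os.path.basename(fp)) for fp in files]
    return pd.concat(frames, ignore_index=True)
```

With the layout from the question this would be called as concat_csvs('/ParentDirectory/**/*.csv').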

Answer 1: score 1, accepted, 2023-01-23 16:54:10

Answer 2: score 0, 2023-01-23 17:10:02
