Concatenating CSVs into dataframe with filename column
I am trying to concatenate multiple CSVs that live in subfolders of my parent directory into a single data frame, while also adding a new filename column.
/ParentDirectory
│
│
├───SubFolder 1
│ test1.csv
│
├───SubFolder 2
│ test2.csv
│
├───SubFolder 3
│ test3.csv
│ test4.csv
│
├───SubFolder 4
│ test5.csv
I can do something like this to concatenate all the CSVs into a single data frame:
import pandas as pd
import glob
files = glob.glob('/ParentDirectory/**/*.csv', recursive=True)
df = pd.concat([pd.read_csv(fp) for fp in files], ignore_index=True)
But is there a way to also add each file's name as a column in the final data frame, or do I have to loop through each file individually before concatenating? The output should look like:
Col1 Col2 file_name
0 AAAA XYZ test1.csv
1 BBBB XYZ test1.csv
2 CCCC RST test1.csv
3 DDDD XYZ test2.csv
4 AAAA WXY test3.csv
5 CCCC RST test4.csv
6 DDDD XTZ test4.csv
7 AAAA TTT test4.csv
8 CCCC RRR test4.csv
9 AAAA QQQ test4.csv
You can assign the file names on the fly:
from pathlib import Path
df = pd.concat([pd.read_csv(fp).assign(file_name=Path(fp).name)
for fp in files], ignore_index=True)
where pathlib.Path helps extract the base name of the file from the path.
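For reference, the same idea can be wrapped in a small helper. This is only a sketch: the function name concat_csvs_with_filename and its parent_dir argument are made up for illustration, and it uses Path.rglob in place of glob.glob, which should find the same files in the layout from the question.

```python
from pathlib import Path

import pandas as pd


def concat_csvs_with_filename(parent_dir):
    """Read every CSV under parent_dir (recursively) and tag each row with its source file name."""
    # rglob walks all subfolders; sorting by path makes the row order reproducible.
    paths = sorted(Path(parent_dir).rglob("*.csv"))
    if not paths:
        # pd.concat raises ValueError on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    # assign() returns a copy of each frame with the extra file_name column.
    frames = [pd.read_csv(p).assign(file_name=p.name) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

Calling concat_csvs_with_filename('/ParentDirectory') should then give a frame shaped like the expected output in the question, assuming all the CSVs share the same columns.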
A possible solution (you may need to replace / in the code below with the appropriate path separator for your operating system):
df = pd.concat([pd.read_csv(fp).assign(file_name=str.rsplit(
fp, '/', 1)[-1]) for fp in files], ignore_index=True)
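If you would rather not hard-code the separator, one option is to let the standard library pick it. The sketch below is an alternative, not part of the answer above; the helper name base_name is hypothetical, and os.path.basename applies the separator rules of the OS the code runs on.

```python
import os


def base_name(fp):
    """Return the final path component using the running OS's separator rules."""
    return os.path.basename(fp)


# Build a path with the OS-appropriate separator, then recover the file name.
example = os.path.join("ParentDirectory", "SubFolder 1", "test1.csv")
print(base_name(example))  # test1.csv
```

In the list comprehension this would read .assign(file_name=os.path.basename(fp)) instead of the str.rsplit call.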