Read multiple *.txt files into Pandas Dataframe with filename as the first column
Currently I have code that reads only one specific txt file and splits it into different columns. Every txt file is stored in the same directory and looks like this:
0 0.712518 0.615250 0.439180 0.206500
1 0.635078 0.811750 0.292786 0.092500
The code I wrote:
df_2a = spark.read.format('csv').options(header='false').load("/mnt/datasets/model1/train/labels/2a.txt").toPandas()
df_2a.columns = ['Value']
df_2a_split = df_2a['Value'].str.split(' ', n=0, expand=True)
df_2a_split.columns = ['class', 'c1', 'c2', 'c3', 'c4']
display(df_2a_split)
And the output looks like this:
class c1 c2 c3 c4
0 0.712518 0.61525 0.43918 0.2065
1 0.635078 0.81175 0.292786 0.0925
However, I want to ingest all the txt files in the directory and include each filename as the first column of the pandas dataframe. The expected result looks like this:
file_name class c1 c2 c3 c4
2a.txt 0 0.712518 0.61525 0.43918 0.2065
2a.txt 1 0.635078 0.81175 0.292786 0.0925
2b.txt 2 0.551273 0.5705 0.30198 0.0922
2b.txt 0 0.550212 0.31125 0.486563 0.2455
import os
import pandas as pd
# Note: `spark` is the SparkSession provided by the Databricks/PySpark runtime,
# not an importable module, so `import spark` is removed.

directory = '/mnt/datasets/model1/train/labels/'

# Get all the filenames within your directory
files = []
for file in os.listdir(directory):
    if os.path.isfile(os.path.join(directory, file)):
        files.append(file)

# Create an empty df and fill it by looping over your files
df = pd.DataFrame()
for file in files:
    df_temp = spark.read.format('csv').options(header='false').load(directory + file).toPandas()
    # The file is read as a single raw column; split it into the five fields
    df_temp.columns = ['Value']
    df_temp = df_temp['Value'].str.split(' ', n=0, expand=True)
    df_temp.columns = ['class', 'c1', 'c2', 'c3', 'c4']
    # Tag every row with the file it came from (assigning the `files` list
    # to df['file_name'] afterwards would fail: one filename per row is needed,
    # not one per file)
    df_temp.insert(0, 'file_name', file)
    df = pd.concat([df, df_temp], ignore_index=True)

display(df)
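If Spark is not actually required, the same result can be produced with plain pandas. Below is a minimal sketch that reads every `*.txt` file with `pd.read_csv` and prepends the filename column; the temporary directory and the sample file `2a.txt` are stand-ins for `/mnt/datasets/model1/train/labels/`, used only to make the example self-contained:

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in for /mnt/datasets/model1/train/labels/ with one sample file
directory = tempfile.mkdtemp()
with open(os.path.join(directory, '2a.txt'), 'w') as f:
    f.write('0 0.712518 0.615250 0.439180 0.206500\n'
            '1 0.635078 0.811750 0.292786 0.092500\n')

frames = []
for path in sorted(glob.glob(os.path.join(directory, '*.txt'))):
    # Each file is space-delimited with no header row
    df_temp = pd.read_csv(path, sep=' ', header=None,
                          names=['class', 'c1', 'c2', 'c3', 'c4'])
    # Prepend the originating filename as the first column
    df_temp.insert(0, 'file_name', os.path.basename(path))
    frames.append(df_temp)

df = pd.concat(frames, ignore_index=True)
print(df)
```

Collecting the per-file frames in a list and calling `pd.concat` once is also faster than concatenating inside the loop, since each in-loop `pd.concat` copies the accumulated dataframe.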