Currently I have this code, which only reads one specific txt file and splits it into different columns. Each txt file is stored in the same directory and looks like this:
0 0.712518 0.615250 0.439180 0.206500
1 0.635078 0.811750 0.292786 0.092500
The code I wrote:
df_2a = spark.read.format('csv').options(header='false').load("/mnt/datasets/model1/train/labels/2a.txt").toPandas()
df_2a.columns = ['Value']
df_2a_split = df_2a['Value'].str.split(' ', n=0, expand=True)
df_2a_split.columns = ['class','c1','c2','c3','c4']
display(df_2a_split)
And the output is like this:
class c1 c2 c3 c4
0 0.712518 0.61525 0.43918 0.2065
1 0.635078 0.81175 0.292786 0.0925
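The split step can be checked in plain pandas without a Spark cluster; a minimal sketch using the two sample rows above (the single-column frame mimics what `toPandas()` returns here):

```python
import pandas as pd

# Single-column frame mimicking the result of spark.read...toPandas()
df_2a = pd.DataFrame({'Value': ['0 0.712518 0.615250 0.439180 0.206500',
                                '1 0.635078 0.811750 0.292786 0.092500']})

# Split the space-separated string into five new columns
df_2a_split = df_2a['Value'].str.split(' ', expand=True)
df_2a_split.columns = ['class', 'c1', 'c2', 'c3', 'c4']
print(df_2a_split)
```

Note that the resulting columns are still strings; `df_2a_split.astype(float)` (excluding `class`) would convert them if numeric values are needed.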
However, I want to ingest all .txt files in the directory, including the filename as the first column of the pandas DataFrame. The expected result looks like this:
file_name class c1 c2 c3 c4
2a.txt 0 0.712518 0.61525 0.43918 0.2065
2a.txt 1 0.635078 0.81175 0.292786 0.0925
2b.txt 2 0.551273 0.5705 0.30198 0.0922
2b.txt 0 0.550212 0.31125 0.486563 0.2455
import os
import pandas as pd

# `spark` is the SparkSession already available in a Databricks notebook; it is not an importable module
directory = '/mnt/datasets/model1/train/labels/'

# Get all the filenames within your directory
files = []
for file in os.listdir(directory):
    if os.path.isfile(os.path.join(directory, file)):
        files.append(file)

# Create an empty df and fill it by looping over your files
df = pd.DataFrame()
for file in files:
    df_temp = spark.read.format('csv').options(header='false').load(directory + file).toPandas()
    df_temp.columns = ['Value']
    df_temp = df_temp['Value'].str.split(' ', n=0, expand=True)
    df_temp.columns = ['class', 'c1', 'c2', 'c3', 'c4']
    # Tag every row with the file it came from before concatenating;
    # assigning `files` to the column after the loop would fail, since
    # there is one row per line, not one row per file
    df_temp.insert(0, 'file_name', file)
    df = pd.concat([df, df_temp], ignore_index=True)

display(df)
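If Spark is not strictly required, the whole loop can also be done in plain pandas, since `pd.read_csv` can parse the space-separated files directly. A sketch, assuming the same directory layout; the helper name `load_labels` and the `*.txt` glob pattern are my own choices, not from the original code:

```python
import glob
import os
import pandas as pd

def load_labels(directory):
    """Read every .txt label file in `directory` into one DataFrame,
    tagging each row with its source file name."""
    frames = []
    for path in sorted(glob.glob(os.path.join(directory, '*.txt'))):
        # Space-separated values, no header row in the files
        df_temp = pd.read_csv(path, sep=' ', header=None,
                              names=['class', 'c1', 'c2', 'c3', 'c4'])
        # First column carries the originating file name
        df_temp.insert(0, 'file_name', os.path.basename(path))
        frames.append(df_temp)
    return pd.concat(frames, ignore_index=True)

# df = load_labels('/mnt/datasets/model1/train/labels/')
```

This also converts the numeric columns to floats automatically, which the string-splitting approach does not.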