'For' loop: creating a new column which takes into account new data from several csv files
I would like to automate a process that assigns labels across several files. Someone accidentally created many CSV files that look as follows:
filename 1: test_1.csv
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1
filename 2: test_2.csv
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1
2 1 23.1 No 0
2 41 12.4 Yes 0
3 15 8.2 No 0
3 12 63.1 No 0
filename 3: test_3.csv
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1
2 1 23.1 No 0
2 41 12.4 Yes 0
3 15 8.2 No 0
3 12 63.1 No 0
41 2 12.4 Yes 0
15 3 8.2 No 0
15 8 12.2 No 0
12 3 63.1 No 0
From what I can see, each CSV file is created including the data from previous runs. I would like to add a column that records which dataset each row comes from, without duplicates, i.e., considering only what was newly added in each dataset. This would mean, for instance, having a single CSV file including all data:
filename ALL: test_all.csv
Node Target Char1 Var2 Start File
1 2 23.1 No 1 1
1 3 12.4 No 1 1
1 4 52.1 Yes 1 1
1 12 14.5 No 1 1
2 1 23.1 No 0 2
2 41 12.4 Yes 0 2
3 15 8.2 No 0 2
3 12 63.1 No 0 2
41 2 12.4 Yes 0 3
15 3 8.2 No 0 3
15 8 12.2 No 0 3
12 3 63.1 No 0 3
I was thinking of calculating the difference between the datasets (in terms of rows) and adding a new column based on that. However, I am doing this one file at a time, which will not be feasible since I have, for example:
test_1.csv, test_2.csv, test_3.csv, ... , test_7.csv
filex_1.csv, filex_2.csv, ..., filex_7.csv
name_1.csv, name_2.csv, ..., name_7.csv
and so on.
The suffix _x goes from 1 to 7: the only change is in the base filename (e.g., filex, test, name, and many others).
Could you please give me some tips on how to run this in an easier and faster way, for example with a for loop that takes the suffix into account and creates a new column based on the new information from each individual file? I will be happy to provide more information and details if needed.
You can achieve that with pd.concat and its keys argument (docs).
frames = [df1, df2, ...] # your dataframes
file_names = ['file1', 'file2', ...] # the file names
df = pd.concat(frames, keys=file_names)
Node Target Char1 Var2 Start
file1 0 1 2 23.1 No 1
1 1 3 12.4 No 1
2 1 4 52.1 Yes 1
3 1 12 14.5 No 1
file2 0 1 2 23.1 No 1
1 1 3 12.4 No 1
2 1 4 52.1 Yes 1
3 1 12 14.5 No 1
4 2 1 23.1 No 0
5 2 41 12.4 Yes 0
6 3 15 8.2 No 0
7 3 12 63.1 No 0
file3 0 1 2 23.1 No 1
1 1 3 12.4 No 1
2 1 4 52.1 Yes 1
3 1 12 14.5 No 1
4 2 1 23.1 No 0
5 2 41 12.4 Yes 0
6 3 15 8.2 No 0
7 3 12 63.1 No 0
8 41 2 12.4 Yes 0
9 15 3 8.2 No 0
10 15 8 12.2 No 0
11 12 3 63.1 No 0
To keep duplicates that occur within the same file, we can temporarily move the level-1 index into a column, so that drop_duplicates only matches cross-file duplicates:
df = df.reset_index(level=1).drop_duplicates()
# get rid of the extra column
df = df.drop('level_1', axis=1)
# Set the file name index as new column
df = df.reset_index().rename(columns={'index':'File'})
File Node Target Char1 Var2 Start
0 file1 1 2 23.1 No 1
1 file1 1 3 12.4 No 1
2 file1 1 4 52.1 Yes 1
3 file1 1 12 14.5 No 1
4 file2 2 1 23.1 No 0
5 file2 2 41 12.4 Yes 0
6 file2 3 15 8.2 No 0
7 file2 3 12 63.1 No 0
8 file3 41 2 12.4 Yes 0
9 file3 15 3 8.2 No 0
10 file3 15 8 12.2 No 0
11 file3 12 3 63.1 No 0
You can try doing something like this.
# Importing libraries.
import os  # Misc OS interfaces.
import pandas as pd  # Data manipulation library.

# Constants.
PATH_DATA_FOLDER = ''  # Specify your data folder location.

# Let's get your base filenames and only keep the unique ones.
list_files = os.listdir(PATH_DATA_FOLDER)
list_filenames = list(pd.unique([file.split('_')[0] for file in list_files]))

# Now that we have our base filenames, we can loop through them, read the files and build dataframes.
for filename in list_filenames:
    # Get the list of columns from the first data file available and append the `ID` and `File` columns.
    list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, filename + '_1.csv')).columns) + ['ID', 'File']
    # Collect the per-file dataframes here.
    frames = []
    # Loop through files of the same type (test, filex, name...).
    # Here we loop through indices from 1 to 7.
    # You might also calculate these values dynamically.
    for x in range(1, 8):
        # Reading a data file.
        df = pd.read_csv(os.path.join(PATH_DATA_FOLDER, filename + '_{}.csv'.format(x)))
        # Filling the `File` column with the file index.
        df['File'] = x
        # Creating an ID column so duplicates within the same file are kept apart.
        df['ID'] = range(0, len(df))
        # Collecting the dataframe (DataFrame.append was removed in pandas 2.0,
        # so we gather the frames and concatenate them once at the end).
        frames.append(df)
    # Concatenating and resetting the dataframe indices.
    df_final = pd.concat(frames, ignore_index=True)
    # Removing duplicates using the initial columns plus `ID` (everything except `File`).
    df_final = df_final.drop_duplicates(subset=list_columns[0:-1])
    # Dropping the helper ID column.
    df_final = df_final.drop(['ID'], axis=1)
    # Printing out the dataframe.
    print(df_final)