'For' loop: creating a new column which takes into account new data from several csv files
I would like to automate a process that assigns labels across several files. Someone accidentally created many CSV files that look as follows:
filename 1: test_1.csv
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1
filename 2: test_2.csv
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1
2 1 23.1 No 0
2 41 12.4 Yes 0
3 15 8.2 No 0
3 12 63.1 No 0
filename 3: test_3.csv
Node Target Char1 Var2 Start
1 2 23.1 No 1
1 3 12.4 No 1
1 4 52.1 Yes 1
1 12 14.5 No 1
2 1 23.1 No 0
2 41 12.4 Yes 0
3 15 8.2 No 0
3 12 63.1 No 0
41 2 12.4 Yes 0
15 3 8.2 No 0
15 8 12.2 No 0
12 3 63.1 No 0
From what I can see, each CSV file is created including the data from previous runs. I would like to add a column that records which dataset each row comes from, without duplicates, i.e., considering only what was newly added in each dataset. This would mean, for instance, having a single CSV file including all data:
filename ALL: test_all.csv
Node Target Char1 Var2 Start File
1 2 23.1 No 1 1
1 3 12.4 No 1 1
1 4 52.1 Yes 1 1
1 12 14.5 No 1 1
2 1 23.1 No 0 2
2 41 12.4 Yes 0 2
3 15 8.2 No 0 2
3 12 63.1 No 0 2
41 2 12.4 Yes 0 3
15 3 8.2 No 0 3
15 8 12.2 No 0 3
12 3 63.1 No 0 3
I was thinking of calculating the difference between the datasets (in terms of rows) and adding a new column based on that. However, I am doing this one file at a time, which will not be feasible since I have, for example:
test_1.csv, test_2.csv, test_3.csv, ... , test_7.csv
filex_1.csv, filex_2.csv, ..., filex_7.csv
name_1.csv, name_2.csv, ..., name_7.csv
and so on.
The suffix _x goes from 1 to 7: the only change is in the base filename (e.g., filex, test, name, and many others).
Could you please give me some tips on how to run this in an easier and faster way, for example with a for loop that takes the suffix into account and creates a new column based on the new information from each individual file? I will be happy to provide more information and details if needed.
You can achieve that with pd.concat and its keys argument (docs).
frames = [df1, df2, ...] # your dataframes
file_names = ['file1', 'file2', ...] # the file names
df = pd.concat(frames, keys=file_names)
Node Target Char1 Var2 Start
file1 0 1 2 23.1 No 1
1 1 3 12.4 No 1
2 1 4 52.1 Yes 1
3 1 12 14.5 No 1
file2 0 1 2 23.1 No 1
1 1 3 12.4 No 1
2 1 4 52.1 Yes 1
3 1 12 14.5 No 1
4 2 1 23.1 No 0
5 2 41 12.4 Yes 0
6 3 15 8.2 No 0
7 3 12 63.1 No 0
file3 0 1 2 23.1 No 1
1 1 3 12.4 No 1
2 1 4 52.1 Yes 1
3 1 12 14.5 No 1
4 2 1 23.1 No 0
5 2 41 12.4 Yes 0
6 3 15 8.2 No 0
7 3 12 63.1 No 0
8 41 2 12.4 Yes 0
9 15 3 8.2 No 0
10 15 8 12.2 No 0
11 12 3 63.1 No 0
To keep duplicates that occur within the same file, we can temporarily move the level-1 index into a column, so that drop_duplicates only matches cross-file duplicates:
df = df.reset_index(level=1).drop_duplicates()
# get rid of the extra column
df = df.drop('level_1', axis=1)
# Set the file name index as new column
df = df.reset_index().rename(columns={'index':'File'})
File Node Target Char1 Var2 Start
0 file1 1 2 23.1 No 1
1 file1 1 3 12.4 No 1
2 file1 1 4 52.1 Yes 1
3 file1 1 12 14.5 No 1
4 file2 2 1 23.1 No 0
5 file2 2 41 12.4 Yes 0
6 file2 3 15 8.2 No 0
7 file2 3 12 63.1 No 0
8 file3 41 2 12.4 Yes 0
9 file3 15 3 8.2 No 0
10 file3 15 8 12.2 No 0
11 file3 12 3 63.1 No 0
You can try doing something like this.
# Importing libraries.
import os  # Misc OS interfaces.
import pandas as pd  # Data manipulation library.

# Constants.
PATH_DATA_FOLDER = ''  # Specify your data folder location.

# Let's get your base filenames and only keep the unique ones.
list_files = os.listdir(PATH_DATA_FOLDER)
list_filenames = list(pd.unique([file.split('_')[0] for file in list_files]))

# Now that we have our base filenames, we can loop through them, read the files and build dataframes.
for filename in list_filenames:
    # Get the list of columns from the first data file available and append the `ID` and `File` columns.
    list_columns = list(pd.read_csv(os.path.join(PATH_DATA_FOLDER, filename + '_1.csv')).columns) + ['ID', 'File']
    # Collect the per-file dataframes here.
    frames = []
    # Loop through files of the same type (test, filex, name...).
    # Here we loop through indices from 1 to 7.
    # You might also calculate these values dynamically.
    for x in range(1, 8):
        # Reading a data file.
        df = pd.read_csv(os.path.join(PATH_DATA_FOLDER, filename + '_{}.csv'.format(x)))
        # Filling the `File` column with the file index.
        df['File'] = x
        # Creating an ID column so duplicates within the same file are kept apart.
        df['ID'] = range(0, len(df))
        # Collecting the dataframe (DataFrame.append was removed in pandas 2.0,
        # so we gather the frames and concatenate them once at the end).
        frames.append(df)
    # Concatenating and resetting the dataframe indices.
    df_final = pd.concat(frames, ignore_index=True)
    # Removing duplicates using the initial columns plus `ID` (everything except `File`).
    df_final = df_final.drop_duplicates(subset=list_columns[0:-1])
    # Dropping the helper ID column.
    df_final = df_final.drop(['ID'], axis=1)
    # Printing out the dataframe.
    print(df_final)