
'For' loop: creating a new column which takes into account new data from several csv files

I would like to automate a process that assigns labels to several files. Someone accidentally created many CSV files that look like this:

filename 1: test_1.csv

Node Target Char1 Var2 Start
1      2     23.1  No    1
1      3     12.4  No    1
1      4     52.1  Yes   1
1      12    14.5  No    1

filename 2: test_2.csv

Node Target Char1 Var2 Start
1      2     23.1  No    1
1      3     12.4  No    1
1      4     52.1  Yes   1
1      12    14.5  No    1
2      1     23.1  No    0
2      41    12.4  Yes   0
3      15    8.2   No    0
3      12    63.1  No    0

filename 3: test_3.csv

Node Target Char1 Var2 Start
1      2     23.1  No    1
1      3     12.4  No    1
1      4     52.1  Yes   1
1      12    14.5  No    1
2      1     23.1  No    0
2      41    12.4  Yes   0
3      15    8.2   No    0
3      12    63.1  No    0
41     2     12.4  Yes   0
15     3     8.2   No    0
15     8     12.2  No    0
12     3     63.1  No    0

From what I can see, each CSV file includes the data from previous runs. I would like to add a column that records which dataset each row first came from, without duplicates, i.e., counting only the rows that were added in each new dataset. For instance, this would mean having a single CSV file that includes all the data:

filename ALL: test_all.csv

Node Target Char1 Var2 Start  File
1      2     23.1  No    1      1
1      3     12.4  No    1      1
1      4     52.1  Yes   1      1
1      12    14.5  No    1      1
2      1     23.1  No    0      2
2      41    12.4  Yes   0      2
3      15    8.2   No    0      2
3      12    63.1  No    0      2
41     2     12.4  Yes   0      3
15     3     8.2   No    0      3
15     8     12.2  No    0      3
12     3     63.1  No    0      3

I was thinking of calculating the difference between the datasets (in terms of rows) and adding a new column based on that. However, I am doing this one file at a time, which is not feasible since I have, for example:

test_1.csv, test_2.csv, test_3.csv, ... , test_7.csv
filex_1.csv, filex_2.csv, ..., filex_7.csv
name_1.csv, name_2.csv, ..., name_7.csv

and so on.

The suffix _x goes from 1 to 7; the only other change is the filename prefix (e.g., filex, test, name, and many others).

Could you please give me some tips on how to do this in an easier and faster way, for example with a for loop that takes the suffix into account and creates a new column based on the new information in each individual file? I will be happy to provide more information and details if needed.

You can achieve that with pd.concat and its keys argument (docs).

frames = [df1, df2, ...] # your dataframes
file_names = ['file1', 'file2', ...] # the file names

df = pd.concat(frames, keys=file_names)

Output

          Node  Target  Char1 Var2  Start
file1 0      1       2   23.1   No      1
      1      1       3   12.4   No      1
      2      1       4   52.1  Yes      1
      3      1      12   14.5   No      1
file2 0      1       2   23.1   No      1
      1      1       3   12.4   No      1
      2      1       4   52.1  Yes      1
      3      1      12   14.5   No      1
      4      2       1   23.1   No      0
      5      2      41   12.4  Yes      0
      6      3      15    8.2   No      0
      7      3      12   63.1   No      0
file3 0      1       2   23.1   No      1
      1      1       3   12.4   No      1
      2      1       4   52.1  Yes      1
      3      1      12   14.5   No      1
      4      2       1   23.1   No      0
      5      2      41   12.4  Yes      0
      6      3      15    8.2   No      0
      7      3      12   63.1   No      0
      8     41       2   12.4  Yes      0
      9     15       3    8.2   No      0
      10    15       8   12.2   No      0
      11    12       3   63.1   No      0
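
To see what keys does to the index, here is a minimal sketch with two stand-in frames (the data is made up for illustration and is not taken from the files above):

```python
import pandas as pd

# Two tiny stand-in frames; the second repeats the first frame's row and adds one.
df1 = pd.DataFrame({"Node": [1], "Target": [2]})
df2 = pd.DataFrame({"Node": [1, 2], "Target": [2, 1]})

combined = pd.concat([df1, df2], keys=["file1", "file2"])

# The outer index level holds the key, the inner level the original row position.
print(combined.index.tolist())

# Selecting by the outer key returns only the rows that came from that frame.
print(combined.loc["file2"])
```

The inner level is what makes it possible to tell "same row, repeated in a later file" apart from "same values, repeated inside one file".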

To keep duplicates within a file, we can temporarily turn the level-1 index into a column, so that drop_duplicates only matches duplicates across files.

df = df.reset_index(level=1).drop_duplicates()

# get rid of the extra column
df = df.drop('level_1', axis=1)

# Set the file name index as new column
df = df.reset_index().rename(columns={'index':'File'})

Output

     File  Node  Target  Char1 Var2  Start
0   file1     1       2   23.1   No      1
1   file1     1       3   12.4   No      1
2   file1     1       4   52.1  Yes      1
3   file1     1      12   14.5   No      1
4   file2     2       1   23.1   No      0
5   file2     2      41   12.4  Yes      0
6   file2     3      15    8.2   No      0
7   file2     3      12   63.1   No      0
8   file3    41       2   12.4  Yes      0
9   file3    15       3    8.2   No      0
10  file3    15       8   12.2   No      0
11  file3    12       3   63.1   No      0
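
The steps above can be wrapped in a function and called once per filename prefix. This is only a sketch: the function name and its parameters (folder, prefix, suffixes) are hypothetical, and it assumes every later file repeats the earlier rows in the same order, as in the sample data.

```python
import os
import pandas as pd


def combine_with_file_column(folder, prefix, suffixes=range(1, 8)):
    """Stack <prefix>_<n>.csv files and label each row with the first file it appears in.

    `folder`, `prefix` and `suffixes` are hypothetical parameters for this sketch.
    """
    suffixes = list(suffixes)
    frames = [
        pd.read_csv(os.path.join(folder, "{}_{}.csv".format(prefix, n)))
        for n in suffixes
    ]
    # keys=... puts the file number in the outer index level.
    df = pd.concat(frames, keys=suffixes)
    # Expose the row position as a column so rows repeated inside one file survive,
    # while rows repeated at the same position in a later file are dropped.
    df = df.reset_index(level=1).drop_duplicates()
    df = df.drop("level_1", axis=1)
    # Move the file number from the index into a `File` column.
    df = df.rename_axis("File").reset_index()
    # Put `File` last, matching the layout asked for in the question.
    return df[[c for c in df.columns if c != "File"] + ["File"]]
```

A loop such as `for prefix in ["test", "filex", "name"]:` could then call the function for each group and write the result with `to_csv(..., index=False)`.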

You can try something like this.

# Importing libraries.
import os  # Misc OS interfaces.
import pandas as pd  # Data manipulation library.

# Constants.
PATH_DATA_FOLDER = ''  # Specify your data folder location.

# Let's get the filename prefixes (test, filex, name...) and only keep unique ones.
list_files = [file for file in os.listdir(PATH_DATA_FOLDER) if file.endswith('.csv')]
list_filenames = sorted(set(file.rsplit('_', 1)[0] for file in list_files))
# Now that we have the prefixes, we can loop through them, read the files and build dataframes.
for filename in list_filenames:
    # Collect one dataframe per file of the same type (test, filex, name...).
    list_frames = []
    # Here we will loop through indices from 1 to 7.
    # You might also calculate these values dynamically.
    for x in range(1, 8):
        # Reading a data file.
        df = pd.read_csv(os.path.join(PATH_DATA_FOLDER, '{}_{}.csv'.format(filename, x)))
        # Creating an ID column (the row position) to track duplicates across different files.
        df['ID'] = range(len(df))
        # Filling the `File` column with the file index.
        df['File'] = x
        list_frames.append(df)
    # DataFrame.append was removed in pandas 2.0, so concatenate all frames at once.
    df_final = pd.concat(list_frames, ignore_index=True)
    # Removing duplicates using every column except `File`; the earliest file's row is kept.
    df_final = df_final.drop_duplicates(subset=list(df_final.columns.drop('File')))
    # Dropping the unused ID column.
    df_final = df_final.drop(['ID'], axis=1)
    # Printing out the dataframe.
    print(df_final)
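
The comment about calculating the indices dynamically could be handled along these lines. This is a sketch only: the helper name group_numbered_csvs is hypothetical, and it assumes every file of interest is named <prefix>_<number>.csv.

```python
import re
from collections import defaultdict


def group_numbered_csvs(filenames):
    """Map each filename prefix to the sorted list of numeric suffixes found for it."""
    pattern = re.compile(r"^(?P<prefix>.+)_(?P<num>\d+)\.csv$")
    groups = defaultdict(list)
    for name in filenames:
        match = pattern.match(name)
        if match:  # files without a numeric suffix, such as test_all.csv, are skipped
            groups[match.group("prefix")].append(int(match.group("num")))
    return {prefix: sorted(nums) for prefix, nums in groups.items()}


# Example with made-up names; suffixes are sorted numerically, so 10 comes after 2.
print(group_numbered_csvs(["test_1.csv", "test_2.csv", "name_10.csv", "name_2.csv", "test_all.csv"]))
```

Passing os.listdir(PATH_DATA_FOLDER) to this helper would give both the prefixes and the suffix range for each prefix, replacing the hard-coded range(1, 8).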
