
Read single column from csv file and rename with the name of the text file

I'm using a for loop to cycle through numerous text files, selecting a single column (named ppm) from each, and appending these columns to a new data frame. I'd like the columns in the new data frame to have the name of the corresponding text file, but I'm not sure how to do this.

My code is:

all_files=glob.glob(os.path.join(path,"*.txt"))
df1=pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep='\s+', header = 0, usecols = ['ppm'])
    df1 = pd.concat([df,df1],axis=1)

At the moment every column in the new dataframe is called 'ppm'.

I used to have this code:

df1=pd.DataFrame()
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep='\s+', header = 0)
    df1[file_name] = df['ppm']

But I ran into the following warning when I tried to run the code for a large number of files (~100s):

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
  df1[file_name] = df['ppm'].copy()

Use concat outside the loop: append the DataFrames to a list, renaming the ppm column to the file name as you go:

all_files=glob.glob(os.path.join(path,"*.txt"))

dfs = []
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep='\s+', header = 0, usecols = ['ppm'])
    dfs.append(df.rename(columns={'ppm':file_name}))
df_big = pd.concat(dfs, axis=1)
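
With two hypothetical input files one.txt and two.txt (made-up names, purely for illustration), df_big then carries one column per file:

print(df_big.columns.tolist())   # e.g. ['two.txt', 'one.txt']
print(df_big.head())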

Assuming the index is equal, add all your data into a dictionary:

all_files=glob.glob(os.path.join(path,"*.txt"))
data_dict = {}
for file in all_files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file, index_col=None, sep='\s+', header = 0, usecols = ['ppm'])
    data_dict[file_name] = df['ppm']
    
df1 = pd.DataFrame(data_dict)
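
If the indexes are not equal (for example, files of different lengths), building the frame from a dict of Series still works: pandas aligns on the index and pads the missing positions with NaN. A minimal sketch with made-up Series, not real files, to illustrate the alignment:

import pandas as pd

s_a = pd.Series([1.0, 2.0, 3.0])   # pretend 'ppm' column from a longer file
s_b = pd.Series([4.0, 5.0])        # pretend 'ppm' column from a shorter file

demo = pd.DataFrame({'a.txt': s_a, 'b.txt': s_b})
print(demo)
#    a.txt  b.txt
# 0    1.0    4.0
# 1    2.0    5.0
# 2    3.0    NaN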

Use df.rename() to rename the column of the dataframe:

df1 = pandas.DataFrame()  # start from an empty frame so concat works on the first iteration
for file in all_files:
    file_name = os.path.basename(file)
    print(file_name)
    df = pandas.read_csv(file, index_col=None, sep=',', header = 0, usecols = ['ppm'])
    df.rename(columns={'ppm': file_name}, inplace=True)
    df1 = pandas.concat([df,df1],axis=1)

Output:

   two.txt  one.txt
0        9        3
1        0        6

Rather than concatenating and appending dataframes as you iterate over your list of files, you could consider building a dictionary of the relevant data and then constructing your dataframe just once. Like this:

import csv
import pandas as pd
import glob
import os

PATH = ''
COL = 'ppm'
FILENAME = 'filename'
D = {COL: [], FILENAME: []}
for file in glob.glob(os.path.join(PATH, '*.csv')):
    with open(file, newline='') as infile:
        for row in csv.DictReader(infile):
            if COL in row:
                D[COL].append(row[COL])
                D[FILENAME].append(file)

df = pd.DataFrame(D)
print(df)
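
Note that this builds a long-format frame (one row per reading, plus a filename column) rather than the wide layout of the earlier answers. If the wide form is wanted, one possible sketch, assuming the df built above, is to number the rows within each file and pivot:

# hypothetical follow-up: 'row' is a helper column added here, not part of the answer above
df_wide = (df.assign(row=df.groupby(FILENAME).cumcount())
             .pivot(index='row', columns=FILENAME, values=COL))
print(df_wide)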
