
Add calculated columns to each DataFrame inside a Panel without a for-loop

I have ~300 .csv files of instrumentation data, all with the same number of rows and columns. Since each .csv file represents a day and the structure is the same, I figured it would be best to read each .csv into a Pandas DataFrame and then put them into a Panel object to perform faster calculations.

I would like to add calculated columns to each DataFrame inside the Panel, preferably without a for-loop. I'm attempting to use the apply function on the Panel and name each new column after the original column with a 'p' appended (for easier indexing later). Below is the code I am currently using.

import pandas as pd
import numpy as np
import os.path

dir = "data/testsetup1/"
filelist = []

def initializeDataFrames():
    for f in os.listdir(dir):
        if ".csv" in f:
            filelist.append(dir + f)

    dd={}
    for f in filelist:
        dd[f[len(dir):(len(f)-4)]] = pd.read_csv(f)

    return pd.Panel(dd)

def newCalculation(pointSeries):
    # test function, more complex functions to follow
    pointSeriesManipulated = pointSeries.copy()

    # scale every value by 1/len of the series
    percentageMove = 1.0/float(len(pointSeriesManipulated))

    return pointSeriesManipulated * percentageMove


myPanel = initializeDataFrames()
#calculatedPanel = myPanel.join(lambda x: myPanel[x,:,0:17].apply(lambda y:newCalculation(myPanel[x,:,0:17].ix[y])), rsuffix='p')
calculatedPanel = myPanel.ix[:,:,0:17].join(myPanel.ix[:,:,0:17].apply(lambda x: newCalculation(x), axis=2), rsuffix='p')

print(calculatedPanel.values)

The code above currently duplicates each DataFrame with the calculated columns instead of appending them to each DataFrame. The apply function I'm using operates on a Series object, which in this case is the column passed to it. My question: how can I use the apply function on a Panel so that it calculates new columns and appends them to each DataFrame?

Thanks in advance.

If you want to add a new column via apply, simply assign the output of the apply operation to the column you want:

myPanel['new_column_suffix_p'] = myPanel.apply(newCalculation)
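As a rough illustration on a single DataFrame rather than a Panel (Panel was removed in pandas 0.25, so this will not run against the original object), the sketch below applies the question's test calculation to every column and joins the results back with a 'p' suffix. The column names and values are made up for the example.

```python
import pandas as pd

def newCalculation(pointSeries):
    # Same test calculation as in the question: scale by 1/len
    return pointSeries * (1.0 / len(pointSeries))

# Hypothetical data standing in for one day's .csv file
df = pd.DataFrame({
    "col0": [1.0, 2.0, 3.0, 4.0],
    "col1": [10.0, 20.0, 30.0, 40.0],
})

# apply runs newCalculation on each column; add_suffix renames
# the results to col0p, col1p so they do not clash with the originals
calculated = df.apply(newCalculation).add_suffix("p")

# join places the calculated columns next to the original ones
result = df.join(calculated)
print(result)
```

Renaming with add_suffix before the join avoids relying on rsuffix, which only takes effect when column names overlap.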

If you want multiple columns, you can write a custom function for this:

def calc_new_columns(rowset):
    rowset['newcolumn1'] = calculation1(rowset.columnofinterest)
    rowset['newcolumn2'] = calculation2(rowset.columnofinterest2 + rowset.column3)
    return rowset
myPanel = myPanel.apply(calc_new_columns)
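A runnable sketch of the same pattern on a plain DataFrame might look like the following. The column names `a` and `b` and both calculations are placeholders invented for the example; they stand in for `columnofinterest`, `calculation1`, and so on. It uses `DataFrame.assign`, which returns a new frame and leaves the input untouched.

```python
import pandas as pd

def calc_new_columns(frame):
    # Return a copy of the frame with two extra columns.
    # Both calculations are placeholders, not the asker's real ones.
    return frame.assign(
        newcolumn1=frame["a"] * 2,
        newcolumn2=frame["a"] + frame["b"],
    )

# Hypothetical input data
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# pipe passes the whole frame to the function and returns its result
result = df.pipe(calc_new_columns)
print(result)
```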

On a broader note: you are manually handling sections of your data frame when it looks like you could do the new column operation all at once. I would suggest importing the first csv file into a data frame, then looping through the remaining 299 csv files and using DataFrame.append to add each one to the original data frame. You would then have one data frame holding all the data, which simply needs the calculated column added.
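One way to sketch the single-frame idea is shown below. It swaps in `pd.concat` with keys instead of repeated `DataFrame.append` calls (append was removed in pandas 2.0), and it uses small in-memory frames as stand-ins for the `pd.read_csv` results. The day labels and column names are assumptions. Grouping by the day level keeps the calculation per file, so `len` inside `newCalculation` still refers to one day's rows.

```python
import pandas as pd

def newCalculation(pointSeries):
    # Same test calculation as in the question: scale by 1/len
    return pointSeries * (1.0 / len(pointSeries))

# Hypothetical stand-ins for pd.read_csv(...) results, keyed by day
frames = {
    "day1": pd.DataFrame({"col0": [1.0, 2.0], "col1": [4.0, 8.0]}),
    "day2": pd.DataFrame({"col0": [3.0, 5.0], "col1": [6.0, 2.0]}),
}

# Stack all days into one frame with a (day, row) MultiIndex
combined = pd.concat(frames, names=["day", "row"])

# Run the calculation separately within each day, then rename the
# results so they can sit beside the original columns
calculated = (
    combined.groupby(level="day")
    .transform(newCalculation)
    .add_suffix("p")
)

result = combined.join(calculated)
print(result)
```

A single day can then be pulled back out with `result.loc["day1"]`.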

Nit: "dir" is a built-in function; you shouldn't use it as a variable name.

Try using a double transpose:

p = pd.Panel(np.random.rand(4,10,17),
             items=pd.date_range('2013/11/10',periods=4),
             major_axis=range(10),
             minor_axis=map(lambda x: "col%d" % x, range(17)))

pT = p.transpose(2,1,0)
pT = pT.join(pT.apply(newCalculation, axis='major'), rsuffix='p')
p = pT.transpose(2,1,0)
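Note that Panel was removed in pandas 0.25, so the snippet above needs an older pandas. As a loop-free alternative (a substitute technique, not the answerer's method), the same append-calculated-columns step can be sketched on a 3-D NumPy array with axes (day, row, column). The shape and random data mirror the example above and are otherwise arbitrary.

```python
import numpy as np

# Hypothetical data: 4 days, 10 rows per day, 17 columns
rng = np.random.default_rng(0)
data = rng.random((4, 10, 17))

# Same test calculation as newCalculation: each value divided by
# the number of rows in its day
calculated = data / data.shape[1]

# Append the calculated columns after the originals along the
# column axis, giving 34 columns per day
combined = np.concatenate([data, calculated], axis=2)
print(combined.shape)
```

This only works directly for calculations that can be written as array operations; column labels would have to be tracked separately.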
