
Using Pandas to Iteratively Add Columns to a Dataframe

I have some relatively simple code that I'm struggling to put together. I have a CSV that I've read into a dataframe. The CSV is panel data (i.e., each row is a unique company-year observation). I have two columns that I want to run a function on, and I then want to create new variables from the function's output.

Here's the code I have so far:

#Loop through rows in a CSV file
for index, rows in df.iterrows():
    #Start at column 6 and go to the end of the file
    for row in rows[6:]:
        data = perform_function1( row )
        output =  perform_function2(data)    
        df.ix[index, 'new_variable'] = output
        print output

I want this code to iterate from column 6 to the end of the file (e.g., I have two columns, Column6 and Column7, that I want to run the functions on) and then create new columns from the results (e.g., Output6 and Output7). The code above returns the output for Column7, but I can't figure out how to create a variable that captures the outputs from both columns (i.e., a new variable that isn't overwritten by the loop). I searched Stack Overflow and didn't see anything that immediately related to my question (maybe because I'm too much of a noob?). I would really appreciate your help.

Thanks,

TT

PS: I'm not sure if I've provided enough detail. Please let me know if I need to provide more.

Operating iteratively doesn't take advantage of Pandas' capabilities. Pandas' strength is in applying operations efficiently across the whole dataframe, rather than in iterating row by row. It's a good fit for a task like this, where you want to chain a few functions across your data. You should be able to accomplish your whole task in a single line.

df["new_variable"] = df.iloc[:, 6].apply(perform_function1).apply(perform_function2)

perform_function1 will be applied to the selected column's value in each row, and perform_function2 will be applied to the results of the first function.
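As a minimal sketch of that chained call: the two functions below are hypothetical stand-ins (the question does not show the real ones), and the frame is made up so that the column at position 6 holds simple integers.

```python
import pandas as pd

# Hypothetical stand-ins; the real functions are not shown in the question.
def perform_function1(value):
    return value * 2

def perform_function2(value):
    return value + 1

# Made-up frame with eight columns, each holding the values 1, 2, 3.
df = pd.DataFrame({f"Column{i}": [1, 2, 3] for i in range(8)})

# Series.apply calls the function once per value, so chaining two calls
# feeds the output of the first function into the second.
df["new_variable"] = (
    df.iloc[:, 6].apply(perform_function1).apply(perform_function2)
)
print(df["new_variable"].tolist())  # [3, 5, 7]
```

Chaining keeps the intermediate result out of the dataframe; if the intermediate values are needed later, they could be assigned to their own column first.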

If you want to apply a function to certain columns in a dataframe:

# Select the sixth column (position 5) as a Series
column6 = df.iloc[:, 5]
# Apply perform_function1 to each value in the column
output6 = column6.apply(perform_function1)
df["new_variable"] = output6

Pandas is quite slow when acting row by row: you're much better off using the append, concat, merge, or join functionality on the whole dataframe. (DataFrame.append was removed in pandas 2.0; concat covers the same use.)

To give some idea why, let's consider a random DataFrame example:

import numpy as np
import pandas as pd
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df2 = df.copy()
# operation to concatenate two dataframes
%timeit pd.concat([df2, df])
1000 loops, best of 3: 737 µs per loop
# single row lookup
%timeit df.loc['2013-01-01']
1000 loops, best of 3: 251 µs per loop
# single element operation
%timeit df.loc['2013-01-01', 'A'] = 3
1000 loops, best of 3: 218 µs per loop

Notice how efficiently Pandas handles operations on an entire DataFrame, and how inefficiently it handles operations on single elements: concatenating two whole frames takes only about three times as long as setting one element.

If we expand this, the same tendency occurs, only much more pronounced:

df = pd.DataFrame(np.random.randn(200, 300))
# single element operation
%timeit df.loc[1,1] = 3
10000 loops, best of 3: 74.6 µs per loop
df2 = df.copy()
# full dataframe operation
%timeit pd.concat([df2, df])
1000 loops, best of 3: 830 µs per loop

Per element, Pandas handles the whole 200x300 DataFrame (60,000 elements in 830 µs) roughly 5,400 times faster than it handles a single element (74.6 µs). In short, iterating would defeat the whole purpose of using Pandas. If you're accessing a dataframe element by element, consider using a dictionary instead.
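The same comparison can be run outside IPython with the standard timeit module. The sketch below is an illustration, not a rigorous benchmark: the two helper functions are made up for this example, and the timings will vary by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(200, 300))

def add_one_elementwise(frame):
    """Add 1 to every cell by visiting the cells one at a time."""
    out = frame.copy()
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out.iat[r, c] = out.iat[r, c] + 1
    return out

def add_one_vectorized(frame):
    """Add 1 to every cell with a single whole-frame operation."""
    return frame + 1

# Time one run of each approach on the same 200x300 frame.
loop_time = timeit.timeit(lambda: add_one_elementwise(df), number=1)
vec_time = timeit.timeit(lambda: add_one_vectorized(df), number=1)
print(f"element-wise: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```

Both helpers produce the same values; only the access pattern differs, which is what the timing isolates.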
