Apply a function to each row of a pandas DataFrame to create two new columns

Question

I have a pandas DataFrame, st, containing multiple columns:

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 53732 entries, 1993-01-07 12:23:58 to 2012-12-02 20:06:23
Data columns:
Date(dd-mm-yy)_Time(hh-mm-ss)       53732  non-null values
Julian_Day                          53732  non-null values
AOT_1020                            53716  non-null values
AOT_870                             53732  non-null values
AOT_675                             53188  non-null values
AOT_500                             51687  non-null values
AOT_440                             53727  non-null values
AOT_380                             51864  non-null values
AOT_340                             52852  non-null values
Water(cm)                           51687  non-null values
%TripletVar_1020                    53710  non-null values
%TripletVar_870                     53726  non-null values
%TripletVar_675                     53182  non-null values
%TripletVar_500                     51683  non-null values
%TripletVar_440                     53721  non-null values
%TripletVar_380                     51860  non-null values
%TripletVar_340                     52846  non-null values
440-870Angstrom                     53732  non-null values
380-500Angstrom                     52253  non-null values
440-675Angstrom                     53732  non-null values
500-870Angstrom                     53732  non-null values
340-440Angstrom                     53277  non-null values
Last_Processing_Date(dd/mm/yyyy)    53732  non-null values
Solar_Zenith_Angle                  53732  non-null values
dtypes: datetime64[ns](1), float64(22), object(1)

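I want to create two new columns for this DataFrame by applying a function to each row. I don't want to call the function more than once (for example, by making two separate apply calls), because it is computationally intensive. I have tried two approaches, and neither works:

Using apply:

I wrote a function that takes a Series and returns a tuple of the values I want: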

def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return (a, b)

Applying this to the DataFrame raises an error:

st.apply(calculate, axis=1)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-248-acb7a44054a7> in <module>()
----> 1 st.apply(calculate, axis=1)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, args, **kwds)
   4191                     return self._apply_raw(f, axis)
   4192                 else:
-> 4193                     return self._apply_standard(f, axis)
   4194             else:
   4195                 return self._apply_broadcast(f, axis)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures)
   4274                 index = None
   4275 
-> 4276             result = self._constructor(data=results, index=index)
   4277             result.rename(columns=dict(zip(range(len(res_index)), res_index)),
   4278                           inplace=True)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in __init__(self, data, index, columns, dtype, copy)
    390             mgr = self._init_mgr(data, index, columns, dtype=dtype, copy=copy)
    391         elif isinstance(data, dict):
--> 392             mgr = self._init_dict(data, index, columns, dtype=dtype)
    393         elif isinstance(data, ma.MaskedArray):
    394             mask = ma.getmaskarray(data)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _init_dict(self, data, index, columns, dtype)
    521 
    522         return _arrays_to_mgr(arrays, data_names, index, columns,
--> 523                               dtype=dtype)
    524 
    525     def _init_ndarray(self, values, index, columns, dtype=None,

C:\Python27\lib\site-packages\pandas\core\frame.pyc in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5411 
   5412     # consolidate for now
-> 5413     mgr = BlockManager(blocks, axes)
   5414     return mgr.consolidate()
   5415 

C:\Python27\lib\site-packages\pandas\core\internals.pyc in __init__(self, blocks, axes, do_integrity_check)
    802 
    803         if do_integrity_check:
--> 804             self._verify_integrity()
    805 
    806         self._consolidate_check()

C:\Python27\lib\site-packages\pandas\core\internals.pyc in _verify_integrity(self)
    892                                      "items")
    893             if block.values.shape[1:] != mgr_shape[1:]:
--> 894                 raise AssertionError('Block shape incompatible with manager')
    895         tot_items = sum(len(x.items) for x in self.blocks)
    896         if len(self.items) != tot_items:

AssertionError: Block shape incompatible with manager

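I was then going to assign the values returned from apply to two new columns, using the method shown in this question. However, I can't even get that far! If I return just one value, everything works fine.

Using a loop:

I first created two new DataFrame columns and set them to None: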

st['a'] = None
st['b'] = None

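I then looped over all the index values and tried to overwrite those None values, but the modifications don't seem to take effect. That is, no error is raised, but the DataFrame does not appear to change.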

for i in st.index:
    # do calc here
    st.ix[i]['a'] = a
    st.ix[i]['b'] = b

I thought either approach would work, but neither does. So what am I doing wrong here? And what is the best, most "pythonic" and "pandaonic" way to do this?

Answer 1

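To make the first approach work, try returning a Series instead of a tuple (apply is throwing an exception because it doesn't know how to glue the rows back together, since the number of columns doesn't match the original frame).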

import pandas as pd

def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return pd.Series(dict(col1=a, col2=b))
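As a complement to returning a Series, newer pandas releases (0.23 and later) let apply expand a returned tuple into columns via result_type='expand', so the original tuple-returning function can stay as it is. The sketch below assumes a small made-up frame with path and row columns, which stands in for the real st, and the column names a and b are chosen for illustration only.

```python
import pandas as pd

def calculate(s):
    a = s['path'] + 2 * s['row']  # simple calc for example
    b = s['path'] * 0.153
    return (a, b)

# Hypothetical stand-in for the real DataFrame.
st = pd.DataFrame({'path': [1.0, 2.0, 3.0], 'row': [10.0, 20.0, 30.0]})

# The function runs once per row; the tuple is expanded into columns 0 and 1.
expanded = st.apply(calculate, axis=1, result_type='expand')
expanded.columns = ['a', 'b']

# Attach both new columns in one step, aligned on the index.
st = st.join(expanded)
```

The function is still called only once per row, which was the point of the question.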

The second approach should work if you replace:

st.ix[i]['a'] = a

with:

st.ix[i, 'a'] = a
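The chained form st.ix[i]['a'] = a first pulls out a row, which may be a temporary copy, and then assigns into that copy, so the original frame is never touched. Indexing row and column in a single call avoids this. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so on current versions the same idea is written with .loc. The sketch below assumes a small made-up frame with path and row columns in place of the real st.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame.
st = pd.DataFrame({'path': [1.0, 2.0], 'row': [10.0, 20.0]}, index=['x', 'y'])

# Pre-create the target columns as floats rather than None.
st['a'] = float('nan')
st['b'] = float('nan')

for i in st.index:
    a = st.loc[i, 'path'] + 2 * st.loc[i, 'row']
    b = st.loc[i, 'path'] * 0.153
    # One indexing call with (row label, column label) writes into st itself.
    st.loc[i, 'a'] = a
    st.loc[i, 'b'] = b
```

A Python-level loop like this is usually much slower than apply or vectorized arithmetic on large frames, so it is mainly useful for understanding why the original loop had no effect.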

Answer 2

I always use lambdas and the built-in map() function to create new columns by combining existing columns row by row:

# list() is needed on Python 3, where map() returns a lazy iterator
st['a'] = list(map(lambda path, row: path + 2 * row, st['path'], st['row']))

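This is slightly more complicated than necessary for linear combinations of numeric columns. On the other hand, I think it is a good convention to adopt, because it also works for more complex row-wise combinations (for example, with strings) or for filling in missing data in one column using a function of other columns.

For example, suppose you have a table with columns gender and title, and some of the titles are missing. You can fill them in as follows: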

title_dict = {'male': 'mr.', 'female': 'ms.'}
# Keep an existing title; otherwise look one up from the gender column.
table['title'] = list(map(
    lambda title, gender: title if title is not None else title_dict[gender],
    table['title'], table['gender']))
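For this particular fill-in case, a vectorized alternative is to map the gender column through the dictionary and use the result only where title is missing. This is a sketch on a made-up table, not part of the original answer. It also handles missing values stored as NaN, which a comparison against None would not catch.

```python
import pandas as pd

title_dict = {'male': 'mr.', 'female': 'ms.'}

# Hypothetical table with some missing titles.
table = pd.DataFrame({
    'title': [None, 'dr.', None],
    'gender': ['male', 'female', 'female'],
})

# Series.map looks up each gender in the dict; fillna uses those values
# only for rows where title is missing, aligned on the index.
table['title'] = table['title'].fillna(table['gender'].map(title_dict))
```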

Answer 3

This was solved here: Apply pandas function to column to create multiple new columns?

Applied to your question, this should work:

def calculate(s):
    a = s['path'] + 2*s['row'] # Simple calc for example
    b = s['path'] * 0.153
    return pd.Series({'col1': a, 'col2': b})

df = df.merge(df.apply(calculate, axis=1), left_index=True, right_index=True)

Answer 4

Another solution, based on assigning the new columns in a method chain:

st.assign(a = st['path'] + 2*st['row'], b = st['path'] * 0.153)

Note that assign always returns a copy of the data, leaving the original DataFrame untouched.
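Because the original frame is left unchanged, the result has to be bound to a name to keep the new columns. A minimal sketch, assuming a small made-up frame with path and row columns in place of the real st (the name with_new is illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame.
st = pd.DataFrame({'path': [1.0, 2.0], 'row': [10.0, 20.0]})

# assign returns a new DataFrame containing the extra columns.
with_new = st.assign(a=st['path'] + 2 * st['row'], b=st['path'] * 0.153)

# st itself still has only its original columns at this point;
# write st = st.assign(...) instead to keep the result under the same name.
```

This works here because the example calculation is plain column arithmetic. A function that truly needs row-by-row evaluation would still go through apply.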

4 answers

Answer 1
27, accepted, 2013-02-28 01:21:16

Answer 2
18, 2014-06-14 18:14:40

Answer 3
5, 2013-07-23 13:48:07

Answer 4
0, 2016-05-10 05:11:29
