How to apply a function with multiple arguments to selected columns of a Pandas DataFrame

Question

I have the following DataFrame:

import pandas as pd 
data = {'gene':['a','b','c','d','e'],
        'count':[61,320,34,14,33],
        'gene_length':[152,86,92,170,111]}
df = pd.DataFrame(data)
df = df[["gene","count","gene_length"]]

It looks like this:

In [9]: df
Out[9]:
  gene  count  gene_length
0    a     61          152
1    b    320           86
2    c     34           92
3    d     14          170
4    e     33          111

What I want to do is apply this function:

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm

to the count and gene_length columns together with the constant N=12345, and store the result in a new column named 'rpkm'. But why does this fail?

N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])

What is the right way to do it? The first row should look like this:

 gene  count  gene_length rpkm
   a     61          152  32508.366

Update: this is the error I get:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-6270e1d19b89> in <module>()
----> 1 df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])

<ipython-input-1-48e311ca02f3> in calculate_RPKM(theC, theN, theL)
     13     theN  == Total reads mapped
     14     """
---> 15     rpkm = float((10**9) * theC)/(theN * theL)
     16     return rpkm

/u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
     74             return converter(self.iloc[0])
     75         raise TypeError(
---> 76             "cannot convert the series to {0}".format(str(converter)))
     77     return wrapper
     78

Answer 1

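The DataFrame.apply method takes an axis argument; when it is set to 1, each whole row is passed to the applied function. This is considerably slower than a vectorized operation, because the function is called in Python once per row. But it does work.

Like this: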

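N=12345
# Access the row values by column label; positional x[0]/x[1] is deprecated in recent pandas.
df["rpkm"] = df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)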
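As a minimal sketch of the same row-wise approach, the constant can also be handed to apply through its args parameter instead of a lambda. The helper name rpkm_for_row is hypothetical and not part of the original answer; the data is the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345  # total mapped reads, as in the question


def rpkm_for_row(row, total_reads):
    """Hypothetical helper: compute RPKM for one row (a Series of scalars)."""
    # float() is safe here because row["count"] is a single number, not a Series.
    return float(10**9 * row["count"]) / (total_reads * row["gene_length"])


# axis=1 passes one row at a time; args=(N,) supplies the extra positional argument.
df["rpkm"] = df[["count", "gene_length"]].apply(rpkm_for_row, axis=1, args=(N,))
print(df)
```

This keeps the original float cast working, at the cost of one Python function call per row.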

Answer 2

Don't cast to float inside your function, and it will work fine:

In [9]:
def calculate_RPKM(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
df

Out[9]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

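The error message is telling you that a pandas Series cannot be converted to float. You could call apply to invoke your method row by row, but you should consider rewriting the method so that it works on a whole Series; that version is vectorized and much faster than apply, which is essentially a for loop.

Timings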
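The following sketch reproduces the failure in isolation and then shows the vectorized alternative. It assumes Python 3 and a reasonably recent pandas; the variable names counts, lengths, and cast_failed are introduced here only for illustration.

```python
import pandas as pd

counts = pd.Series([61, 320, 34, 14, 33])
lengths = pd.Series([152, 86, 92, 170, 111])
N = 12345

# float() expects a single value, so a multi-element Series raises TypeError.
try:
    float(10**9 * counts)
    cast_failed = False
except TypeError as exc:
    cast_failed = True
    print("float() on a Series failed:", exc)

# Plain arithmetic operators work element-wise on whole Series.
rpkm = (10**9 * counts) / (N * lengths)
print(rpkm)
```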

In [11]:

def calculate_RPKM1(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm
N=12345

%timeit calculate_RPKM1(df['count'],N,df['gene_length'])
%timeit df[(['count', 'gene_length'])].apply(lambda x: calculate_RPKM(x[0], N, x[1]), axis=1)

1000 loops, best of 3: 238 µs per loop
100 loops, best of 3: 1.5 ms per loop

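You can see that the non-casting version is more than 6 times faster, and the gap will be larger on bigger datasets.

Update

The following code is semantically equivalent to using the non-casting version of the method: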
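Outside IPython, a comparable measurement can be sketched with the standard timeit module. The frame size of 10,000 rows, the random data, and the function names vectorized and rowwise are arbitrary assumptions made for this example; actual timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 10_000  # arbitrary size chosen for this sketch
big = pd.DataFrame({
    "count": rng.integers(1, 1000, size=n_rows),
    "gene_length": rng.integers(50, 500, size=n_rows),
})
N = 12345


def vectorized():
    # Operates on whole columns at once.
    return (10**9 * big["count"]) / (N * big["gene_length"])


def rowwise():
    # Calls the lambda once per row.
    return big.apply(
        lambda row: float(10**9 * row["count"]) / (N * row["gene_length"]),
        axis=1,
    )


t_vec = timeit.timeit(vectorized, number=3)
t_row = timeit.timeit(rowwise, number=3)
print(f"vectorized: {t_vec:.4f} s, row-wise apply: {t_row:.4f} s (3 runs each)")

# Both approaches should produce the same values.
same = np.allclose(vectorized(), rowwise())
print("results match:", same)
```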

df['rpkm'] = calculate_RPKM1(df['count'].astype(float),N,df['gene_length'])
df

Out[16]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

Answer 3

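This seems to be fixed simply by removing the float cast from the function definition; the operation is then applied element-wise across both Series: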

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10 ** 9) * theC)/(theN * theL)
    return rpkm

df['rpkm'] = calculate_RPKM(df['count'], N, df['gene_length'])

Output of df['rpkm']:

0     32508.366908
1    301411.926493
2     29936.429112
3      6670.955138
4     24082.405613
Name: rpkm, dtype: float64

If you want to be completely sure the output is a float, you can convert both Series to float:

counts = df['count'].astype(float)
lengths = df['gene_length'].astype(float)

df['rpkm'] = calculate_RPKM(counts, N, lengths)
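A quick way to check this is to compare the dtypes and values of both variants. This is a sketch assuming Python 3, where / is true division; calculate_rpkm is a hypothetical lowercase rewrite of the non-casting function above.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345


def calculate_rpkm(counts, total_reads, lengths):
    """Hypothetical vectorized rewrite without the float() cast."""
    return (10**9 * counts) / (total_reads * lengths)


# Integer input columns.
rpkm_int = calculate_rpkm(df["count"], N, df["gene_length"])
# Input columns explicitly converted to float first.
rpkm_float = calculate_rpkm(
    df["count"].astype(float), N, df["gene_length"].astype(float)
)

print(rpkm_int.dtype, rpkm_float.dtype)
# Raises AssertionError if the two results differ.
pd.testing.assert_series_equal(rpkm_int, rpkm_float)
```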

Answer 1
1 2015-06-15 08:58:23

Answer 2
1 Accepted 2015-06-15 09:02:28

Answer 3
1 2015-06-15 09:08:24
