How to apply a function only on selected rows and columns of pandas data frame?
How to apply functions with multiple arguments on Pandas selected columns data frame
I have the following data frame:
import pandas as pd
data = {'gene':['a','b','c','d','e'],
'count':[61,320,34,14,33],
'gene_length':[152,86,92,170,111]}
df = pd.DataFrame(data)
df = df[["gene","count","gene_length"]]
It looks like this:
In [9]: df
Out[9]:
gene count gene_length
0 a 61 152
1 b 320 86
2 c 34 92
3 d 14 170
4 e 33 111
What I want to do is apply a function:
def calculate_RPKM(theC,theN,theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm
to the count and gene_length columns together with the constant N=12345, and name the new result 'rpkm'. But why does this fail?
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
What is the right way to do it? The first row should look like this:
gene count gene_length rpkm
a 61 152 32508.366
Update: the error I get is this:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-6270e1d19b89> in <module>()
----> 1 df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
<ipython-input-1-48e311ca02f3> in calculate_RPKM(theC, theN, theL)
13 theN == Total reads mapped
14 """
---> 15 rpkm = float((10**9) * theC)/(theN * theL)
16 return rpkm
/u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
74 return converter(self.iloc[0])
75 raise TypeError(
---> 76 "cannot convert the series to {0}".format(str(converter)))
77 return wrapper
78
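The DataFrame.apply method takes an axis parameter; when it is set to 1, each whole row is passed to the applied function. This is considerably slower than operating on whole columns, because the function is called once per row rather than once on the full column. But it does work.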
Like this:
N=12345
df["rpkm"] = df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)
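To make it clearer what the row-wise function receives, here is a minimal sketch, assuming a recent pandas release. The helper name rpkm_for_row and its total_reads parameter are hypothetical and not part of the original answer; the constant is forwarded through the args parameter of DataFrame.apply instead of a lambda.

```python
import pandas as pd

df = pd.DataFrame({'gene': ['a', 'b', 'c'],
                   'count': [61, 320, 34],
                   'gene_length': [152, 86, 92]})
N = 12345

def rpkm_for_row(row, total_reads):
    # With axis=1, `row` is a Series indexed by the selected column names,
    # so the values can be looked up by label.
    return 1e9 * row['count'] / (total_reads * row['gene_length'])

# args= forwards extra positional arguments to the applied function.
df['rpkm'] = df[['count', 'gene_length']].apply(rpkm_for_row, axis=1, args=(N,))
print(df)
```

Looking values up by column label avoids depending on the order of the selected columns.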
Don't cast to float inside your method, and it will work:
In [9]:
def calculate_RPKM(theC,theN, theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
df
Out[9]:
gene count gene_length rpkm
0 a 61 152 32508.366908
1 b 320 86 301411.926493
2 c 34 92 29936.429112
3 d 14 170 6670.955138
4 e 33 111 24082.405613
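The error message is telling you that a pandas Series cannot be converted to a float. You could call apply to invoke your method row by row, but you should consider rewriting the method so that it works on whole Series. That version is vectorized and will be much faster than calling apply, which is essentially a for loop.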
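The failure can be reproduced in isolation. The sketch below assumes a Series with more than one element (the variable names are illustrative only) and shows that the built-in float() rejects it, while astype(float) converts element by element.

```python
import pandas as pd

counts = pd.Series([61, 320, 34])

# float() expects a single scalar, so a multi-element Series is rejected.
try:
    float(10**9 * counts)
    cast_failed = False
except TypeError as err:
    cast_failed = True
    print("TypeError:", err)

# Element-wise conversion keeps the result a Series.
scaled = (10**9 * counts).astype(float)
print(scaled.dtype)
```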
Timings
In [11]:
def calculate_RPKM1(theC,theN, theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
def calculate_RPKM(theC,theN,theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
%timeit calculate_RPKM1(df['count'],N,df['gene_length'])
%timeit df[(['count', 'gene_length'])].apply(lambda x: calculate_RPKM(x[0], N, x[1]), axis=1)
1000 loops, best of 3: 238 µs per loop
100 loops, best of 3: 1.5 ms per loop
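You can see that the vectorized version without the float cast is over 6x faster than the apply version, and the gap will be larger on bigger datasets.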
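Outside IPython, the same comparison can be sketched with the standard timeit module. The frame size of 2000 rows and the names big, rpkm_vectorized and rpkm_scalar are assumptions made for illustration, and the measured times will vary by machine and pandas version.

```python
import timeit
import pandas as pd

n_rows = 2000  # hypothetical size, chosen only for illustration
big = pd.DataFrame({'count': range(1, n_rows + 1),
                    'gene_length': range(100, n_rows + 100)})
N = 12345

def rpkm_vectorized(c, n, l):
    # Operates on whole Series at once.
    return (10**9 * c) / (n * l)

def rpkm_scalar(c, n, l):
    # Works only on scalars because of the float() cast.
    return float(10**9 * c) / (n * l)

def row_wise():
    return big.apply(lambda r: rpkm_scalar(r['count'], N, r['gene_length']), axis=1)

vec = rpkm_vectorized(big['count'], N, big['gene_length'])
row = row_wise()

t_vec = timeit.timeit(lambda: rpkm_vectorized(big['count'], N, big['gene_length']), number=20)
t_row = timeit.timeit(row_wise, number=20)
print("vectorized: %.4f s, row-wise apply: %.4f s" % (t_vec, t_row))
```

Both approaches should give the same values; only the running time differs.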
Update
The following code, which uses the version of the method without the float cast, is semantically equivalent:
df['rpkm'] = calculate_RPKM1(df['count'].astype(float),N,df['gene_length'])
df
Out[16]:
gene count gene_length rpkm
0 a 61 152 32508.366908
1 b 320 86 301411.926493
2 c 34 92 29936.429112
3 d 14 170 6670.955138
4 e 33 111 24082.405613
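This seems to be fixed simply by removing the float cast from the function definition; the operation is then applied to the two Series as a whole: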
def calculate_RPKM(theC,theN,theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = ((10 ** 9) * theC)/(theN * theL)
    return rpkm
df['rpkm'] = calculate_RPKM(df['count'], N, df['gene_length'])
Output of df['rpkm']:
0 32508.366908
1 301411.926493
2 29936.429112
3 6670.955138
4 24082.405613
Name: rpkm, dtype: float64
If you want to be completely sure that the output is floating point, you can convert both Series to float:
counts = df['count'].astype(float)
lengths = df['gene_length'].astype(float)
df['rpkm'] = calculate_RPKM(counts, N, lengths)
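As a quick check of the result type, the sketch below assumes Python 3, where / is always true division (the original session ran Python 2.7). It compares the dtype of the result with and without the explicit casts; the variable names plain and cast are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'count': [61, 320], 'gene_length': [152, 86]})
N = 12345

def calculate_RPKM(theC, theN, theL):
    return ((10**9) * theC) / (theN * theL)

# Integer Series in, without any cast.
plain = calculate_RPKM(df['count'], N, df['gene_length'])

# Both Series cast to float first.
cast = calculate_RPKM(df['count'].astype(float), N, df['gene_length'].astype(float))

print(plain.dtype, cast.dtype)
```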