I have the following data frame:
import pandas as pd
data = {'gene':['a','b','c','d','e'],
'count':[61,320,34,14,33],
'gene_length':[152,86,92,170,111]}
df = pd.DataFrame(data)
df = df[["gene","count","gene_length"]]
That looks like this:
In [9]: df
Out[9]:
  gene  count  gene_length
0    a     61          152
1    b    320           86
2    c     34           92
3    d     14          170
4    e     33          111
What I want to do is apply this function:
def calculate_RPKM(theC,theN,theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm
to the count and gene_length columns, with a constant N=12345, and store the result in a new column named 'rpkm'. Why does the following fail?
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
What's the right way to do it? The first row should look something like this:
gene  count  gene_length       rpkm
a        61          152  32508.366
Update: the error I got is this:
--------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-4-6270e1d19b89> in <module>()
----> 1 df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
<ipython-input-1-48e311ca02f3> in calculate_RPKM(theC, theN, theL)
13 theN == Total reads mapped
14 """
---> 15 rpkm = float((10**9) * theC)/(theN * theL)
16 return rpkm
/u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
74 return converter(self.iloc[0])
75 raise TypeError(
---> 76 "cannot convert the series to {0}".format(str(converter)))
77 return wrapper
78
The DataFrame.apply method takes an axis parameter; with axis=1, each row is passed to the function as a Series. This is much slower than a vectorised operation, because the function is called once per row in Python, but it does work.
Like this:
N=12345
df["rpkm"] = df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)
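For reference, here is a minimal self-contained sketch of the row-wise approach. It looks up each row's values by column label rather than by position. The name calculate_rpkm_scalar is made up for this illustration; it is the same formula as in the question, applied to one row at a time.

```python
import pandas as pd

def calculate_rpkm_scalar(count, total_reads, length):
    # Hypothetical scalar helper: float() is fine here because each
    # argument is a single number, not a whole Series.
    return float(10**9 * count) / (total_reads * length)

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

# axis=1 hands each row to the lambda as a Series indexed by column name.
df["rpkm"] = df[["count", "gene_length"]].apply(
    lambda row: calculate_rpkm_scalar(row["count"], N, row["gene_length"]),
    axis=1,
)
print(df)
```

This keeps the original float() cast intact, at the cost of one Python function call per row.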
Don't cast to float in your method and it will work fine:
In [9]:
def calculate_RPKM(theC,theN, theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
df
Out[9]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613
The error message is telling you that you cannot cast a pandas Series to a float. You could call apply to run your method row-wise, but it is better to rewrite the method so that it works on the entire Series: that version is vectorised and much faster than apply, which is essentially a for loop.
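To make the failure concrete, the sketch below reproduces it on a small Series and then runs the same arithmetic without the cast. The variable names (counts, lengths, raised) are illustrative only, and the exact wording of the exception message may vary between pandas versions.

```python
import pandas as pd

counts = pd.Series([61, 320, 34], name="count")
lengths = pd.Series([152, 86, 92], name="gene_length")
N = 12345

# float() expects a single value; a Series with several elements cannot
# be converted, so pandas raises TypeError.
try:
    float(10**9 * counts)
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Without the cast, the operators work element-wise on the whole Series.
rpkm = (10**9 * counts) / (N * lengths)
print(rpkm)
```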
Timings
In [11]:
def calculate_RPKM1(theC,theN, theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
def calculate_RPKM(theC,theN,theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
%timeit calculate_RPKM1(df['count'],N,df['gene_length'])
%timeit df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)
1000 loops, best of 3: 238 µs per loop
100 loops, best of 3: 1.5 ms per loop
You can see that the vectorised, non-casting version is over 6x faster even on this five-row frame, and the gap should widen on larger datasets.
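Outside IPython, a similar comparison can be sketched with the standard timeit module. Everything below is an assumption made for illustration: the synthetic 2,000-row frame, the seed, and the helper names rpkm_vectorised, run_vectorised and run_apply. No timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def rpkm_vectorised(counts, total_reads, lengths):
    # Operates on whole Series at once; no float() cast.
    return (10**9 * counts) / (total_reads * lengths)

# Synthetic data, larger than the five-row example, for timing only.
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "count": rng.integers(1, 1000, size=2000),
    "gene_length": rng.integers(50, 500, size=2000),
})
N = 12345

def run_vectorised():
    return rpkm_vectorised(big["count"], N, big["gene_length"])

def run_apply():
    # Row-wise version: one Python-level call per row.
    return big.apply(
        lambda row: float(10**9 * row["count"]) / (N * row["gene_length"]),
        axis=1,
    )

t_vec = timeit.timeit(run_vectorised, number=5)
t_apply = timeit.timeit(run_apply, number=5)
print("vectorised: %.4f s, apply: %.4f s" % (t_vec, t_apply))
```

Both functions compute the same values, so only the elapsed time differs.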
Update
Casting the count column to float before calling the non-casting version of your method is semantically equivalent to your original code:
df['rpkm'] = calculate_RPKM1(df['count'].astype(float),N,df['gene_length'])
df
Out[16]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613
This can be fixed simply by removing the float() cast from the function; the arithmetic is then applied element-wise across the two Series:
def calculate_RPKM(theC,theN,theL):
    """
    theC == Total reads mapped to a feature (gene/linc)
    theL == Length of feature (gene/linc)
    theN == Total reads mapped
    """
    rpkm = ((10 ** 9) * theC)/(theN * theL)
    return rpkm
df['rpkm'] = calculate_RPKM(df['count'], N, df['gene_length'])
The output of df['rpkm'] is:
0 32508.366908
1 301411.926493
2 29936.429112
3 6670.955138
4 24082.405613
Name: rpkm, dtype: float64
If you want to be entirely sure that the output is a float, you can convert the two Series to floats before passing them in:
counts = df['count'].astype(float)
lengths = df['gene_length'].astype(float)
df['rpkm'] = calculate_RPKM(counts, N, lengths)
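As a quick check, the sketch below compares the result dtype with and without the explicit astype(float) conversion. It assumes Python 3, where / on integer Series is true division; the names from_ints and from_floats are illustrative.

```python
import pandas as pd

def calculate_rpkm(counts, total_reads, lengths):
    # Same formula as above, without the float() cast.
    return (10**9 * counts) / (total_reads * lengths)

df = pd.DataFrame({
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

# Integer inputs: true division already yields a floating-point Series.
from_ints = calculate_rpkm(df["count"], N, df["gene_length"])

# Explicit conversion first, for code that must not rely on that behaviour.
from_floats = calculate_rpkm(
    df["count"].astype(float), N, df["gene_length"].astype(float)
)

print(from_ints.dtype, from_floats.dtype)
```

Under that assumption the explicit conversion mainly documents intent; the computed values match either way.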