
How to apply a function with multiple arguments to selected columns of a pandas DataFrame

I have the following data frame:

import pandas as pd 
data = {'gene':['a','b','c','d','e'],
        'count':[61,320,34,14,33],
        'gene_length':[152,86,92,170,111]}
df = pd.DataFrame(data)
df = df[["gene","count","gene_length"]]

That looks like this:

In [9]: df
Out[9]:
  gene  count  gene_length
0    a     61          152
1    b    320           86
2    c     34           92
3    d     14          170
4    e     33          111

What I want to do is to apply a function:

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm

to the count and gene_length columns, together with a constant N=12345, and store the result in a new column named 'rpkm'. Why does the following fail?

N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])

What's the right way to do it? The first row should look something like this:

 gene  count  gene_length rpkm
   a     61          152  32508.366

Update: the error I got is this:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-6270e1d19b89> in <module>()
----> 1 df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])

<ipython-input-1-48e311ca02f3> in calculate_RPKM(theC, theN, theL)
     13     theN  == Total reads mapped
     14     """
---> 15     rpkm = float((10**9) * theC)/(theN * theL)
     16     return rpkm

/u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
     74             return converter(self.iloc[0])
     75         raise TypeError(
---> 76             "cannot convert the series to {0}".format(str(converter)))
     77     return wrapper
     78

The DataFrame.apply method takes an axis parameter; when it is set to 1, each whole row is passed to the applied function. This is a lot slower than a vectorised operation on the columns, because the function is called once per row rather than once for the entire column, but it does work.

Like this:

N=12345
df["rpkm"] = df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)
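As a variation on the row-wise approach, the constant can be forwarded through the `args` parameter of `DataFrame.apply` instead of being captured by a lambda. This is only a sketch: the helper name `calculate_rpkm_row` is an assumption introduced for illustration and is not part of the original question.

```python
import pandas as pd

def calculate_rpkm_row(row, total_reads):
    # row is a single DataFrame row (a Series); columns are looked up by label.
    return (10**9) * row["count"] / (total_reads * row["gene_length"])

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

# args=(N,) is passed to the function after the row itself.
df["rpkm"] = df[["count", "gene_length"]].apply(calculate_rpkm_row, axis=1, args=(N,))
print(df)
```

Looking columns up by label keeps the function independent of column order, at the cost of still being called once per row.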

Don't cast to float in your method and it will work fine:

In [9]:
def calculate_RPKM(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm
N=12345
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])
df

Out[9]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

The error message is telling you that a pandas Series cannot be cast to a float. You could call apply to run your method row-wise, but it is better to rewrite the method so that it works on entire Series: that is vectorised and much faster than calling apply, which is essentially a for loop.
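The difference between casting and arithmetic can be seen on a small Series. This is a minimal sketch; the names `counts`, `cast_failed` and `scaled` are assumptions used only for illustration.

```python
import pandas as pd

counts = pd.Series([61, 320, 34])

# float() expects a single scalar, so a multi-element Series is rejected.
try:
    float(counts)
    cast_failed = False
except TypeError as exc:
    cast_failed = True
    print("TypeError:", exc)

# Plain arithmetic is applied element-wise and returns a new Series.
scaled = (10**9) * counts
print(scaled)
```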

Timings

In [11]:

def calculate_RPKM1(theC,theN, theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10**9) * theC)/(theN * theL)
    return rpkm

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = float((10**9) * theC)/(theN * theL)
    return rpkm
N=12345

%timeit calculate_RPKM1(df['count'],N,df['gene_length'])
%timeit df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)

1000 loops, best of 3: 238 µs per loop
100 loops, best of 3: 1.5 ms per loop

You can see that the non-casting version is over 6x faster, and its advantage will be even greater on larger datasets.
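Outside IPython, a similar comparison can be sketched with the standard `timeit` module on a larger synthetic frame. The frame `big`, its size and its random contents are assumptions made for illustration; the absolute timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

def calculate_rpkm(counts, total_reads, lengths):
    # Works on scalars and on whole Series alike, since nothing is cast.
    return (10**9) * counts / (total_reads * lengths)

# Synthetic data, made up for this sketch.
rng = np.random.default_rng(0)
n_rows = 2000
big = pd.DataFrame({
    "count": rng.integers(1, 1000, size=n_rows),
    "gene_length": rng.integers(50, 500, size=n_rows),
})
N = 12345

def vectorised():
    return calculate_rpkm(big["count"], N, big["gene_length"])

def row_wise():
    return big.apply(
        lambda row: calculate_rpkm(row["count"], N, row["gene_length"]), axis=1
    )

# Total seconds for 5 calls of each variant.
print("vectorised:", timeit.timeit(vectorised, number=5))
print("row-wise:  ", timeit.timeit(row_wise, number=5))
```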

Update

The following code, used with the non-casting version of your method, is semantically equivalent to the original float cast:

df['rpkm'] = calculate_RPKM1(df['count'].astype(float),N,df['gene_length'])
df

Out[16]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

This seems to be fixed simply by removing the float cast from the function definition; the operation is then applied element-wise down the two Series:

def calculate_RPKM(theC,theN,theL):
    """
    theC  == Total reads mapped to a feature (gene/linc)
    theL  == Length of feature (gene/linc)
    theN  == Total reads mapped
    """
    rpkm = ((10 ** 9) * theC)/(theN * theL)
    return rpkm

df['rpkm'] = calculate_RPKM(df['count'], N, df['gene_length'])

The output of df['rpkm']

0     32508.366908
1    301411.926493
2     29936.429112
3      6670.955138
4     24082.405613
Name: rpkm, dtype: float64

If you want to be entirely sure that the output is a float, you could convert the two Series to floats before passing them in:

counts = df['count'].astype(float)
lengths = df['gene_length'].astype(float)

df['rpkm'] = calculate_RPKM(counts, N, lengths)
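To check whether the explicit conversion changes anything, the two results can be compared directly. This sketch assumes Python 3 and a recent pandas, where `/` on integer Series already performs true division; the names `from_ints` and `from_floats` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

def calculate_RPKM(theC, theN, theL):
    # Non-casting version: works element-wise on Series.
    return ((10**9) * theC) / (theN * theL)

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

from_ints = calculate_RPKM(df["count"], N, df["gene_length"])
from_floats = calculate_RPKM(
    df["count"].astype(float), N, df["gene_length"].astype(float)
)

# Compare the dtypes and the values of the two results.
print(from_ints.dtype, from_floats.dtype)
print(np.allclose(from_ints, from_floats))
```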
