
How to apply a SciPy function to a pandas DataFrame

I have the following data frame:

import pandas as pd
import io
from scipy import stats

temp=u"""probegenes,sample1,sample2,sample3
1415777_at Pnliprp1,20,0.00,11
1415805_at Clps,17,0.00,55
1415884_at Cela3b,47,0.00,100"""
df = pd.read_csv(io.StringIO(temp),index_col='probegenes')
df

It looks like this:

                     sample1  sample2  sample3
probegenes
1415777_at Pnliprp1       20        0       11
1415805_at Clps           17        0       55
1415884_at Cela3b         47        0      100

What I want to do is perform a row-wise z-score calculation using SciPy. With this code I get:

In [98]: stats.zscore(df,axis=1)
Out[98]:
array([[ 1.18195176, -1.26346568,  0.08151391],
       [-0.30444376, -1.04380717,  1.34825093],
       [-0.04896043, -1.19953047,  1.2484909 ]])

How can I conveniently attach the column and index names back to that result?

In the end, it should look like this:

                         sample1      sample2      sample3
probegenes
1415777_at Pnliprp1   1.18195176  -1.26346568   0.08151391
1415805_at Clps      -0.30444376  -1.04380717   1.34825093
1415884_at Cela3b    -0.04896043  -1.19953047    1.2484909

The documentation for pd.DataFrame says:

data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to np.arange(n) if no column labels are provided

So,

pd.DataFrame(
    stats.zscore(df,axis=1),
    index=df.index,
    columns=df.columns)

should do the job.
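
As a self-contained sketch of the above, using the sample data from the question (the names `z` and `z_df` are my own, and passing `df.to_numpy()` instead of `df` is just a precaution to make it explicit that SciPy only sees the raw values):

```python
import io

import pandas as pd
from scipy import stats

temp = u"""probegenes,sample1,sample2,sample3
1415777_at Pnliprp1,20,0.00,11
1415805_at Clps,17,0.00,55
1415884_at Cela3b,47,0.00,100"""
df = pd.read_csv(io.StringIO(temp), index_col='probegenes')

# SciPy works on the underlying values, so the row and column labels are lost here.
z = stats.zscore(df.to_numpy(), axis=1)

# Re-attach the original labels by passing them to the DataFrame constructor.
z_df = pd.DataFrame(z, index=df.index, columns=df.columns)
print(z_df)
```

Because `index` and `columns` are taken from `df` itself, the result keeps the `probegenes` index name as well as the sample column names.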

You don't need SciPy. You can do it using a lambda function:

>>> df.apply(lambda row: (row - row.mean()) / row.std(ddof=0), axis=1) 
                      sample1   sample2   sample3
probegenes                                       
1415777_at Pnliprp1  1.181952 -1.263466  0.081514
1415805_at Clps     -0.304444 -1.043807  1.348251
1415884_at Cela3b   -0.048960 -1.199530  1.248491
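
Note the `ddof=0` in the lambda: `stats.zscore` uses `ddof=0` by default, while the pandas `.std()` default is `ddof=1`, so leaving it out would give different numbers. The same calculation can also be written without `apply`, using `sub` and `div` with `axis=0`. The following is only a sketch on the question's sample data (the names `row_mean`, `row_std`, `z_vec`, `z_apply` and `z_scipy` are my own) that also checks the three approaches against each other:

```python
import io

import numpy as np
import pandas as pd
from scipy import stats

temp = u"""probegenes,sample1,sample2,sample3
1415777_at Pnliprp1,20,0.00,11
1415805_at Clps,17,0.00,55
1415884_at Cela3b,47,0.00,100"""
df = pd.read_csv(io.StringIO(temp), index_col='probegenes')

# Per-row mean and population standard deviation (ddof=0 to match stats.zscore).
row_mean = df.mean(axis=1)
row_std = df.std(axis=1, ddof=0)

# axis=0 aligns each per-row Series with the frame's index, so every row
# is shifted and scaled by its own mean and standard deviation.
z_vec = df.sub(row_mean, axis=0).div(row_std, axis=0)

# The lambda version from above and the SciPy version, for comparison.
z_apply = df.apply(lambda row: (row - row.mean()) / row.std(ddof=0), axis=1)
z_scipy = stats.zscore(df.to_numpy(), axis=1)

print(np.allclose(z_vec, z_apply), np.allclose(z_vec, z_scipy))
```

Both pandas versions return a DataFrame that already carries the original index and column labels, so nothing has to be re-attached afterwards.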
