How to split a dataframe with 74 rows and 3234 columns by column in Python, based on a condition

Question

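Say I have a dataframe of shape (74, 3234): 74 rows and 3234 columns. I have a function that runs a correlation analysis. However, when I pass this dataframe as is, it takes a very long time to print the results. So I want to split the dataframe into multiple chunks and run the function on each chunk.

The dataframe has 20,000 columns whose names contain the string _PC and 15,000 columns whose names contain the string _lncRNAs.

The condition is that I need to split the dataframe into multiple smaller dataframes, each containing columns with both _PC and _lncRNAs in their names. For example, df1 must contain 500 columns made up of _lncRNAs and _PC columns.

I envision ending up with multiple dataframes, each always with 74 rows but with consecutive columns. For example 1-500, 501-1000, 1001-1500, 1501-2000, and so on until the last column.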

df1.shape
(74, 500)
df2.shape
(74, 500)

... and so on

An example

df1.head()
sam   END_PC  END2_PC END3_lncRNAs END4_lncRNAs
SAP1    50.9   30.4   49.0          50
SAP2      6    8.9     12.4 39.8   345.9888

Then I need to run the function below on each of the split dataframes.

import pandas as pd
from scipy.stats import pearsonr

def correlation_analysis(lncRNA_PC_T):
    """
    Pearson correlation of every _PC column against every _lncRNAs column.
    """
    pc_cols = [column for column in lncRNA_PC_T.columns if '_PC' in column]
    lncRNA_cols = [column for column in lncRNA_PC_T.columns if '_lncRNAs' in column]
    # Collect rows in a list; DataFrame.append was removed in pandas 2.0
    rows = []
    for PC in pc_cols:
        for lncRNA in lncRNA_cols:
            pcc, p_value = pearsonr(lncRNA_PC_T[PC], lncRNA_PC_T[lncRNA])
            rows.append({'index': PC + '_' + lncRNA, 'PCC': pcc, 'p-value': p_value})
    correlations = pd.DataFrame(rows, columns=['index', 'PCC', 'p-value'])
    # Recover both gene names from the combined label, e.g. 'END_PC_END3_lncRNAs'
    correlations['PC'] = correlations['index'].apply(lambda x: x.split('PC')[0].strip('_'))
    correlations['lncRNAs'] = correlations['index'].apply(lambda x: x.split('PC')[1].split('_')[1])
    correlations = correlations.drop('index', axis=1)
    correlations = correlations.reindex(columns=['PC', 'lncRNAs', 'PCC', 'p-value'])
    return correlations

For each dataframe, the output should look like this:

              gene          PCC   p-value
END_PC_END3_lncRNAs  -0.042027   0.722192
END2_PC_END3_lncRNAs  -0.017090   0.885088
END_PC_END4_lncRNAs    0.001417    0.990441
END2_PC_END3_lncRNAs  -0.041592   0.724954

I know it is possible to split by rows like this:

n = 200000  #chunk row size
list_df = [df[i:i+n] for i in range(0,df.shape[0],n)]

I would like something like this, but based on columns. Any suggestions or help would be much appreciated. Thanks.

Answer 1

How about df.iloc?

And use df.shape[1] for the number of columns:

list_df = [df.iloc[:, i:i+n] for i in range(0, df.shape[1], n)]

Reference: How to take column slices of a dataframe in pandas
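To make the slicing concrete, here is a minimal sketch on random stand-in data with the question's shape. The column names C0, C1, ... and the chunk width of 500 are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 74 rows, 3234 columns
df = pd.DataFrame(np.random.rand(74, 3234),
                  columns=[f"C{i}" for i in range(3234)])

n = 500  # chunk column size
list_df = [df.iloc[:, i:i + n] for i in range(0, df.shape[1], n)]

# Every chunk keeps all 74 rows; only the last one is narrower
shapes = [chunk.shape for chunk in list_df]
print(len(list_df))   # 7 chunks
print(shapes[0])      # (74, 500)
print(shapes[-1])     # (74, 234)
```

Concatenating the chunks again with pd.concat(list_df, axis=1) should give back the original dataframe, which is a quick way to check that no column was lost.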

Answer 2

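Like Basil wrote, but using pandas.DataFrame.iloc.

I don't know what the column labels are, so to make this independent of the index and column labels it is better to use: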

list_df = [df.iloc[:,i:i+n] for i in range(0, df.shape[1], n)]

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
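A purely positional slice does not guarantee that each chunk holds both kinds of columns, which the question asks for. One possible way around that, sketched here and not taken from the answers, is to chunk only the _PC columns and attach all _lncRNAs columns to every chunk, so each PC/lncRNA pair appears exactly once. The toy column names, the endswith matching and the chunk size are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 _PC columns and 4 _lncRNAs columns
rng = np.random.default_rng(0)
cols = [f"G{i}_PC" for i in range(6)] + [f"L{i}_lncRNAs" for i in range(4)]
df = pd.DataFrame(rng.random((74, len(cols))), columns=cols)

pc_cols = [c for c in df.columns if c.endswith("_PC")]
lnc_cols = [c for c in df.columns if c.endswith("_lncRNAs")]

n = 4  # number of _PC columns per chunk (assumed; tune to your memory budget)
chunks = [df[pc_cols[i:i + n] + lnc_cols] for i in range(0, len(pc_cols), n)]

for chunk in chunks:
    print(chunk.shape)
```

Each chunk can then be passed to correlation_analysis, and the per-chunk results combined with pd.concat.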

Answer 3

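Here is what I tried, to see how long it takes to evaluate the correlation between the rows and between the columns of the dataframe (df). The row correlation took less than 50 ms, while the column correlation took less than 2 seconds.

Row correlation output shape: (74x74)
Column correlation output shape: (3000x3000)

Dummy data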

import numpy as np
import pandas as pd

## Create Dummy Data
a = np.random.rand(74, 3000)
print(f'a.shape: {a.shape}')

## Create Dataframe
index = [f'R{i}' for i in range(a.shape[0])]
columns = [f'C{i}' for i in range(a.shape[1])]
df = pd.DataFrame(a, columns=columns, index=index)
df.shape # (74, 3000)

Evaluating the correlation

I ran the following in a Jupyter notebook:

%%time
## Correlation between rows of df (%%time must be the first line of the cell)
df.T.corr()
# CPU times: user 39.5 ms, sys: 1.09 ms, total: 40.6 ms
# Wall time: 41.3 ms

%%time
## Correlation between columns of df (run in a separate cell)
df.corr()
# CPU times: user 1.64 s, sys: 34.6 ms, total: 1.67 s
# Wall time: 1.67 s

Output: df.corr()

Since the dataframe has shape (74, 3000), df.corr() produces a dataframe of shape (3000, 3000).

             C0        C1        C2  ...     C2997     C2998     C2999
C0     1.000000  0.064772  0.077853  ... -0.126288  0.033484 -0.154657
C1     0.064772  1.000000  0.031059  ...  0.064317  0.095075 -0.100423
C2     0.077853  0.031059  1.000000  ... -0.123791 -0.034085  0.052334
C3     0.070557  0.229482  0.047476  ...  0.043630 -0.055772  0.037123
C4     0.165782  0.189635 -0.009193  ... -0.123917  0.097660  0.074777
...         ...       ...       ...  ...       ...       ...       ...
C2995 -0.097033 -0.126214  0.051592  ...  0.008921 -0.004141  0.221091
C2996  0.099591  0.030975 -0.081584  ...  0.186931  0.084529  0.063596
C2997 -0.126288  0.064317 -0.123791  ...  1.000000  0.061555  0.024695
C2998  0.033484  0.095075 -0.034085  ...  0.061555  1.000000  0.195013
C2999 -0.154657 -0.100423  0.052334  ...  0.024695  0.195013  1.000000
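df.corr() returns only the coefficients, while the function in the question also reports a p-value per pair. A minimal sketch of how the two-sided p-values could be derived from the correlation matrix, assuming no missing values, uses the standard t statistic with n - 2 degrees of freedom. The small five-column frame and the gene label format are stand-ins for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((74, 5)), columns=[f"C{i}" for i in range(5)])

r = df.corr()            # Pearson r for every pair of columns
dof = len(df) - 2        # degrees of freedom; assumes no missing values

# t statistic for each r; clipping avoids division by zero on the diagonal
r_np = np.clip(r.to_numpy(), -1 + 1e-12, 1 - 1e-12)
t_stat = r_np * np.sqrt(dof / (1.0 - r_np ** 2))
p_np = 2 * stats.t.sf(np.abs(t_stat), dof)

# Long format: one row per unordered pair of distinct columns
i, j = np.triu_indices(r.shape[0], k=1)
long = pd.DataFrame({
    "gene": [f"{r.index[a]}_{r.columns[b]}" for a, b in zip(i, j)],
    "PCC": r.to_numpy()[i, j],
    "p-value": p_np[i, j],
})
print(long.head())
```

The values should agree with scipy.stats.pearsonr applied to the same pair of columns, up to floating-point error.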

Answer 4

If you want the correlation between the _lncRNAs columns and the _PC columns, you can try something like this:

df_pc=df.filter(like='_PC')
df_lncRNAs=df.filter(like='_lncRNAs')
pd.concat([df_pc, df_lncRNAs], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']

Example:

import pandas as pd
df = pd.DataFrame({"a_pc": [1, 2, 3, 4, 5, 6],
                   "b_pc": [3, 210, 12, 412, 512, 61],
                   "c_pc": [1, 2, 3, 4, 5, 6],
                   "d_lncRNAs": [3, 210, 12, 412, 512, 61],
                   "d1_lncRNAs": [3, 210, 12, 412, 512, 61]})

df_pc = df.filter(like='_pc')
df_lncRNAs = df.filter(like='_lncRNAs')
correlation = pd.concat([df_pc, df_lncRNAs], axis=1, keys=['df1', 'df2']).corr().loc['df2', 'df1']
correlation

Output:

df
   a_pc  b_pc  c_pc  d_lncRNAs  d1_lncRNAs
0     1     3     1          3           3
1     2   210     2        210         210
2     3    12     3         12          12
3     4   412     4        412         412
4     5   512     5        512         512
5     6    61     6         61          61

df_pc
   a_pc  b_pc  c_pc
0     1     3     1
1     2   210     2
2     3    12     3
3     4   412     4
4     5   512     5
5     6    61     6

df_lncRNAs 
   d_lncRNAs  d1_lncRNAs
0          3           3
1        210         210
2         12          12
3        412         412
4        512         512
5         61          61

correlation
                a_pc  b_pc      c_pc
d_lncRNAs   0.392799   1.0  0.392799
d1_lncRNAs  0.392799   1.0  0.392799
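The concat-then-corr approach computes the full correlation matrix over all columns before selecting the cross block. With tens of thousands of columns that intermediate matrix can get large, so here is a hedged alternative sketch that computes only the lncRNAs-by-PC block with NumPy. It assumes no missing values and no constant columns; the zscore helper is a hypothetical name introduced for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a_pc": [1, 2, 3, 4, 5, 6],
                   "b_pc": [3, 210, 12, 412, 512, 61],
                   "c_pc": [1, 2, 3, 4, 5, 6],
                   "d_lncRNAs": [3, 210, 12, 412, 512, 61],
                   "d1_lncRNAs": [3, 210, 12, 412, 512, 61]})

df_pc = df.filter(like="_pc")
df_lncRNAs = df.filter(like="_lncRNAs")

def zscore(frame):
    """Center each column and scale it to unit sample standard deviation."""
    values = frame.to_numpy(dtype=float)
    return (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)

# Pearson r is the scaled dot product of z-scored columns
n_obs = len(df)
cross = zscore(df_lncRNAs).T @ zscore(df_pc) / (n_obs - 1)
correlation = pd.DataFrame(cross, index=df_lncRNAs.columns, columns=df_pc.columns)
print(correlation)
```

On this example the result should match the table above. For very wide data, the _PC block could additionally be processed in column chunks and the partial results concatenated along the columns.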


Answer 1
0 2020-07-16 18:12:20

Answer 2
0 2020-07-16 18:23:26

Answer 3
0 2020-07-16 19:31:26


Answer 4
0 Accepted 2020-07-16 20:03:24
