
Python Pandas: Forming a matrix (2D array) from the values in a dataframe (ignoring NaN values)

I have a dataframe with 12 columns (drug categories), where identical values (drug category names) can appear across the different columns.

                             DRG01                     DRG02  ...   DRG11 DRG12
0          AMOXYCILLIN ORAL SOLIDS   AMOEBICIDES ORAL SOLIDS  ...   NaN   NaN
1                    VITAMIN DROPS                       NaN  ...   NaN   NaN
2          AMOXYCILLIN ORAL SOLIDS   ANTIHISTAMINES ORAL LIQ  ...   NaN   NaN
3          AMOEBICIDES ORAL LIQUID                       NaN  ...   NaN   NaN
...                            ...                       ...  ...   ...   ...
81531                          NaN                       NaN  ...   NaN   NaN
[81532 rows x 12 columns]

My objective is to create a matrix (2D array) whose rows and columns are the unique drug category names (ignoring/dropping the NaN values). The value of each cell would be the number of times the two drug category names appear together in the same row of the dataframe. Essentially I'm trying to achieve something like this:

                        AMOXYCILLIN ORAL SOLIDS  AMOEBICIDES ORAL SOLIDS  ANTIHISTAMINES ORAL LIQ VITAM..
AMOXYCILLIN ORAL SOLIDS      0                         1                       1                    0
AMOEBICIDES ORAL SOLIDS      1                         1                       0                    0
ANTIHISTAMINES ORAL LIQ      1                         0                       0                    0
VITAMIN DROPS                0                         0                       0                    1
.....
.....

Like this?

from collections import Counter, defaultdict

import pandas as pd

# For every drug, count how often each other drug appears in the same row.
connection_counter = defaultdict(Counter)

def to_counter(row):
    """Add one co-occurrence for every pair of distinct drugs in the row."""
    drugs = set(row.dropna())  # ignore NaN cells and names repeated within the row
    for drug_name in drugs:
        # a drug is never counted as appearing with itself
        connection_counter[drug_name].update(drugs - {drug_name})

df.apply(to_counter, axis=1)  # df is the table you have

# The table you want. DataFrame.append was removed in pandas 2.0, so the frame
# is built in one step and reindexed so that every drug gets a row and a column.
drug_names = sorted(connection_counter)
df1 = (
    pd.DataFrame(connection_counter)
    .reindex(index=drug_names, columns=drug_names)
    .fillna(0)
    .astype(int)
)
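
To try the Counter idea above without the real data, here is a minimal self-contained sketch. The function name `cooccurrence_from_counters` and the two-column `toy` frame are made up for illustration (the toy rows are modelled on the first rows shown in the question), and the sketch assumes that a name repeated within one row should count only once.

```python
from collections import Counter, defaultdict

import numpy as np
import pandas as pd


def cooccurrence_from_counters(df: pd.DataFrame) -> pd.DataFrame:
    """Count, for every pair of distinct values, the rows of df that contain both."""
    counts = defaultdict(Counter)
    for _, row in df.iterrows():
        drugs = set(row.dropna())  # unique non-NaN names in this row
        for drug in drugs:
            counts[drug].update(drugs - {drug})  # never pair a drug with itself
    names = sorted(counts)
    # Outer keys become columns, inner keys become rows; reindex makes it square.
    return (
        pd.DataFrame(counts)
        .reindex(index=names, columns=names)
        .fillna(0)
        .astype(int)
    )


# Hypothetical toy data: two of the twelve columns, four rows.
toy = pd.DataFrame(
    {
        "DRG01": ["AMOXYCILLIN ORAL SOLIDS", "VITAMIN DROPS",
                  "AMOXYCILLIN ORAL SOLIDS", np.nan],
        "DRG02": ["AMOEBICIDES ORAL SOLIDS", np.nan,
                  "ANTIHISTAMINES ORAL LIQ", np.nan],
    }
)
matrix = cooccurrence_from_counters(toy)
print(matrix)
```

A drug that only ever appears alone (VITAMIN DROPS in the toy data) still gets a row and a column, filled with zeros, and the diagonal is zero by construction.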

Using itertools.combinations and a few pandas functions, you can do it quite nicely:

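from itertools import combinations
import pandas as pd

# One (sorted) pair per two distinct drugs sharing a row of df; NaN is dropped
# first because sorted() cannot compare NaN (a float) with strings.
pairs = [p for _, x in df.iterrows() for p in combinations(sorted(x.dropna().unique()), 2)]
pairs_df = pd.DataFrame(pairs, columns=[0, 1])  # a row for every pair of drugs (in columns 0, 1)
pairs_df["occurrences"] = 1
pairs_df = pairs_df.groupby([0, 1]).sum()  # Group identical combinations and count occurrences.
result_df = pairs_df["occurrences"].unstack(fill_value=0)  # Pivot; only one triangle is filled.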
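
If a full square, symmetric matrix is wanted, one alternative to enumerating pairs is a row-by-drug indicator table multiplied by its own transpose. This is a different technique from the two answers above, shown only as a sketch: the function name `cooccurrence_from_indicators` and the `toy` frame are hypothetical, and the diagonal is set to zero on the assumption that a drug should not be counted with itself.

```python
import numpy as np
import pandas as pd


def cooccurrence_from_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Square co-occurrence matrix built from a 0/1 row-by-drug indicator table."""
    values = df.to_numpy().ravel()  # row-major: row 0's cells, then row 1's, ...
    row_ids = np.repeat(np.arange(len(df)), df.shape[1])  # row number of each cell
    keep = pd.notna(values)  # ignore NaN cells
    # crosstab counts cells per (row, drug); clip turns repeats into a 0/1 flag.
    indicators = pd.crosstab(row_ids[keep], values[keep]).clip(upper=1)
    flags = indicators.to_numpy()
    counts = flags.T @ flags  # entry (i, j) = number of rows containing both i and j
    np.fill_diagonal(counts, 0)  # assumption: no self co-occurrence
    names = indicators.columns.rename(None)
    return pd.DataFrame(counts, index=names, columns=names)


# Hypothetical toy data modelled on the first rows of the question.
toy = pd.DataFrame(
    {
        "DRG01": ["AMOXYCILLIN ORAL SOLIDS", "VITAMIN DROPS",
                  "AMOXYCILLIN ORAL SOLIDS", np.nan],
        "DRG02": ["AMOEBICIDES ORAL SOLIDS", np.nan,
                  "ANTIHISTAMINES ORAL LIQ", np.nan],
    }
)
square = cooccurrence_from_indicators(toy)
print(square)
```

Before the diagonal is zeroed it holds the number of rows each drug appears in, so dropping the `np.fill_diagonal` call keeps that information if it is useful.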
