Python Pandas: Forming a matrix (2D array) from the values in a dataframe (ignoring NaN values)

Question

I have a dataframe with 12 columns (drug categories), where identical values (drug category names) can appear across different columns.

                             DRG01                     DRG02  ...   DRG11 DRG12
0          AMOXYCILLIN ORAL SOLIDS   AMOEBICIDES ORAL SOLIDS  ...   NaN   NaN
1                    VITAMIN DROPS                       NaN  ...   NaN   NaN
2          AMOXYCILLIN ORAL SOLIDS   ANTIHISTAMINES ORAL LIQ  ...   NaN   NaN
3          AMOEBICIDES ORAL LIQUID                       NaN  ...   NaN   NaN
...                            ...                       ...  ...   ...   ...
81531                          NaN                       NaN  ...   NaN   NaN
[81532 rows x 12 columns]

My objective is to create a matrix (2D array) whose rows and columns are the unique drug category names (ignoring/dropping the NaN values). The value of each cell would be the number of times the two drug category names appear together in the same row. Essentially, I'm trying to achieve something like the following:

                         AMOXYCILLIN ORAL SOLIDS  AMOEBICIDES ORAL SOLIDS  ANTIHISTAMINES ORAL LIQ  VITAM..
AMOXYCILLIN ORAL SOLIDS                        0                        1                        1        0
AMOEBICIDES ORAL SOLIDS                        1                        1                        0        0
ANTIHISTAMINES ORAL LIQ                        1                        0                        0        0
VITAMIN DROPS                                  0                        0                        0        1
.....
.....

Answer 1

Like this?

from collections import Counter, defaultdict
import pandas as pd

# For every drug, count how many times it appears in a row together with every other drug.
connection_counter = defaultdict(Counter)

def to_counter(row):
    # Register each drug in the row as connected to all other drugs in that row.
    drugs = row.dropna()  # ignore NaN values
    for drug_name in drugs:
        connection_counter[drug_name].update(drugs)
        connection_counter[drug_name].pop(drug_name, None)  # so it won't count an appearance with itself

df.apply(to_counter, axis=1)  # df is the table you have

# The table you want. A Counter returns 0 for pairs that never occur together.
# (DataFrame.append was removed in pandas 2.0, so the frame is built in one step.)
drugs = list(connection_counter)
df1 = pd.DataFrame(
    [[connection_counter[r][c] for c in drugs] for r in drugs],
    index=drugs,
    columns=drugs,
)
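
To make the counting idea above concrete, here is a minimal runnable sketch on a toy frame. The two-column `df` is an assumption built from the first four rows shown in the question (the real data has 12 columns), and the variable name `matrix` is only illustrative.

```python
from collections import Counter, defaultdict

import pandas as pd

# Toy data (assumption): the first four rows of the question, two columns only.
df = pd.DataFrame({
    "DRG01": ["AMOXYCILLIN ORAL SOLIDS", "VITAMIN DROPS",
              "AMOXYCILLIN ORAL SOLIDS", "AMOEBICIDES ORAL LIQUID"],
    "DRG02": ["AMOEBICIDES ORAL SOLIDS", None,
              "ANTIHISTAMINES ORAL LIQ", None],
})

# drug -> Counter of the drugs it shares a row with
connection_counter = defaultdict(Counter)
for _, row in df.iterrows():
    drugs = row.dropna()  # skip missing values
    for drug_name in drugs:
        connection_counter[drug_name].update(drugs)
        connection_counter[drug_name].pop(drug_name, None)  # no self-pairs

# Square matrix over every drug seen; a Counter yields 0 for unseen pairs.
names = list(connection_counter)
matrix = pd.DataFrame(
    [[connection_counter[r][c] for c in names] for r in names],
    index=names,
    columns=names,
)
print(matrix)
```

Drugs that only ever appear alone (such as VITAMIN DROPS here) still get a row and a column, filled with zeros.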

Answer 2

Using itertools.combinations and a few pandas functions, you can do it quite nicely:

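from itertools import combinations
import pandas as pd

# pairs_df has a row for every pair of drugs that share a row (in columns 0, 1); NaN values are dropped first.
pairs = [sorted(pair) for _, row in df.iterrows() for pair in combinations(row.dropna(), 2)]
pairs_df = pd.DataFrame(pairs, columns=[0, 1])
pairs_df["occurrences"] = 1
pairs_df = pairs_df.groupby([0, 1]).sum()  # Group identical combinations and count occurrences.
# Pivot to create the requested shape. Each pair is stored once (names sorted), so only one triangle is filled.
result_df = pairs_df.reset_index(level=1).pivot(columns=1, values="occurrences")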
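
As a possible alternative, and a different technique from both answers, the same symmetric matrix can be sketched with a 0/1 indicator table and a matrix product. The toy `df` below is an assumption built from the first four rows of the question, and the names `long`, `indicator` and `co` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): the first four rows of the question, two columns only.
df = pd.DataFrame({
    "DRG01": ["AMOXYCILLIN ORAL SOLIDS", "VITAMIN DROPS",
              "AMOXYCILLIN ORAL SOLIDS", "AMOEBICIDES ORAL LIQUID"],
    "DRG02": ["AMOEBICIDES ORAL SOLIDS", None,
              "ANTIHISTAMINES ORAL LIQ", None],
})

# Long format: one (row id, drug) record per non-missing cell.
long = (
    df.reset_index()
      .melt(id_vars="index", value_name="drug")
      .dropna(subset=["drug"])
)

# Indicator table: 1 if the drug occurs anywhere in that row, else 0.
indicator = pd.crosstab(long["index"], long["drug"]).clip(upper=1)

# (drugs x rows) times (rows x drugs) counts the rows shared by each pair.
co = indicator.T.dot(indicator)

# The diagonal holds each drug's own row count; zero it to keep pairs only.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
co = pd.DataFrame(values, index=co.index, columns=co.columns)
print(co)
```

Because of the `clip`, a drug repeated within one row is counted once for that row. The indicator table has one column per unique drug and one row per input row, so memory use grows with both.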
