Finding the co-occurrences in the columns of a dataframe given two lists?

I am trying to clean a dataset and want to find the co-occurrences of two strings, taken from two separate lists, in the columns of a dataframe, in order to obtain the frequency with which the two events occur together.

My first list has a length of 27 and looks like this:

df_dis = ['heart attack', 'panic disorder', 'bowel cancer' ...] 

My second list has a length of 57 and looks like this:

df_sym = ['chest pain', 'weight loss', 'extreme hand movement'...]

My dataframe (df) has 5 columns, as follows (I am only showing the first 5 rows):

    Diseases    Symptoms   Counts   Disease_str                  Symptoms_str
0   4464711     4831330     5289    heart attack                 chest pain
1   4147316     4402204     374     bowel obstructive cancer     weight loss
2   4317917     4317917     510     panic disorder               weight loss
3   4012264     5046090     1154    COPD                         panic attack
4   4819042     5136449     121     heart attack                 memory loss

The shape of this df is (18518404, 5). The df contains repeats of the events in both lists, but an entry may also contain one, two, or all of the words of a list item, or additional words, so I am trying to pick up as many of those entries as possible (using the lists) to find how many times the events co-occur.

To find the co-occurring events, I iterated over the dataframe's columns Disease_str and Symptoms_str using the two lists, to get the .value_counts() from the Counts column, as follows:

for i, j in map(df_dis, df_sys):
    val_counts_ = df['Counts'][(df['Disease_str'] == df_dis[i]) & (df['Symptoms_str'] == df_sys[j])].value_counts()

I am using the operator &, because I want the intersection rather than the union |.

However, I get an error message:

TypeError: 'list' object is not callable

I have also tried zip(df_dis, df_sys), but I still get an error message. This time it is TypeError: list indices must be integers or slices, not str.

What I would like to obtain is a csv file that shows the combination of i & j in one column, the counts, and the total number of times i & j co-occurred.

I would appreciate any help. Since I am new to programming and pandas, I would also appreciate any explanations, so I can jot them down in my notebook and try to understand them better.

Thank you for the help.

You can create a mask where the column Disease_str isin the list df_dis, and the same for the Symptoms_str column with the list df_sym. Then you filter the rows with this mask, groupby the two columns, and agg on the column Counts to get the count and the sum. To get all the possible combinations from your two lists, you can reindex with the MultiIndex.from_product of the two lists.

import pandas as pd

# keep only rows whose disease and symptom both appear in the lists
m = df['Disease_str'].isin(df_dis) & df['Symptoms_str'].isin(df_sym)
df_ = (df[m].groupby(['Disease_str', 'Symptoms_str'])
            ['Counts'].agg(['count', 'sum'])  # or just ['Counts'].size() if you don't care about the sum
            # add every (disease, symptom) combination, with 0 where a pair never occurs
            .reindex(pd.MultiIndex.from_product([df_dis, df_sym],
                                                names=['Disease_str', 'Symptoms_str']),
                     fill_value=0)
            .reset_index()
      )
print(df_)
      Disease_str           Symptoms_str  count   sum
0    heart attack             chest pain      1  5289
1    heart attack            weight loss      0     0
2    heart attack  extreme hand movement      0     0
3  panic disorder             chest pain      0     0
4  panic disorder            weight loss      1   510
5  panic disorder  extreme hand movement      0     0
6    bowel cancer             chest pain      0     0
7    bowel cancer            weight loss      0     0
8    bowel cancer  extreme hand movement      0     0
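
To end up with the csv file asked for in the question, the result can be written out with to_csv. Below is a minimal, self-contained sketch, assuming the five sample rows from the question and only the three list entries that are shown there. The file name cooccurrence.csv and the helper name cooccurrence_table are made up for illustration, and joining the two names into a single "pair" column is just one possible reading of "the combination of i & j in one column".

```python
import pandas as pd

# Sample rows from the question (only the columns that are needed).
df = pd.DataFrame({
    "Counts": [5289, 374, 510, 1154, 121],
    "Disease_str": ["heart attack", "bowel obstructive cancer",
                    "panic disorder", "COPD", "heart attack"],
    "Symptoms_str": ["chest pain", "weight loss", "weight loss",
                     "panic attack", "memory loss"],
})
# Only the list entries visible in the question; the real lists are longer.
df_dis = ["heart attack", "panic disorder", "bowel cancer"]
df_sym = ["chest pain", "weight loss", "extreme hand movement"]


def cooccurrence_table(df, diseases, symptoms):
    """Hypothetical helper: count and sum of Counts per (disease, symptom) pair."""
    mask = df["Disease_str"].isin(diseases) & df["Symptoms_str"].isin(symptoms)
    all_pairs = pd.MultiIndex.from_product(
        [diseases, symptoms], names=["Disease_str", "Symptoms_str"])
    out = (df[mask]
           .groupby(["Disease_str", "Symptoms_str"])["Counts"]
           .agg(["count", "sum"])
           .reindex(all_pairs, fill_value=0)
           .reset_index())
    # One column holding the combination, e.g. "heart attack & chest pain".
    out.insert(0, "pair", out["Disease_str"] + " & " + out["Symptoms_str"])
    return out


result = cooccurrence_table(df, df_dis, df_sym)
result.to_csv("cooccurrence.csv", index=False)  # assumed file name
```

index=False keeps the row numbers out of the file, so the csv contains only the pair, the two names, the count and the sum.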

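As for the two errors in the loop: map expects a function as its first argument, so map(df_dis, df_sys) tries to call the list, hence 'list' object is not callable. zip does yield pairs, but the pairs are the strings themselves, so df_dis[i] ends up indexing a list with a string. zip also only pairs the first item with the first, the second with the second, and so on, rather than every disease with every symptom. If a loop is preferred, itertools.product gives every combination. A small sketch, again assuming the sample rows and the three visible list entries (the names rows and loop_result are made up here); on 18 million rows the groupby version should be considerably faster, since this loop scans the whole frame once per pair.

```python
from itertools import product

import pandas as pd

# Sample rows from the question (only the columns that are needed).
df = pd.DataFrame({
    "Counts": [5289, 374, 510, 1154, 121],
    "Disease_str": ["heart attack", "bowel obstructive cancer",
                    "panic disorder", "COPD", "heart attack"],
    "Symptoms_str": ["chest pain", "weight loss", "weight loss",
                     "panic attack", "memory loss"],
})
# Only the list entries visible in the question; the real lists are longer.
df_dis = ["heart attack", "panic disorder", "bowel cancer"]
df_sym = ["chest pain", "weight loss", "extreme hand movement"]

rows = []  # one dict per (disease, symptom) pair
for dis, sym in product(df_dis, df_sym):
    # dis and sym are the strings themselves, so compare directly
    # instead of indexing the lists with them.
    match = (df["Disease_str"] == dis) & (df["Symptoms_str"] == sym)
    counts = df.loc[match, "Counts"]
    rows.append({"Disease_str": dis, "Symptoms_str": sym,
                 "count": len(counts), "sum": int(counts.sum())})

loop_result = pd.DataFrame(rows)
print(loop_result)
```
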