Finding the co-occurrences in the columns of a dataframe given two lists?
I am trying to clean a dataset: I want to find the co-occurrence of two strings, taken from two separate lists, in the columns of a dataframe, in order to obtain how often those two events occur together.
My first list has a length of 27:
df_dis = ['heart attack', 'panic disorder', 'bowel cancer' ...]
And my second list has a length of 57:
df_sym = ['chest pain', 'weight loss', 'extreme hand movement'...]
My dataframe (df) is made up of 5 columns, as follows (I am only showing the first 5 rows):
   Diseases  Symptoms  Counts               Disease_str  Symptoms_str
0   4464711   4831330    5289              heart attack    chest pain
1   4147316   4402204     374  bowel obstructive cancer   weight loss
2   4317917   4317917     510            panic disorder   weight loss
3   4012264   5046090    1154                      COPD  panic attack
4   4819042   5136449     121              heart attack   memory loss
The shape of this df is (18518404, 5). The df contains repeats of the events in both lists, but an entry may also contain only one or two of a term's words, all of them, or additional words, so I am trying to pick up as many of those entries as possible (using the lists) to find how many times the events co-occur.
What I did next to find the co-occurring events was iterate over the dataframe's columns Disease_str and Symptoms_str, given the two lists, to get the .value_counts() from the Counts column, as follows:
for i, j in map(df_dis, df_sym):
    val_counts_ = df['Counts'][(df['Disease_str'] == df_dis[i]) & (df['Symptoms_str'] == df_sym[j])].value_counts()
I am using the operator & because I want the intersection rather than the union |.
However, I get an error message:
TypeError: 'list' object is not callable
I have also tried zip(df_dis, df_sym), but I still get an error message. This time it is:
TypeError: list indices must be integers or slices, not str
What I would like to obtain is a csv file that shows the combination of i and j in one column, the counts, and the total number of times i and j co-occurred.
I would appreciate any help. Since I am new to programming and pandas, I would also appreciate any explanations, so I can jot them down in my notebook and try to understand them better.
Thank you for the help.
You can create a mask where the column Disease_str isin the list df_dis, and the same for the Symptoms_str column with df_sym. Then you filter the rows with this mask, groupby the two columns, and agg on the column Counts to get the count and the sum. Now, to get all the possible combinations from your two lists, you can reindex with the MultiIndex.from_product of the two lists; combinations that never occur are filled with 0 via fill_value.
import pandas as pd

# keep only the rows whose disease and symptom both appear in the lists
m = df['Disease_str'].isin(df_dis) & df['Symptoms_str'].isin(df_sym)
df_ = (df[m].groupby(['Disease_str', 'Symptoms_str'])
            ['Counts'].agg(['count', 'sum'])  # or just ['Counts'].size() if you don't care about the sum
            # add the pairs that never co-occur, filled with 0
            .reindex(pd.MultiIndex.from_product([df_dis, df_sym],
                                                names=['Disease_str', 'Symptoms_str']),
                     fill_value=0)
            .reset_index()
      )
print(df_)
      Disease_str           Symptoms_str  count   sum
0    heart attack             chest pain      1  5289
1    heart attack            weight loss      0     0
2    heart attack  extreme hand movement      0     0
3  panic disorder             chest pain      0     0
4  panic disorder            weight loss      1   510
5  panic disorder  extreme hand movement      0     0
6    bowel cancer             chest pain      0     0
7    bowel cancer            weight loss      0     0
8    bowel cancer  extreme hand movement      0     0
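As for the two errors in the question: map expects a function as its first argument, so map(df_dis, ...) ends up trying to call the list, and zip yields the strings themselves rather than integer positions, so df_dis[i] fails (zip would also only pair items that sit at the same position in both lists). If you want to keep the explicit loop, here is a minimal sketch using itertools.product, which yields every (disease, symptom) pair. It assumes a small toy df built from the sample rows in the question, and the result is turned into CSV text at the end.

```python
from itertools import product

import pandas as pd

# Toy data standing in for the real frame (values taken from the question's sample rows).
df = pd.DataFrame({
    "Counts": [5289, 374, 510, 1154, 121],
    "Disease_str": ["heart attack", "bowel obstructive cancer",
                    "panic disorder", "COPD", "heart attack"],
    "Symptoms_str": ["chest pain", "weight loss", "weight loss",
                     "panic attack", "memory loss"],
})
df_dis = ["heart attack", "panic disorder", "bowel cancer"]
df_sym = ["chest pain", "weight loss", "extreme hand movement"]

# product() yields every (disease, symptom) pair as strings: 3 x 3 = 9 pairs here.
pairs = list(product(df_dis, df_sym))

# Loop version of the original idea: compare the columns against the strings
# directly instead of indexing the lists with them.
rows = []
for dis, sym in pairs:
    mask = (df["Disease_str"] == dis) & (df["Symptoms_str"] == sym)
    rows.append({
        "Disease_str": dis,
        "Symptoms_str": sym,
        "count": int(mask.sum()),                   # number of matching rows
        "sum": int(df.loc[mask, "Counts"].sum()),   # total of Counts over those rows
    })
result = pd.DataFrame(rows)

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = result.to_csv(index=False)
print(csv_text)
```

To get the file you asked for, pass a path to to_csv, for example result.to_csv("cooccurrence.csv", index=False) (the file name is only a placeholder). Note that this loop scans the whole dataframe once per pair, so on 18 million rows and 27 x 57 pairs the groupby version above is likely to be much faster.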