简体   繁体   中英

Python Pandas co-occurrence after groupby

I would like to compute the co-occurrence percentages after grouping. I am unable to determine the best method for doing so. I can think of ways to brute-force the answers, but this means lots of hard-coded calculations that may break as more source data is added. There must be a more elegant method, but I don't see it. I appreciate any suggestions.

(perhaps a little similar to Python Pandas check if a value occurs more then once in the same day )

Goal: Table of co-occurrence percentages for a data column after grouping. For example: When A occurred, B was found with A 45% of the time in January. When A occurred, C was found with A 21% of the time for week 6.

Sample Data (df):

Date        ID      Region  Event
1/01/2016   1001    S       C
1/01/2016   1001    S       D
1/01/2016   1001    N       E
1/01/2016   1002    E       D
1/02/2016   1003    E       A
1/04/2016   1005    N       B
1/04/2016   1005    N       B
1/04/2016   1005    N       B
1/04/2016   1006    N       A
1/04/2016   1006    N       F
2/12/2016   1008    E       C
2/12/2016   1008    E       B

To calculate the percentages, I need to find Events that happen in with the same ID. So, for the whole dataset C when B is 50%, B isolated is 50% and all others are 0%. But, if i groupby Month, then B isolated is 100% for Jan, and C when B is 100% for Feb.

Currently, I have code using .isin and .drop_duplicates to find and reduce the lists:

b_ids = df[df.Event == 'B'].ID.drop_duplicates()
x = len(b_ids)
c_when_b = df[(df.ID.isin(b_ids)) & (df.Event == 'C')].ID.drop_duplicates()
y = len(c_when_b)
pct_cb = float(x)/y

Problems:

  • How can this be extended to all binary combinations of Events (the real data has 25 events)
  • How do I modify this for easy grouping by date (week, month, quarter, etc.)?
  • How can the Region also be a grouping?
  • How can it easily be extended to multiple criteria ( (A | B) & (C | D) )?
  • Is there something easy that I'm completely missing? Please let me know if this is unclear. Thanks in advance.

EDIT: Expected output would be a multiple column series for each event for a given time grouping for plotting (ignore these actual numbers):

EVENT A
      A       B     C     ...
1     96.19   1.23  2.22
2     96.23   1.56  1.12
3     95.24   2.58  3.02
4     78.98   20.31 1.11
...   ....    ...   ...

EVENT B
      A       B      C    ...
1     96.19   1.23   3.33
2     96.23   1.56   1.08
3     95.24   2.58   1.78
4     78.98   20.31  5.12
...   ....    ...    ...

I think you want crosstabs:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html

This will give you just the raw frequencies. You can then divide each cell by the total number number of occurrences to get joint probabilities.

EDIT: I'm reading your question more thoroughly, and I think you're going to need to do a lot of data wrangling beyond just tossing pd.crosstabs at your original dataset. For example, you'll probably want to create a new column df['Week'], which is just a value 1-#ofWeeks based on df['Date'].

But this question is a little old, so maybe you already figured this out.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM