
Set analysis: create a pandas Series with intersections as the index and counts as the values

I've spent all day trying to make this work and it's starting to make me angry! All I want to do is create the pandas Series that upsetplot needs as input, as detailed here:

https://pypi.org/project/upsetplot/

I don't understand how the generate_data function is manipulating its sets to make a series. I would have assumed that there was a simple way to do this by calling set(), but I can't seem to find it.

So I instead began manipulating my dataframes directly but suspected the attempts were misguided.

Thus I resort to providing a simple dataframe below and pray that some kind soul can enlighten me.

import pandas as pd
from matplotlib import pyplot as plt
from upsetplot import generate_data, plot

df = pd.DataFrame({'john':[1,2,3,5,7,8],
              'jerry':[1,2,5,7,9,2],
              'josie':[2,2,3,2,5,6],
              'jean':[6,5,7,6,2,4]})

df = pd.DataFrame({'john':[True,False,True,False,True,False],
              'jerry':[True,True,False,True,False,True],
              'josie':[True,False,False,True,False,False],
              'jean':[True,False,False,True,False,False],
              'food':['apple','carrot','choc','bread','ham','nut']})

The example from the package home page:

from upsetplot import generate_data
example = generate_data(aggregated=True)
example  # doctest: +NORMALIZE_WHITESPACE
set0   set1   set2
False  False  False      56
              True      283
       True   False    1279
              True     5882
True   False  False      24
              True       90
       True   False     429
              True     1957
Name: value, dtype: int64

Aggregate the counts with GroupBy.size, grouping by every column except food:

df = pd.DataFrame({'john':[True,False,True,False,True,False],
              'jerry':[True,True,False,True,False,True],
              'josie':[True,False,False,True,False,False],
              'jean':[True,False,False,True,False,False],
              'food':['apple','carrot','choc','bread','ham','nut']})

cols = df.columns.difference(['food']).tolist()
s = df.groupby(cols).size()
print(s)
jean   jerry  john   josie
False  False  True   False    2
       True   False  False    2
True   True   False  True     1
              True   True     1
dtype: int64
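
Since the question also asks about starting from set(), here is a minimal sketch of the same groupby idea applied to plain Python sets. The `sets` dictionary is an assumption for illustration: it reuses the integer columns of the first dataframe in the question, treated as sets, so duplicates such as the repeated 2 collapse. Only pandas is used; nothing here is taken from upsetplot's own code.

```python
import pandas as pd

# Hypothetical input: one Python set per name, built from the integer
# columns of the question's first dataframe (duplicates collapse).
sets = {
    'john': {1, 2, 3, 5, 7, 8},
    'jerry': {1, 2, 5, 7, 9},
    'josie': {2, 3, 5, 6},
    'jean': {2, 4, 5, 6, 7},
}

# Every element that appears in at least one set.
universe = sorted(set().union(*sets.values()))

# One row per element, one boolean column per set: True if the element
# belongs to that set.
membership = pd.DataFrame(
    {name: [item in members for item in universe]
     for name, members in sets.items()},
    index=universe,
)

# Group by all boolean columns and count the rows, which gives a Series
# with a boolean MultiIndex and the size of each exclusive intersection.
counts = membership.groupby(list(membership.columns)).size()
print(counts)
```

The resulting Series has the same layout as the generate_data(aggregated=True) example above (a boolean MultiIndex with counts as values), so it should be usable as input to upsetplot's plot, although that call is not exercised here.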
