
Find count of unique column elements after using groupby with pandas

I have a data set that is setup like the following:

rows = [
    ('us', 0, 'ca', None, 94107, -100),
    ('ca', 1, None, 'bc', 94107, -100),
    ('us', 0, 'ca', None, 94106, 0),
    ('us', 0, 'ca', None, 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('us', 0, 'ca', None, 94107, 100),
    ('us', 0, 'ca', None, 94107, 100)
]

I want to group by (country, state/province, zip), then count how often each value of the Option column occurs within each group, and finally convert the result to a dict.

Ideally I would like the dict to be formatted as such:

{
    ('us', 'ca', 94107): {100: 2, -100: 1, 0: 1}, 
    ('us', 'ca', 94106): {0: 1},  
    ('ca', 'bc', 94107): {-100: 1, 0: 2}
}

I have the following code so far:

import pandas as pd

# build the data frame
df = pd.DataFrame(rows, columns=['Country', 'LocFilter', 'State', 'Provence', 'Zip', 'Option'])

# consolidate "State" and "Provence" into "MainProvence" based on "LocFilter"
df['MainProvence'] = df.apply(lambda row: (row['Provence'] if row['LocFilter'] == 1 else row['State']), axis=1)

# group by and find distribution
distribution = df.groupby(by=['Country', 'MainProvence','Zip', 'Option'])['Option'].count()
# print the result
print(distribution)

This gives me the following - which looks good:

Country  MainProvence  Zip    Option
ca       bc            94107  -100      1
                               0        2
us       ca            94106   0        1
                       94107  -100      1
                               0        1
                               100      2
Name: Option, dtype: int64

However, when I convert this to a dict:

print(distribution.to_dict())

I get this:

{
    ('us', 'ca', 94107, 100): 2, 
    ('us', 'ca', 94106, 0): 1, 
    ('us', 'ca', 94107, -100): 1, 
    ('ca', 'bc', 94107, 0): 2, 
    ('ca', 'bc', 94107, -100): 1, 
    ('us', 'ca', 94107, 0): 1
}

That is understandable given how I formed the groupby. I could obviously manipulate the returned dict in Python to get the format I want, but is there a way to get this format directly with pandas?

This is super easy: unstack the Option level so that each option becomes a column, then convert row by row. Try:

distribution.unstack(level=['Option']).to_dict(orient='index')

To get

{('ca', 'bc', 94107): {-100: 1.0, 0: 2.0, 100: nan},
 ('us', 'ca', 94106): {-100: nan, 0: 1.0, 100: nan},
 ('us', 'ca', 94107): {-100: 1.0, 0: 1.0, 100: 2.0}}

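The NaNs appear because unstack fills in every (group, option) combination that has no rows, which also turns the counts into floats. I think dropping the NaNs shouldn't be too much of an inconvenience at this point.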
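One way to drop the NaN entries is a dict comprehension over the unstacked result. This is only a sketch: it uses size() rather than count() (equivalent here because Option has no missing values), and the names wide and result are illustrative, not part of the original code.

```python
import pandas as pd

rows = [
    ('us', 0, 'ca', None, 94107, -100),
    ('ca', 1, None, 'bc', 94107, -100),
    ('us', 0, 'ca', None, 94106, 0),
    ('us', 0, 'ca', None, 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('us', 0, 'ca', None, 94107, 100),
    ('us', 0, 'ca', None, 94107, 100),
]
df = pd.DataFrame(rows, columns=['Country', 'LocFilter', 'State', 'Provence', 'Zip', 'Option'])
df['MainProvence'] = df['State'].fillna(df['Provence'])

# Rows per (Country, MainProvence, Zip, Option) combination.
distribution = df.groupby(['Country', 'MainProvence', 'Zip', 'Option']).size()

# One row per (Country, MainProvence, Zip), one column per Option value;
# combinations with no rows become NaN.
wide = distribution.unstack(level='Option').to_dict(orient='index')

# Keep only the options that actually occurred and cast back to plain ints.
result = {
    (country, province, int(zip_code)): {
        int(option): int(count)
        for option, count in counts.items()
        if pd.notna(count)
    }
    for (country, province, zip_code), counts in wide.items()
}

print(result)
```

The int() casts are only there so the keys and counts print as plain Python integers instead of NumPy scalars or floats.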


PS. Consider using:

df['MainProvence'] = df['State'].fillna(df['Provence'])

in place of

df['MainProvence'] = df.apply(lambda row: (row['Provence'] if row['LocFilter'] == 1 else row['State']), axis=1)
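The two versions are not identical in general: fillna picks Provence wherever State is missing and ignores LocFilter entirely, so the swap assumes State is null exactly on the rows where LocFilter is 1. A quick check on the sample data, with via_apply, via_fillna, and same_values as illustrative names:

```python
import pandas as pd

rows = [
    ('us', 0, 'ca', None, 94107, -100),
    ('ca', 1, None, 'bc', 94107, -100),
    ('us', 0, 'ca', None, 94106, 0),
    ('us', 0, 'ca', None, 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('ca', 1, None, 'bc', 94107, 0),
    ('us', 0, 'ca', None, 94107, 100),
    ('us', 0, 'ca', None, 94107, 100),
]
df = pd.DataFrame(rows, columns=['Country', 'LocFilter', 'State', 'Provence', 'Zip', 'Option'])

# Row-wise version from the question: choose the column based on LocFilter.
via_apply = df.apply(
    lambda row: row['Provence'] if row['LocFilter'] == 1 else row['State'],
    axis=1,
)

# Vectorized version: take State, fall back to Provence where State is missing.
via_fillna = df['State'].fillna(df['Provence'])

# Compare the values element by element.
same_values = via_apply.tolist() == via_fillna.tolist()
print(same_values)
```

If the real data contains rows where both columns are filled in, the two approaches can disagree, so it is worth running a check like this before switching.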

PPS. You will need pandas 0.17 or later for orient='index' to work in to_dict().
