
How to calculate a ratio based on a condition in a pandas DataFrame

I have a pandas DataFrame like the following:

   date                         item_id
2016-01-19                      [188, 188]
2016-01-23                      [188, 142]
2016-02-05                      [188, 264]
2016-02-06  [273, 248, 191, 167, 238, 191]
2016-02-15                           [320]
2016-02-17                           [286]
2016-02-20                      [164, 317]

For each item_id, I want to calculate the ratio (number of different dates on which the item_id appears) / (number of unique item_id values). In the data above, item_id 188 appears on 3 different dates, so its ratio is 3 divided by the number of unique item_id values. The sample contains 12 unique item_id values, so that is 3/12.

Raw data and the code used to create the DataFrame above:

buyer_id item_id        date
261_23     188  2016-01-19
261_23     188  2016-01-19
261_23     188  2016-01-23
261_23     142  2016-01-23
261_23     188  2016-02-05
261_23     264  2016-02-05
261_23     273  2016-02-06
261_23     248  2016-02-06
261_23     191  2016-02-06
261_23     167  2016-02-06
261_23     238  2016-02-06
261_23     191  2016-02-06
261_23     320  2016-02-15
261_23     286  2016-02-17
261_23     164  2016-02-20
261_23     317  2016-02-20

df.groupby(['date','buyer_id'])['item_id'].apply(lambda x: x.tolist())
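For anyone who wants to reproduce this, here is a minimal self-contained sketch of the step above. The variable names `records` and `df_long` are my own, and the `reset_index()` call is an assumption added so that `date` and `item_id` end up as ordinary columns.

```python
import pandas as pd

# Raw purchase records copied from the table above: (buyer_id, item_id, date).
records = [
    ("261_23", 188, "2016-01-19"), ("261_23", 188, "2016-01-19"),
    ("261_23", 188, "2016-01-23"), ("261_23", 142, "2016-01-23"),
    ("261_23", 188, "2016-02-05"), ("261_23", 264, "2016-02-05"),
    ("261_23", 273, "2016-02-06"), ("261_23", 248, "2016-02-06"),
    ("261_23", 191, "2016-02-06"), ("261_23", 167, "2016-02-06"),
    ("261_23", 238, "2016-02-06"), ("261_23", 191, "2016-02-06"),
    ("261_23", 320, "2016-02-15"), ("261_23", 286, "2016-02-17"),
    ("261_23", 164, "2016-02-20"), ("261_23", 317, "2016-02-20"),
]
df_long = pd.DataFrame(records, columns=["buyer_id", "item_id", "date"])

# One row per (date, buyer_id), with the purchased item_ids collected in a list.
df = (
    df_long.groupby(["date", "buyer_id"])["item_id"]
    .apply(list)
    .reset_index()
)
print(df)
```

This gives one row per date with a list in `item_id`, which is the shape the rest of the thread works with.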

The set of all unique items is the union of the per-row item sets:

unique_items = set().union(*df.item_id.apply(set))

The number of rows (that is, dates) in which each item appears is

num_appearances = [df.item_id.apply(lambda s: k in s).sum() for k in unique_items]

Therefore, the following creates a dictionary mapping each item to the ratio you asked for:

{k: n / float(len(unique_items))
 for k, n in zip(unique_items, num_appearances)}
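The three steps above can be collected into one helper. This is only a sketch: the function name `appearance_ratios` and the small `example` frame are hypothetical, and it assumes Python 3 (true division) and one row per date with a list of items in each cell.

```python
import pandas as pd

def appearance_ratios(df, column="item_id"):
    """Map each item to (rows containing it) / (number of unique items)."""
    # Turn every list into a set so duplicates inside one row count once.
    item_sets = df[column].apply(set)
    unique_items = set().union(*item_sets)
    n_unique = len(unique_items)
    return {
        item: sum(item in s for s in item_sets) / n_unique
        for item in unique_items
    }

# Tiny illustrative frame (hypothetical subset of the question's data).
example = pd.DataFrame({
    "date": ["2016-01-19", "2016-01-23", "2016-02-15"],
    "item_id": [[188, 188], [188, 142], [320]],
})
ratios = appearance_ratios(example)
print(ratios[188])  # 188 is in 2 of the rows, and there are 3 unique items
```

Converting each row to a set once up front avoids rescanning the raw lists for every item.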

Example

import pandas as pd

df = pd.DataFrame({
    'date': range(5), 
    'item_id': [[188, 188], [188, 142], [188, 264], [273, 248, 191, 167, 238, 191], [320]]})

unique_items = set().union(*df.item_id.apply(set))
>>> unique_items
{142, 167, 188, 191, 238, 248, 264, 273, 320}

num_appearances = [df.item_id.apply(lambda s: k in s).sum() for k in unique_items]
>>> num_appearances
[1, 1, 1, 1, 1, 1, 1, 3, 1]

>>> {k: n / float(len(unique_items))
...  for k, n in zip(unique_items, num_appearances)}
{142: 0.1111111111111111,
 167: 0.1111111111111111,
 188: 0.33333333333333331,
 191: 0.1111111111111111,
 238: 0.1111111111111111,
 248: 0.1111111111111111,
 264: 0.1111111111111111,
 273: 0.1111111111111111,
 320: 0.1111111111111111}
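As an alternative to the approach above (not the same method), the ratio can probably also be computed without building lists at all, by counting distinct dates per item with `groupby(...).nunique()` on long-format data. The sketch below uses the question's full sample; the names `purchases` and `long_df` are my own.

```python
import pandas as pd

# The question's sample data: date -> item_ids bought on that date.
purchases = {
    "2016-01-19": [188, 188],
    "2016-01-23": [188, 142],
    "2016-02-05": [188, 264],
    "2016-02-06": [273, 248, 191, 167, 238, 191],
    "2016-02-15": [320],
    "2016-02-17": [286],
    "2016-02-20": [164, 317],
}

# Long format: one row per purchased item.
long_df = pd.DataFrame(
    [(date, item) for date, items in purchases.items() for item in items],
    columns=["date", "item_id"],
)

# Distinct dates per item, divided by the number of unique items (12 here).
dates_per_item = long_df.groupby("item_id")["date"].nunique()
ratios = dates_per_item / long_df["item_id"].nunique()

print(ratios.loc[188])  # 3 distinct dates / 12 unique items = 0.25
```

Note that item 191 appears twice on 2016-02-06 but still counts as one date, so its ratio is 1/12.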


 