
How to extract data from a pandas dataframe based upon values of other columns?

I have a DataFrame df:

period  store  item
1       32     'A'
1       34     'A'
1       32     'B'
1       34     'B'
2       42     'X'
2       44     'X'
2       42     'Y'
2       44     'Y'

I want to find all the stores for each item in each period, preferably in a dictionary like this:

dicta = {1: {'A': (32, 34),'B': (32, 34)}, 2: {'X': (42, 44),'Y': (42, 44)}}

EDIT For @JEZRAEL

Actual df
       RTYPE  PERIOD_ID  STORE_ID                            MKT MTYPE  RGROUP  RZF  RXF
0        MKT        317     13178                      Kiosks_11  CELL     NaN  NaN  NaN
1        MKT        306     11437                      Kiosks_11  CELL     NaN  NaN  NaN
2        MKT        306     12236                      Kiosks_11  CELL     NaN  NaN  NaN
3        MKT        312     11024                      Kiosks_11  CELL     NaN  NaN  NaN
4        MKT        307     13010                      Kiosks_11  CELL     NaN  NaN  NaN
5        MKT        307     12723                      Kiosks_11  CELL     NaN  NaN  NaN
6        MKT        306     14218                      Kiosks_11  CELL     NaN  NaN  NaN
7        MKT        306     13547                      Kiosks_11  CELL     NaN  NaN  NaN
8        MKT        316     12396                      Kiosks_11  CELL     NaN  NaN  NaN
9        MKT        306     10778                      Cafes_638  CELL     NaN  NaN  NaN
10       MKT        317     11230                      Kiosks_11  CELL     NaN  NaN  NaN
11       MKT        315     13630                      Kiosks_11  CELL     NaN  NaN  NaN
12       MKT        314     14113                        Bars_13  CELL     NaN  NaN  NaN
13       MKT        314     12089                      Kiosks_11  CELL     NaN  NaN  NaN

Here, PERIOD_ID, STORE_ID and MKT correspond to period, store and item respectively. The edit suggested by @jezrael returns this for the above df:

d1={306L: (8207L, 8209L .... 8210L, 8211L),307L:( 8215L, 8219L ... 8233L, 8235L), 308: (8238L, 8239L....8244L, 8252L) ..k:(v) ..}

(Note: Edited to make it look small as the original dictionary is huge)

It works as expected for the sample data, but not for this DataFrame.

Edit for @jezrael as a Minimal, Reproducible Example.

df=

   RTYPE  PERIOD_ID    STORE_ID                       MKT MTYPE  RGROUP  RZF  RXF
0    MKT   20171411  3102300001  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
1    MKT   20171411  3102300002  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
2    MKT   20171411  3104001193              PM Provision  CELL     NaN  NaN  NaN
3    MKT   20171411  3104001193  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
4    MKT   20171411  3104001193    Provision including MM  CELL     NaN  NaN  NaN
5    MKT   20171411  3104001641              PM Provision  CELL     NaN  NaN  NaN
6    MKT   20171411  3104001641  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
7    MKT   20171411  3104001641    Provision including MM  CELL     NaN  NaN  NaN
8    MKT   20171411  3104001682              PM Provision  CELL     NaN  NaN  NaN
9    MKT   20171411  3104001682  PM KA+PM PROV+SMKT+PETRO  CELL     NaN  NaN  NaN
10   MKT   20171411  3104001682    Provision including MM  CELL     NaN  NaN  NaN
11   MKT   20171412  3104001682                   Alcohol  CELL     NaN  NaN  NaN
12   MKT   20171412  3104001682                      Fish  CELL     NaN  NaN  NaN
13   MKT   20171412  3104001684                   Alcohol  CELL     NaN  NaN  NaN
14   MKT   20171412  3104001684                      Fish  CELL     NaN  NaN  NaN

Current output as per @jezrael's code:

{20171411L: ('Provision including MM', 'PM Provision', 'PM KA+PM PROV+SMKT+PETRO'), 20171412L: ('Fish', 'Alcohol')}

Expected Output :

{20171411L: ('Provision including MM', 'PM Provision'), 20171412L: ('Fish', 'Alcohol')}

For period 20171411L, the MKTs 'Provision including MM' and 'PM Provision' are duplicates because they have the same set of store_ids. Likewise, for period 20171412L, the MKTs 'Fish' and 'Alcohol' are duplicates because they have the same set of store_ids.

I am new to Pandas but have some basic knowledge about Python. Really not sure how I can achieve this. Any help will be great.

Create a Series with a MultiIndex, then build the nested dictionary with a dictionary comprehension:

s = df.groupby(['period','item'])['store'].apply(tuple)

d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
{1: {'A': (32, 34), 'B': (32, 34)}, 2: {'X': (42, 44), 'Y': (42, 44)}}
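For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. It assumes the question's sample table with plain string items (no literal quote characters) and rebuilds the DataFrame by hand, so the construction step is illustrative rather than part of the original answer.

```python
import pandas as pd

# Sample data copied from the question; items are assumed to be plain strings.
df = pd.DataFrame({
    "period": [1, 1, 1, 1, 2, 2, 2, 2],
    "store": [32, 34, 32, 34, 42, 44, 42, 44],
    "item": ["A", "A", "B", "B", "X", "X", "Y", "Y"],
})

# One tuple of stores per (period, item) pair, indexed by a MultiIndex.
s = df.groupby(["period", "item"])["store"].apply(tuple)

# Slice the Series by each period to get an {item: stores} dict per period.
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
print(d)
```

The outer keys come from the first level of the MultiIndex, so they may be NumPy integers rather than Python ints depending on the pandas/NumPy versions; they still compare equal to 1 and 2.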

EDIT: You can group by period, convert each group's items to a set, and then to a tuple (set order is arbitrary, so the order within each tuple may differ):

d1 = {k:tuple(set(v)) for k, v in df.groupby('period')['item']}
print(d1)
{1: ('A', 'B'), 2: ('X', 'Y')}

d1 = df.groupby('period')['item'].apply(lambda x: tuple(set(x))).to_dict()
print(d1)
{1: ('A', 'B'), 2: ('X', 'Y')}
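The set-based version collects every item in a period, which is why 'PM KA+PM PROV+SMKT+PETRO' shows up in the question's current output. If the goal is to keep only the MKTs that share an identical set of stores within a period, as the question's expected output describes, one possible sketch is below. It is not from the original answer: the DataFrame is a hand-typed reduction of the question's last sample to three columns, and the names `stores`, `shared` and `result` are made up for illustration.

```python
import pandas as pd

# Reduced copy of the question's last sample (PERIOD_ID, STORE_ID, MKT only).
df = pd.DataFrame({
    "PERIOD_ID": [20171411] * 11 + [20171412] * 4,
    "STORE_ID": [3102300001, 3102300002,
                 3104001193, 3104001193, 3104001193,
                 3104001641, 3104001641, 3104001641,
                 3104001682, 3104001682, 3104001682,
                 3104001682, 3104001682, 3104001684, 3104001684],
    "MKT": ["PM KA+PM PROV+SMKT+PETRO", "PM KA+PM PROV+SMKT+PETRO",
            "PM Provision", "PM KA+PM PROV+SMKT+PETRO", "Provision including MM",
            "PM Provision", "PM KA+PM PROV+SMKT+PETRO", "Provision including MM",
            "PM Provision", "PM KA+PM PROV+SMKT+PETRO", "Provision including MM",
            "Alcohol", "Fish", "Alcohol", "Fish"],
})

# Stores per (period, MKT) as a sorted tuple, so equal store sets compare equal.
stores = (df.groupby(["PERIOD_ID", "MKT"])["STORE_ID"]
            .apply(lambda x: tuple(sorted(set(x))))
            .reset_index(name="stores"))

# Keep only MKTs whose store set also occurs for another MKT in the same period.
shared = stores[stores.duplicated(["PERIOD_ID", "stores"], keep=False)]

# Sort the MKT names so the tuple order is deterministic.
result = {period: tuple(sorted(group))
          for period, group in shared.groupby("PERIOD_ID")["MKT"]}
print(result)
```

On this sample, `result` pairs 20171411 with 'PM Provision' and 'Provision including MM', and 20171412 with 'Alcohol' and 'Fish'. Note that if a period contains two separate groups of duplicated MKTs, this sketch merges them into one tuple; grouping `shared` by both `PERIOD_ID` and `stores` would keep them apart.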

You can do this with a dict comprehension:

dicta = {p: g.groupby('item')['store'].apply(tuple).to_dict()
         for p, g in df.groupby('period')}

[out]

{1: {"'A'": (32, 34), "'B'": (32, 34)}, 2: {"'X'": (42, 44), "'Y'": (42, 44)}}
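The keys in the output above still carry the quote characters (`"'A'"`), which suggests the item column was read with literal single quotes in it. Assuming that is the case, a small sketch of stripping them before grouping could look like this (the shortened DataFrame here is illustrative only):

```python
import pandas as pd

# Assumption: items were parsed with literal single quotes, e.g. "'A'".
df = pd.DataFrame({
    "period": [1, 1, 2, 2],
    "store": [32, 34, 42, 44],
    "item": ["'A'", "'A'", "'X'", "'X'"],
})

# Remove leading/trailing single quotes so the keys become plain strings.
df["item"] = df["item"].str.strip("'")

dicta = {p: g.groupby("item")["store"].apply(tuple).to_dict()
         for p, g in df.groupby("period")}
print(dicta)
```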


 