How to create a dataframe from a python groupby

My first dataframe contains a column holding a unique ID for each card (card_id):

treino.head(5)

            card_id   feature_1   feature_2   feature_3
0   C_ID_92a2005557           5         2             1
1   C_ID_3d0044924f           4         1             0
2   C_ID_d639edf6cd           2         2             0
3   C_ID_186d6a6901           4         3             0
4   C_ID_cdbd2c0db2           1         3             0

My second dataframe is the history of where these cards were used:

df2.head(5)

   authorized_flag          card_id  city_id category_1      merchant_id
0                Y  C_ID_92a2005557       88          N  M_ID_e020e9b302
1                Y  C_ID_d639edf6cd       88          N  M_ID_86ec983688
2                Y  C_ID_92a2005557       88          N  M_ID_979ed661fc
3                Y  C_ID_92a2005557       88          N  M_ID_e6d5ae8ea6
4                Y  C_ID_92a2005557       88          N  M_ID_e020e9b302
5                Y  C_ID_4e6213e9bc      333          N  M_ID_50af771f8d
6                Y  C_ID_92a2005557       88          N  M_ID_5e8220e564
7                Y  C_ID_4e6213e9bc        3          N  M_ID_9d41786a50
8                Y  C_ID_d639edf6cd       88          N  M_ID_979ed661fc

When I use:

merged_left = pd.merge(left=df1, right=df2, how='left', left_on='card_id', right_on='card_id')

the rows for each card_id are multiplied, because a card_id appears several times in the second dataframe. I already used a left join so that only the card_id values from the first dataframe are kept, but the problem remains.

I understand that the rows are multiplied because df2 is a purchase history in which each card_id appears several times, but I cannot work out how to solve it.

I already tried something like:

df2.groupby(['card_id', 'merchant_id']).size().reset_index()

but I still get several rows for the same card_id. Could you help me create a dataframe with only one row for each unique card_id and merchant_id? Will I have to create a new variable that summarizes their data?

If you just want a list of card_id / merchant_id pairs (which card has bought something from which merchant), it is enough to take the data from df2:

df2[['card_id', 'merchant_id']].drop_duplicates()

As you can see, no groupby is needed: just select the columns in question and drop the duplicates.
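To make the drop_duplicates approach concrete, here is a minimal sketch that rebuilds only the card_id and merchant_id columns from the nine df2 rows shown in the question. The variable name `pairs` is made up for illustration.

```python
import pandas as pd

# The card_id / merchant_id columns of the nine sample rows from the question.
df2 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_d639edf6cd", "C_ID_92a2005557",
        "C_ID_92a2005557", "C_ID_92a2005557", "C_ID_4e6213e9bc",
        "C_ID_92a2005557", "C_ID_4e6213e9bc", "C_ID_d639edf6cd",
    ],
    "merchant_id": [
        "M_ID_e020e9b302", "M_ID_86ec983688", "M_ID_979ed661fc",
        "M_ID_e6d5ae8ea6", "M_ID_e020e9b302", "M_ID_50af771f8d",
        "M_ID_5e8220e564", "M_ID_9d41786a50", "M_ID_979ed661fc",
    ],
})

# Keep one row per distinct (card_id, merchant_id) pair.
pairs = df2[["card_id", "merchant_id"]].drop_duplicates().reset_index(drop=True)

print(len(df2), "rows in df2,", len(pairs), "unique pairs")
```

In this sample, rows 0 and 4 are the same pair, so one of them is dropped. A card_id still appears on more than one row whenever the card was used at more than one merchant.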

A slightly more complex case is when you want, for example, how many times a particular card_id has bought something from a particular merchant_id. Then groupby is needed, and the size() function gives you the count:

df2.groupby(['card_id', 'merchant_id']).size()

optionally followed by .reset_index(), as you did.

Of course, a particular card_id still occurs in several output rows, but each time with a different merchant_id (and the number of transactions between the two).
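The counting step can be sketched on the same sample rows from the question. The column name `n_transactions` is an assumption chosen for readability; passing `name=` to `reset_index()` simply labels the count column instead of leaving it as `0`.

```python
import pandas as pd

# The card_id / merchant_id columns of the nine sample rows from the question.
df2 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_d639edf6cd", "C_ID_92a2005557",
        "C_ID_92a2005557", "C_ID_92a2005557", "C_ID_4e6213e9bc",
        "C_ID_92a2005557", "C_ID_4e6213e9bc", "C_ID_d639edf6cd",
    ],
    "merchant_id": [
        "M_ID_e020e9b302", "M_ID_86ec983688", "M_ID_979ed661fc",
        "M_ID_e6d5ae8ea6", "M_ID_e020e9b302", "M_ID_50af771f8d",
        "M_ID_5e8220e564", "M_ID_9d41786a50", "M_ID_979ed661fc",
    ],
})

# One row per (card_id, merchant_id) pair, with the number of transactions.
# "n_transactions" is a hypothetical column name.
counts = (
    df2.groupby(["card_id", "merchant_id"])
       .size()
       .reset_index(name="n_transactions")
)

print(counts)
```

Each (card_id, merchant_id) pair now occurs exactly once, but the result is still not one row per card_id.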

So decide what information you want besides card_id and merchant_id. That determines what code is needed to produce the answer.
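If the goal is to keep exactly one row per card_id after the merge, one possible approach is to summarize df2 per card first and merge the summary afterwards. The sketch below assumes two summary columns, `n_transactions` and `n_merchants`, which are made-up names. Which summaries are actually useful depends on the question you want to answer. Only `card_id` and `feature_1` from the first dataframe are reproduced here.

```python
import pandas as pd

# First dataframe: one row per card (only feature_1 is copied from the question).
df1 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_3d0044924f", "C_ID_d639edf6cd",
        "C_ID_186d6a6901", "C_ID_cdbd2c0db2",
    ],
    "feature_1": [5, 4, 2, 4, 1],
})

# Second dataframe: purchase history, several rows per card.
df2 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_d639edf6cd", "C_ID_92a2005557",
        "C_ID_92a2005557", "C_ID_92a2005557", "C_ID_4e6213e9bc",
        "C_ID_92a2005557", "C_ID_4e6213e9bc", "C_ID_d639edf6cd",
    ],
    "merchant_id": [
        "M_ID_e020e9b302", "M_ID_86ec983688", "M_ID_979ed661fc",
        "M_ID_e6d5ae8ea6", "M_ID_e020e9b302", "M_ID_50af771f8d",
        "M_ID_5e8220e564", "M_ID_9d41786a50", "M_ID_979ed661fc",
    ],
})

# Collapse the history to one row per card_id (hypothetical summary columns).
per_card = (
    df2.groupby("card_id")
       .agg(
           n_transactions=("merchant_id", "size"),    # number of history rows
           n_merchants=("merchant_id", "nunique"),    # number of distinct merchants
       )
       .reset_index()
)

# card_id is unique on both sides, so the left join cannot multiply rows;
# validate= makes pandas raise an error if that assumption is ever violated.
merged = df1.merge(per_card, how="left", on="card_id", validate="one_to_one")

# Cards with no history get NaN from the left join; treat that as zero here.
summary_cols = ["n_transactions", "n_merchants"]
merged[summary_cols] = merged[summary_cols].fillna(0).astype(int)

print(merged)
```

The left join keeps every card from df1, and because the summary has a single row per card_id, the merged frame has the same number of rows as df1.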
