My first dataframe contains a column consisting of a unique ID of a card (card_id):
treino.head(5)
card_id feature_1 feature_2 feature_3
0 C_ID_92a2005557 5 2 1
1 C_ID_3d0044924f 4 1 0
2 C_ID_d639edf6cd 2 2 0
3 C_ID_186d6a6901 4 3 0
4 C_ID_cdbd2c0db2 1 3 0
My second dataframe is the history of where these cards were used:
df2.head(5)
authorized_flag card_id city_id category_1 merchant_id
0 Y C_ID_92a2005557 88 N M_ID_e020e9b302
1 Y C_ID_d639edf6cd 88 N M_ID_86ec983688
2 Y C_ID_92a2005557 88 N M_ID_979ed661fc
3 Y C_ID_92a2005557 88 N M_ID_e6d5ae8ea6
4 Y C_ID_92a2005557 88 N M_ID_e020e9b302
5 Y C_ID_4e6213e9bc 333 N M_ID_50af771f8d
6 Y C_ID_92a2005557 88 N M_ID_5e8220e564
7 Y C_ID_4e6213e9bc 3 N M_ID_9d41786a50
8 Y C_ID_d639edf6cd 88 N M_ID_979ed661fc
When I use:
merged_left = pd.merge(left=df1, right=df2, how='left', on='card_id')
the rows for each card_id are multiplied, because a card_id appears several times in the second dataframe. I already made it a left join so that only the card_id values from the first dataframe are kept, but the problem persists.
I understand that the rows are multiplied because df2 is a purchase history in which each card_id appears several times, but I cannot work out how to solve it.
I already tried something like:
df2.groupby(['card_id', 'merchant_id']).size().reset_index()
but I still get several rows for the same card_id. Could you help me create a dataframe with only one row for each unique card_id and merchant_id? Will I have to create a new variable that summarizes the data?
If you just want a list of card_id / merchant_id pairs (which card has bought something from which merchant), it is enough to take the data from df2:
df2[['card_id', 'merchant_id']].drop_duplicates()
As you can see, no groupby is needed, just read the columns in question and drop duplicates.
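As a minimal sketch of the above, the snippet below rebuilds only the card_id and merchant_id columns of the df2 sample shown in the question and drops the repeated pairs. The variable name pairs is just an illustrative choice.

```python
import pandas as pd

# Rebuilt from the df2 sample in the question (only the two relevant columns).
df2 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_d639edf6cd", "C_ID_92a2005557",
        "C_ID_92a2005557", "C_ID_92a2005557", "C_ID_4e6213e9bc",
        "C_ID_92a2005557", "C_ID_4e6213e9bc", "C_ID_d639edf6cd",
    ],
    "merchant_id": [
        "M_ID_e020e9b302", "M_ID_86ec983688", "M_ID_979ed661fc",
        "M_ID_e6d5ae8ea6", "M_ID_e020e9b302", "M_ID_50af771f8d",
        "M_ID_5e8220e564", "M_ID_9d41786a50", "M_ID_979ed661fc",
    ],
})

# Keep each (card_id, merchant_id) pair once; the index is renumbered for readability.
pairs = df2[["card_id", "merchant_id"]].drop_duplicates().reset_index(drop=True)
print(pairs)
```

In this sample only one pair is repeated (card C_ID_92a2005557 at merchant M_ID_e020e9b302), so the nine history rows shrink to eight.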
A slightly more complex case is when you want, e.g., how many times a particular card_id has bought something from a particular merchant_id. Then groupby is needed, and you get the wanted value using the size() function:
df2.groupby(['card_id', 'merchant_id']).size()
possibly followed by .reset_index(), as you did.
Of course, a particular card_id occurs in several output rows, but each time with a different merchant_id (and the number of transactions between the two).
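For illustration, here is the counting variant run on the same sample rows from the question. The column name n_transactions is an assumption made for this sketch; without the name argument, reset_index() labels the count column 0.

```python
import pandas as pd

# Rebuilt from the df2 sample in the question (only the two relevant columns).
df2 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_d639edf6cd", "C_ID_92a2005557",
        "C_ID_92a2005557", "C_ID_92a2005557", "C_ID_4e6213e9bc",
        "C_ID_92a2005557", "C_ID_4e6213e9bc", "C_ID_d639edf6cd",
    ],
    "merchant_id": [
        "M_ID_e020e9b302", "M_ID_86ec983688", "M_ID_979ed661fc",
        "M_ID_e6d5ae8ea6", "M_ID_e020e9b302", "M_ID_50af771f8d",
        "M_ID_5e8220e564", "M_ID_9d41786a50", "M_ID_979ed661fc",
    ],
})

# One row per (card_id, merchant_id) pair, with the number of history rows for that pair.
counts = (
    df2.groupby(["card_id", "merchant_id"])
       .size()
       .reset_index(name="n_transactions")  # hypothetical column name
)
print(counts)
```

The result still has several rows per card_id, one for each merchant that card has used, which is exactly what the question observed.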
So decide what information you want besides card_id and merchant_id. That determines what code is needed to produce the answer.
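If the goal is to keep exactly one row per card_id after the merge, one possible approach is to summarize df2 per card first and only then join it to the first dataframe. The sketch below assumes that a transaction count and a count of distinct merchants are the summaries of interest; the names summary, n_transactions and n_merchants are hypothetical, and df1 holds only the card_id column of the first sample.

```python
import pandas as pd

# card_id values taken from the first dataframe shown in the question.
df1 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_3d0044924f", "C_ID_d639edf6cd",
        "C_ID_186d6a6901", "C_ID_cdbd2c0db2",
    ],
})

# Rebuilt from the df2 sample in the question (only the two relevant columns).
df2 = pd.DataFrame({
    "card_id": [
        "C_ID_92a2005557", "C_ID_d639edf6cd", "C_ID_92a2005557",
        "C_ID_92a2005557", "C_ID_92a2005557", "C_ID_4e6213e9bc",
        "C_ID_92a2005557", "C_ID_4e6213e9bc", "C_ID_d639edf6cd",
    ],
    "merchant_id": [
        "M_ID_e020e9b302", "M_ID_86ec983688", "M_ID_979ed661fc",
        "M_ID_e6d5ae8ea6", "M_ID_e020e9b302", "M_ID_50af771f8d",
        "M_ID_5e8220e564", "M_ID_9d41786a50", "M_ID_979ed661fc",
    ],
})

# Collapse the history to one row per card_id (assumed summaries, for illustration).
summary = (
    df2.groupby("card_id")
       .agg(
           n_transactions=("merchant_id", "size"),    # rows in the history
           n_merchants=("merchant_id", "nunique"),    # distinct merchants
       )
       .reset_index()
)

# card_id is now unique on both sides, so the left join cannot multiply rows.
# validate raises MergeError if that assumption is ever violated.
merged = df1.merge(summary, on="card_id", how="left", validate="one_to_one")
print(merged)
```

Cards from df1 that never appear in df2 get NaN in the summary columns; whether to fill those with 0 depends on how the result is used.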