How to fill in values for a DataFrame column by matching values from another DataFrame in pandas

I'm new to Python and am working with the Kaggle Titanic dataset to practice.

I'm trying to fill in a few missing values for the Cabin feature by using rows that share the same ticket. That is, I want to get the duplicated tickets with their corresponding cabin values, and replace each null cabin with the cabin value recorded for the same ticket.

My approach was to build a dataframe that keeps one occurrence of each duplicated ticket, restricted to rows where the cabin is non-null, so that each such ticket maps to a single cabin value. This way I could fill in the cabin values in the training set (maindf) by matching on ticket:

ticket_dupl = maindf[(maindf.duplicated('Ticket')) & (maindf['Cabin'].notnull())][['Ticket','Cabin']].drop_duplicates('Ticket')

This gives me a dataframe of length 50 with the index preserved. Here are the first 7 rows:

     Ticket    Cabin
88   19950     C23 C25 C27
124  35281     D26
137  113803    C123
193  230080    F2
195  PC 17569  B80
230  36973     C83
251  347054    G6

Is there a way to fill in some cabin values in my maindf by matching on ticket rows or indices, while preserving the values for tickets that don't match? I couldn't work it out from the answers to similar questions.

Also, I was wondering if there was a more efficient way of achieving my goal instead of creating a dataframe like I did. Thanks.

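You can group by Ticket so that rows sharing a ticket are handled together, then fill each group's null Cabin values with the group's first non-null value, located with first_valid_index. Note that first_valid_index returns None when a group has no cabin value at all, so those groups must be returned unchanged; otherwise the .loc lookup raises a KeyError. Here df is the question's maindf.

df['Cabin'] = df.groupby('Ticket')['Cabin'].transform(lambda x: x.fillna(x.loc[x.first_valid_index()]) if x.notna().any() else x)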
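
As a rough illustration of the same idea, here is a minimal sketch on a toy frame. The rows below are invented stand-ins for maindf, and the names first_cabin, lookup, filled_groupby and filled_map are hypothetical. It shows two options that should give the same result: a groupby fill that needs no helper dataframe, and a Ticket-to-Cabin lookup applied with map, which is closer to the ticket_dupl approach in the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for maindf; these rows are invented for illustration only.
maindf = pd.DataFrame({
    "Ticket": ["19950", "19950", "35281", "35281", "A/5 21171", "113803"],
    "Cabin": ["C23 C25 C27", np.nan, np.nan, "D26", np.nan, "C123"],
})

# Option 1: no helper dataframe.
# transform("first") broadcasts each ticket's first non-null Cabin to every
# row of that ticket; a ticket with no cabin at all stays missing.
first_cabin = maindf.groupby("Ticket")["Cabin"].transform("first")
# fillna only touches rows where Cabin is currently null.
filled_groupby = maindf["Cabin"].fillna(first_cabin)

# Option 2: build a Ticket -> Cabin lookup Series, then map it onto Ticket.
lookup = (
    maindf.dropna(subset=["Cabin"])   # keep rows that have a cabin
          .drop_duplicates("Ticket")  # one row per ticket
          .set_index("Ticket")["Cabin"]
)
filled_map = maindf["Cabin"].fillna(maindf["Ticket"].map(lookup))

print(filled_groupby.tolist())
```

With the dataframe from the question, the second option would presumably read maindf['Cabin'] = maindf['Cabin'].fillna(maindf['Ticket'].map(ticket_dupl.set_index('Ticket')['Cabin'])). In both options, cabins that are already filled are preserved, and tickets without any known cabin remain null.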
