I'm new to Python and am working with the Kaggle Titanic dataset to practice.
I'm trying to fill in a few missing values for the Cabin feature by using rows that share the same ticket. That is, I want to get the duplicated tickets with their corresponding cabin values, and replace the null cabin values with the cabin value belonging to the same ticket.
My approach was to build a dataframe, using the code below, that holds a single row per duplicated ticket (keeping only rows where the cabin is non-null), so that each such ticket maps to one cabin value. This way I could fill in the cabin values in the training set (maindf) by matching on ticket.
ticket_dupl = maindf[(maindf.duplicated('Ticket')) & (maindf['Cabin'].notnull())][['Ticket','Cabin']].drop_duplicates('Ticket')
This gives me a dataframe of length 50 with the index preserved. Here are the first 7 rows:
     Ticket    Cabin
88   19950     C23 C25 C27
124  35281     D26
137  113803    C123
193  230080    F2
195  PC 17569  B80
230  36973     C83
251  347054    G6
Is there a way to fill in some cabin values in my maindf by matching on ticket rows or indices, while preserving the values for which tickets don't match? I can't seem to work it out from the answers to questions similar to mine.
Also, is there a more efficient way of achieving my goal than creating a dataframe like I did? Thanks.
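You can group by ticket to bring rows with matching tickets together, then fill the nulls in each group with the value at first_valid_index, which returns the index label of the group's first non-null value. It returns None when a group has no cabin value at all, so that case has to be guarded, otherwise the .loc lookup fails. Here df is your maindf:
df['Cabin'] = df.groupby('Ticket')['Cabin'].transform(lambda x: x if x.first_valid_index() is None else x.fillna(x.loc[x.first_valid_index()]))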
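As a rough sketch of the same idea without a lambda: the groupby "first" aggregation skips nulls, so it can be broadcast back with transform and passed to fillna. The small maindf below is a made-up stand-in for the real training set, and first_cabin, lookup, filled and filled_via_map are names chosen only for illustration. The second variant reuses a ticket-to-cabin lookup table, similar in spirit to the ticket_dupl dataframe from the question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for maindf; these rows are invented for illustration only.
maindf = pd.DataFrame({
    "Ticket": ["19950", "19950", "35281", "35281", "A/5 21171", "113803"],
    "Cabin": ["C23 C25 C27", np.nan, np.nan, "D26", np.nan, "C123"],
})

# First non-null Cabin per Ticket, broadcast back to every row of its group.
# A ticket with no cabin at all simply gets a missing value here.
first_cabin = maindf.groupby("Ticket")["Cabin"].transform("first")

# Fill only the rows that are currently null; existing values are untouched.
filled = maindf["Cabin"].fillna(first_cabin)

# Variant: build a Ticket -> Cabin lookup Series and map it onto the tickets.
lookup = (
    maindf.dropna(subset=["Cabin"])
    .drop_duplicates("Ticket")
    .set_index("Ticket")["Cabin"]
)
filled_via_map = maindf["Cabin"].fillna(maindf["Ticket"].map(lookup))

print(filled)
```

Both variants leave a cabin null when no other row with the same ticket has a value, and neither overwrites a cabin that is already present. To apply the result, assign it back with maindf["Cabin"] = filled.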