I am currently working on cleaning up a car emissions data set. This is what the data set looks like (only the first few rows are included):
import pandas as pd

cars_em_df = pd.DataFrame({
    'manufacturer_name_mapped': ['FIAT', 'FIAT', 'FIAT', 'FIAT', 'FIAT'],
    'commercial_name': ['124 gt multiair auto', '500l wagon pop star t-jet',
                        'doblo combi 1.4 95', 'panda 0.9t sge 85 natural power',
                        'punto 1.4 77 lpg'],
    'fuel_type_mapped': ['Petrol', 'Petrol', 'Petrol', 'NG-Biomethane', 'LPG'],
    'file_year': [2018, 2018, 2018, 2018, 2018],
    'emissions': [153, 158, 165, 86, 114],
})
I am mostly interested in the column 'commercial_name'. The end goal is to add another column to this dataframe that shows the 'cleaned up' version of 'commercial_name'. I have a separate pandas Series that contains the 'correct' names that should be used instead of these 'messy' names.
real_model_names = pd.Series(['uno', '147', 'panda', 'punto', '166', '4c', 'brera', 'giulia',
                              'giulietta', 'gtv'])
These are all strings as well. As an example, for every row of 'commercial_name' I would like to check whether it contains any of the names from the 'real_model_names' series. E.g. 'punto' from 'real_model_names' can be found in the entry 'punto 1.4 77 lpg' in the 'commercial_name' column, so I would like a new column in cars_em_df to show 'punto' next to it. If no name can be found, I would like the original 'messy' name to be shown.
I tried to define a function that I would then apply along the 'commercial_name' column. I tried this:
def str_ops(series):
    for i in real_model_names:
        if i in series:
            return series.replace(series, i)
        else:
            return series
And as a next step I would apply this function and add it to the dataframe as a new column:
commercial_name_cleaned = cars_em_df.commercial_name.apply(str_ops)
cars_em_df.insert(3,value=commercial_name_cleaned,column='commercial_name_cleaned')
However, this just doesn't do anything. The new column just shows the exact same entries as 'commercial_name'.
Does anyone know how to solve this problem? Is there a better way to do this?
Thanks a lot in advance!
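Your loop was on the right track. The problem is that the else branch returns the original string on the very first iteration, so only the first name ('uno') is ever checked. Moving the fallback return outside the loop fixes this, and it is the most readable and direct way I can think of to do it: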
def str_ops(x):
    for y in real_model_names:
        if y in x:
            return y
    return x

cars_em_df['commercial_name_cleaned'] = cars_em_df['commercial_name'].apply(str_ops)
# Result
cars_em_df

   manufacturer_name_mapped                  commercial_name  fuel_type_mapped  file_year  emissions    commercial_name_cleaned
0                      FIAT             124 gt multiair auto            Petrol       2018        153       124 gt multiair auto
1                      FIAT        500l wagon pop star t-jet            Petrol       2018        158  500l wagon pop star t-jet
2                      FIAT               doblo combi 1.4 95            Petrol       2018        165         doblo combi 1.4 95
3                      FIAT  panda 0.9t sge 85 natural power     NG-Biomethane       2018         86                      panda
4                      FIAT                 punto 1.4 77 lpg               LPG       2018        114                      punto
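
As for whether there is a better way: if the data set is large, a vectorized version may be worth trying. The sketch below is one possible alternative, not part of the answer above. It builds a single regular expression from real_model_names and uses Series.str.extract, then falls back to the original name with fillna. Two things differ from the loop and are assumptions on my side: the pattern uses word boundaries (\b), so a name only matches as a whole word, and the match returned is the leftmost one in the string rather than the first one in real_model_names order. The variable names `names`, `pattern` and `extracted` are just illustrative.

```python
import re
import pandas as pd

# Reduced version of the question's data (only the column that matters here)
cars_em_df = pd.DataFrame({
    'commercial_name': ['124 gt multiair auto', '500l wagon pop star t-jet',
                        'doblo combi 1.4 95', 'panda 0.9t sge 85 natural power',
                        'punto 1.4 77 lpg'],
})
real_model_names = pd.Series(['uno', '147', 'panda', 'punto', '166', '4c', 'brera',
                              'giulia', 'giulietta', 'gtv'])

# Put longer names first so that e.g. 'giulietta' is tried before 'giulia'
names = sorted(real_model_names, key=len, reverse=True)

# One alternation pattern with a single capture group; re.escape guards
# against names that contain regex metacharacters
pattern = r'\b(' + '|'.join(re.escape(n) for n in names) + r')\b'

# expand=False returns a Series; rows without a match become NaN
extracted = cars_em_df['commercial_name'].str.extract(pattern, expand=False)

# Fall back to the original 'messy' name where nothing matched
cars_em_df['commercial_name_cleaned'] = extracted.fillna(cars_em_df['commercial_name'])
```

On the sample rows this gives the same cleaned column as the loop version. If you need plain substring matching instead of whole-word matching, drop the two \b parts of the pattern.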