Matching two pandas series: How to find a string element from one series in another series and then create a new column

I am currently working on cleaning up a car emissions data set. This is what the data set looks like (only the first few rows are included):

import pandas as pd

cars_em_df = pd.DataFrame({'manufacturer_name_mapped': ['FIAT', 'FIAT','FIAT','FIAT','FIAT'],
'commercial_name':['124 gt multiair auto', '500l wagon pop star t-jet', 
'doblo combi 1.4 95', 'panda  0.9t sge 85 natural power', 'punto 1.4  77 lpg'],
'fuel_type_mapped':['Petrol', 'Petrol', 'Petrol', 'NG-Biomethane', 'LPG'],
'file_year':[2018, 2018, 2018, 2018, 2018], 'emissions': [153,158,165,86,114]})

I am mostly interested in the column 'commercial_name'. The end goal is to add another column to this dataframe that shows the 'cleaned up' version of 'commercial_name'. I have a separate pandas series that contains the 'correct' names that should be used instead of these 'messy' names.

real_model_names = pd.Series(['uno', '147', 'panda', 'punto', '166', '4c', 'brera', 'giulia',
'giulietta', 'gtv'])

These are all strings as well. As an example, I would like to check, for every row of 'commercial_name', whether it contains any of the names from the 'real_model_names' series. E.g. 'punto' from 'real_model_names' can be found in the entry 'punto 1.4 77 lpg' from the 'commercial_name' column, so I would like a new column in cars_em_df to show 'punto' next to it. If no name can be found, I would like the original 'messy' name to be shown.

I tried to define a function that I would then apply along the 'commercial_name' column. I tried this:

def str_ops(series):
   for i in real_model_names:
      if i in series:
         return series.replace(series, i)
      else:
         return series

And as a next step I would apply this function and add it to the dataframe as a new column:

commercial_name_cleaned = cars_em_df.commercial_name.apply(str_ops)
cars_em_df.insert(3,value=commercial_name_cleaned,column='commercial_name_cleaned') 

However, this just doesn't do anything. The new column just shows the exact same entries as 'commercial_name'.

Does anyone know how to solve this problem? Is there a better way to do this?

Thanks a lot in advance!

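Your loop was on the right track. The problem is the 'else' branch: it returns inside the loop, so the function exits after checking only the first name ('uno') and every row comes back unchanged. Move the fallback return outside the loop so that it runs only after all names have been checked. The most readable and direct way I can think of to do this: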

def str_ops(x):
    for y in real_model_names: 
        if y in x: 
            return y 
    return x

cars_em_df['commercial_name_cleaned'] = cars_em_df['commercial_name'].apply(str_ops)

# Result
cars_em_df
  manufacturer_name_mapped                   commercial_name fuel_type_mapped  file_year  emissions    commercial_name_cleaned
0                     FIAT              124 gt multiair auto           Petrol       2018        153       124 gt multiair auto
1                     FIAT         500l wagon pop star t-jet           Petrol       2018        158  500l wagon pop star t-jet
2                     FIAT                doblo combi 1.4 95           Petrol       2018        165         doblo combi 1.4 95
3                     FIAT  panda  0.9t sge 85 natural power    NG-Biomethane       2018         86                      panda
4                     FIAT                 punto 1.4  77 lpg              LPG       2018        114                      punto
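
If the data set is large, a vectorized variant may be worth trying. The following is only a sketch of one possible alternative, not part of the original answer: it joins the names into a single regular expression and uses Series.str.extract, then falls back to the original name with fillna. The variable names 'pattern' and 'extracted' are made up for illustration. Note that it behaves slightly differently from the loop: the \b word boundaries mean a name only matches as a whole word, and when several names occur in one entry the one appearing first in the string wins, not the one listed first in real_model_names.

```python
import re
import pandas as pd

cars_em_df = pd.DataFrame({
    'commercial_name': ['124 gt multiair auto', '500l wagon pop star t-jet',
                        'doblo combi 1.4 95', 'panda  0.9t sge 85 natural power',
                        'punto 1.4  77 lpg'],
})
real_model_names = pd.Series(['uno', '147', 'panda', 'punto', '166', '4c',
                              'brera', 'giulia', 'giulietta', 'gtv'])

# One alternation pattern with a single capture group; re.escape guards
# against names that contain regex metacharacters.
pattern = r'\b(' + '|'.join(re.escape(name) for name in real_model_names) + r')\b'

# With expand=False and one group, str.extract returns a Series (NaN where nothing matched).
extracted = cars_em_df['commercial_name'].str.extract(pattern, expand=False)

# Keep the original 'messy' name wherever no model name was found.
cars_em_df['commercial_name_cleaned'] = extracted.fillna(cars_em_df['commercial_name'])
```

For the five sample rows this should give the same cleaned column as the apply-based version above.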
