
Python 2.7: Create a new df column with substring from string in another column

I have a pandas dataframe, and I would like to create a new column with a substring of a string contained in a column.

For example, the "race" column contains the value "2016_Lap_JAPANESE_Third_Times.csv", and I would like to extract the word 'japanese'.

My current approach is to check whether a word from a list appears in the string and, if so, write that value to the new column.

race_names = ['japanese']  (the real list is long, and the "race" column contains many different names)

    for i, row in df_fp2.iterrows():
        for name in race_names:
            # compare in lower case, because the file names are mostly upper case
            if name in row['race'].lower():
                df_fp2.loc[i, 'name'] = str(name) + " Grand Prix"

The df converted to a dictionary:

{'driverRef': {151: 'button',
  152: 'button',
  153: 'button',
  154: 'button',
  155: 'button'},
 'driver_no': {151: 22, 152: 22, 153: 22, 154: 22, 155: 22},
 'milliseconds': {151: 1339994.0,
  152: 692245.0,
  153: 96286.0,
  154: 94547.999999999985,
  155: 114725.0},
 'name': {151: 'J.BUTTON',
  152: 'J.BUTTON',
  153: 'J.BUTTON',
  154: 'J.BUTTON',
  155: 'J.BUTTON'},
 'race': {151: '2016_Lap_JAPANESE_Third_Times.csv',
  152: '2016_Lap_JAPANESE_Third_Times.csv',
  153: '2016_Lap_JAPANESE_Third_Times.csv',
  154: '2016_Lap_JAPANESE_Third_Times.csv',
  155: '2016_Lap_JAPANESE_Third_Times.csv'},
 'time': {151: 1339.9939999999999,
  152: 692.245,
  153: 96.286000000000001,
  154: 94.547999999999988,
  155: 114.72499999999999}}

This is the array of unique values in the "race" column of the df. Because the position of the country name varies, I cannot simply strip a fixed prefix and suffix from each string.

array(['2016_Lap_ABU_Third_Times.csv', '2016_Lap_BRASIL_Third_Times.csv',
       '2016_Lap_CHINESE_Third_Times.csv',
       '2016_Lap_JAPANESE_Third_Times.csv',
       '2016_Lap_MAGYAR_Third_Times.csv',
       '2016_Lap_SINGAPORE_Third_Times.csv', '2016_Lap_Third_Times.csv',
       '2016_Lap_UNITED_Third_Times.csv',
       'AUSTRALIAN_2016_Lap_Third_Times.csv',
       'BAHRAIN_2016_Lap_Third_Times.csv',
       'BELGIAN_2016_Lap_Third_Times.csv',
       'CANADA_2016_Lap_Third_Times.csv',
       'ESPANA_2016_Lap_Third_Times.csv',
       'EUROPE_2016_Lap_Third_Times.csv',
       'MALAYSIA_2016_Lap_Third_Times.csv',
       'Mexico_2016_Lap_Third_Times.csv',
       'RUSSIAN_2016_Lap_Third_Times.csv'], dtype=object)

If race_names contains every word you might need to extract, use str.extract:

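import re
race_names = ['japanese']
# build an alternation of all names; re.escape protects any regex metacharacters
pat = '|'.join(re.escape(x) for x in race_names)
# flags=re.I makes the match case-insensitive; rows without a match become NaN
df['name'] = df['race'].str.extract('(' + pat + ')', expand=False, flags=re.I) + " Grand Prix"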
print (df)
    driverRef  driver_no  milliseconds                 name  \
151    button         22     1339994.0  JAPANESE Grand Prix   
152    button         22      692245.0  JAPANESE Grand Prix   
153    button         22       96286.0  JAPANESE Grand Prix   
154    button         22       94548.0  JAPANESE Grand Prix   
155    button         22      114725.0  JAPANESE Grand Prix   

                                  race      time  
151  2016_Lap_JAPANESE_Third_Times.csv  1339.994  
152  2016_Lap_JAPANESE_Third_Times.csv   692.245  
153  2016_Lap_JAPANESE_Third_Times.csv    96.286  
154  2016_Lap_JAPANESE_Third_Times.csv    94.548  
155  2016_Lap_JAPANESE_Third_Times.csv   114.725  
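
As a rough sketch of how the same idea might look with more than one name, the example below uses a small made-up DataFrame and a two-item race_names list (both are assumptions for illustration, not the asker's real data). It also title-cases the match so the result does not depend on how the file name was capitalised.

```python
import re
import pandas as pd

# Hypothetical sample data and name list, for illustration only.
df = pd.DataFrame({'race': ['2016_Lap_JAPANESE_Third_Times.csv',
                            'Mexico_2016_Lap_Third_Times.csv',
                            '2016_Lap_Third_Times.csv']})
race_names = ['japanese', 'mexico']

# One capture group containing every name as an alternative.
pat = '(' + '|'.join(re.escape(x) for x in race_names) + ')'

# Case-insensitive extraction; rows with no match yield a missing value.
matched = df['race'].str.extract(pat, flags=re.I, expand=False)

# Normalise the capitalisation, then append the suffix.
# Missing values stay missing through both steps.
df['name'] = matched.str.title() + ' Grand Prix'
print(df)
```

Rows whose file name contains none of the listed words keep a missing value in the new column, which makes them easy to find afterwards with df['name'].isna().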

Alternatively, it may be possible to use replace and str.strip:

import pandas as pd

df = pd.DataFrame({'race':['2016_Lap_ABU_Third_Times.csv', '2016_Lap_BRASIL_Third_Times.csv',
       '2016_Lap_CHINESE_Third_Times.csv',
       '2016_Lap_JAPANESE_Third_Times.csv',
       '2016_Lap_MAGYAR_Third_Times.csv',
       '2016_Lap_SINGAPORE_Third_Times.csv', '2016_Lap_Third_Times.csv',
       '2016_Lap_UNITED_Third_Times.csv',
       'AUSTRALIAN_2016_Lap_Third_Times.csv',
       'BAHRAIN_2016_Lap_Third_Times.csv',
       'BELGIAN_2016_Lap_Third_Times.csv',
       'CANADA_2016_Lap_Third_Times.csv',
       'ESPANA_2016_Lap_Third_Times.csv',
       'EUROPE_2016_Lap_Third_Times.csv',
       'MALAYSIA_2016_Lap_Third_Times.csv',
       'Mexico_2016_Lap_Third_Times.csv',
       'RUSSIAN_2016_Lap_Third_Times.csv']})

# remove the fixed parts and the digits, then strip the leftover underscores
df['name'] = (df['race'].replace([r'_Third_Times\.csv', 'Lap', r'\d+'], '', regex=True)
                        .str.strip('_'))
print (df)
                                   race        name
0          2016_Lap_ABU_Third_Times.csv         ABU
1       2016_Lap_BRASIL_Third_Times.csv      BRASIL
2      2016_Lap_CHINESE_Third_Times.csv     CHINESE
3     2016_Lap_JAPANESE_Third_Times.csv    JAPANESE
4       2016_Lap_MAGYAR_Third_Times.csv      MAGYAR
5    2016_Lap_SINGAPORE_Third_Times.csv   SINGAPORE
6              2016_Lap_Third_Times.csv            
7       2016_Lap_UNITED_Third_Times.csv      UNITED
8   AUSTRALIAN_2016_Lap_Third_Times.csv  AUSTRALIAN
9      BAHRAIN_2016_Lap_Third_Times.csv     BAHRAIN
10     BELGIAN_2016_Lap_Third_Times.csv     BELGIAN
11      CANADA_2016_Lap_Third_Times.csv      CANADA
12      ESPANA_2016_Lap_Third_Times.csv      ESPANA
13      EUROPE_2016_Lap_Third_Times.csv      EUROPE
14    MALAYSIA_2016_Lap_Third_Times.csv    MALAYSIA
15      Mexico_2016_Lap_Third_Times.csv      Mexico
16     RUSSIAN_2016_Lap_Third_Times.csv     RUSSIAN
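
Another way to get the same column, shown here only as a sketch, is to split each file name on underscores and keep the first token that is neither a number nor one of the fixed words. The FILLER set and the country_token helper are hypothetical names introduced for this example, and the approach assumes the fixed words are always "Lap", "Third" and "Times".

```python
import pandas as pd

# Assumed fixed words that appear in every file name (hypothetical constant).
FILLER = {'lap', 'third', 'times'}

def country_token(filename):
    """Return the first token that is not a number or a filler word, else None."""
    stem = filename.rsplit('.', 1)[0]          # drop the extension
    for token in stem.split('_'):
        if not token.isdigit() and token.lower() not in FILLER:
            return token
    return None

# Small sample taken from the unique values listed in the question.
df = pd.DataFrame({'race': ['2016_Lap_JAPANESE_Third_Times.csv',
                            'AUSTRALIAN_2016_Lap_Third_Times.csv',
                            '2016_Lap_Third_Times.csv']})
df['name'] = df['race'].map(country_token)
print(df)
```

Unlike the replace version, a file name with no country part ends up as a missing value rather than an empty string, which may be easier to filter on.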
