
Aggregate and Convert categorical data to numbers

I have a data frame df_train which has a column sub_division.

The values in the column look like this:

ABC_commercial,
ABC_Private,
Test ROM DIV,
ROM DIV,
TEST SEC R&OM

I am trying to:
1. convert anything that starts with ABC to a number (for example, 1)
2. convert anything that contains ROM or R&OM to a number (for example, 2)

Thanks in advance.

Expected result:

1,
1,
2,
2,
2

Use numpy.select with Series.str.startswith and Series.str.contains:

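import numpy as np

# one boolean mask per rule; the first matching mask wins
m1 = df['col'].str.startswith('ABC')
m2 = df['col'].str.contains('ROM|R&OM')

# note: integer choices combined with a string default give strings on older
# NumPy and may raise a TypeError on newer releases
df['new'] = np.select([m1, m2], [1, 2], default='no match')
# use a numeric default if every value should be a number
#df['new'] = np.select([m1, m2], [1, 2], default=0)
print(df)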
               col new
0  ABC_commercial,   1
1     ABC_Private,   1
2    Test ROM DIV,   2
3         ROM DIV,   2
4    TEST SEC R&OM   2
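
For reference, here is a minimal self-contained sketch of the same numpy.select approach using the names from the question (df_train and sub_division). The sample values come from the question; the extra "Other" row, the output column name code, and the numeric default of 0 are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Sample values from the question, plus one hypothetical non-matching row
df_train = pd.DataFrame({
    "sub_division": [
        "ABC_commercial",
        "ABC_Private",
        "Test ROM DIV",
        "ROM DIV",
        "TEST SEC R&OM",
        "Other",  # hypothetical value, not in the question
    ]
})

# na=False treats missing values as "no match" instead of propagating NaN
starts_with_abc = df_train["sub_division"].str.startswith("ABC", na=False)
contains_rom = df_train["sub_division"].str.contains("ROM|R&OM", na=False)

# Conditions are checked in order; rows matching neither get the default 0
df_train["code"] = np.select([starts_with_abc, contains_rom], [1, 2], default=0)

print(df_train["code"].tolist())  # [1, 1, 2, 2, 2, 0]
```

Keeping the default numeric keeps the whole column as integers, which is usually what a later model or aggregation step expects.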

You can do something like the following. Note that you will get NaN wherever there is no match; add an else branch to the converter function to return a default value instead.

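def converter(v):
    # rule 1: value starts with 'ABC'
    if v.startswith('ABC'):
        return 1
    # rule 2: value contains 'ROM' or 'R&OM'
    elif any(i in v for i in ['ROM', 'R&OM']):
        return 2
    # no match: falls through and returns None (shown as NaN)

df['sub_division'] = df['sub_division'].apply(converter)
print(df.head(10))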

Output:

   sub_division
0             1
1             1
2             2
3             2
4             2
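
The else case mentioned in this answer could look roughly like the sketch below. The default parameter, the non-string guard, and the sample values "Other" and None are assumptions added for illustration; they are not part of the original answer.

```python
import pandas as pd

def converter(value, default=0):
    """Map a sub_division label to a numeric code, with a fallback."""
    # Guard against missing values, which have no .startswith method
    if not isinstance(value, str):
        return default
    if value.startswith("ABC"):
        return 1
    if "ROM" in value or "R&OM" in value:
        return 2
    # Explicit fallback instead of the implicit None / NaN
    return default

# "Other" and None are hypothetical inputs to exercise the fallback
labels = pd.Series(["ABC_Private", "ROM DIV", "TEST SEC R&OM", "Other", None])
codes = labels.apply(converter)

print(codes.tolist())  # [1, 2, 2, 0, 0]
```

Because every branch now returns an integer, the result stays an integer column rather than being converted to floats to hold NaN.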

You can use:

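# rows starting with 'ABC' become 1
df.loc[df['col'].str.startswith('ABC'), 'col'] = 1
# the column now mixes numbers and strings, so na=False is needed to treat
# the non-string entries as "no match"
df.loc[df['col'].str.contains(r'ROM|R&OM', na=False), 'col'] = 2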
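
Overwriting col in place leaves unmatched rows as their original strings. A possible variant, sketched below, writes the codes to a separate column instead. The column name code, the default of 0, and the sample rows "Other" and "ABC_ROM" are assumptions for illustration.

```python
import pandas as pd

# "Other" and "ABC_ROM" are hypothetical rows added to show the edge cases
df = pd.DataFrame({
    "col": ["ABC_commercial", "Test ROM DIV", "TEST SEC R&OM", "Other", "ABC_ROM"]
})

# Start from a numeric default so unmatched rows are still numbers
df["code"] = 0

# Apply the lower-priority rule first ...
df.loc[df["col"].str.contains(r"ROM|R&OM", na=False), "code"] = 2
# ... then the ABC rule, so it wins when a value matches both rules
df.loc[df["col"].str.startswith("ABC", na=False), "code"] = 1

print(df["code"].tolist())  # [1, 2, 2, 0, 1]
```

Since the source column is never modified, both masks are computed on plain strings and the original labels remain available for checking the mapping.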


 