Aggregate and Convert categorical data to numbers

Question

I have a data frame df_train which has a column sub_division.

The values in the column look like this:

ABC_commercial,
ABC_Private,
Test ROM DIV,
ROM DIV,
TEST SEC R&OM

I am trying to:

1. convert anything that starts with ABC to a number (for example, 1)
2. convert anything that contains ROM or R&OM to a number (for example, 2)

Thanks in advance.

Expected result:

1,
1,
2,
2,
2

Answer 1

Use numpy.select with Series.str.startswith and Series.str.contains:

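import numpy as np

# Boolean masks for the two rules
m1 = df['col'].str.startswith('ABC')
m2 = df['col'].str.contains('ROM|R&OM')

# The first matching condition wins; rows matching neither get the default
df['new'] = np.select([m1, m2], [1,2], default='no match')
# If all values need to be numeric, use a numeric default instead
# (recent NumPy versions may reject mixing integer choices with a string default)
#df['new'] = np.select([m1, m2], [1,2], default=0)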
print (df)
               col new
0  ABC_commercial,   1
1     ABC_Private,   1
2    Test ROM DIV,   2
3         ROM DIV,   2
4    TEST SEC R&OM   2
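
For reference, here is a minimal self-contained sketch of the same approach, run against the sample values from the question. The names df_train and sub_division come from the question; the output column sub_division_code, the numeric default of 0, and the na=False arguments are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question
df_train = pd.DataFrame({
    "sub_division": [
        "ABC_commercial",
        "ABC_Private",
        "Test ROM DIV",
        "ROM DIV",
        "TEST SEC R&OM",
    ]
})

# na=False treats missing values as "no match" so the masks stay boolean
starts_with_abc = df_train["sub_division"].str.startswith("ABC", na=False)
contains_rom = df_train["sub_division"].str.contains("ROM|R&OM", na=False)

# Conditions are checked in order; rows matching neither rule get 0
df_train["sub_division_code"] = np.select(
    [starts_with_abc, contains_rom], [1, 2], default=0
)
print(df_train["sub_division_code"].tolist())  # [1, 1, 2, 2, 2]
```

Keeping the default numeric means the new column stays an integer column, which is usually what you want before aggregating.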

Answer 2

You can do something like the following. Note that rows matching neither rule become NaN, because the function returns None for them. Add an else branch to the converter function to return a default value instead.

def converter(v):
    if v.startswith('ABC'):
        return 1
    elif any(i in v for i in ['ROM', 'R&OM']):
        return 2

df['sub_division'] = df['sub_division'].apply(converter)
print(df.head(10))

output:

   sub_division
0             1
1             1
2             2
3             2
4             2
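
As a rough sketch of the default-value variant mentioned above: the default parameter, the fallback value of 0, the isinstance guard for missing values, and the extra rows "XYZ_public" and None are all assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

def converter(value, default=0):
    # `default` is a hypothetical fallback for values that match neither rule
    if not isinstance(value, str):
        # Missing or non-string values cannot be searched, so use the fallback
        return default
    if value.startswith("ABC"):
        return 1
    if "ROM" in value or "R&OM" in value:
        return 2
    return default

df = pd.DataFrame({
    "sub_division": ["ABC_commercial", "Test ROM DIV", "TEST SEC R&OM", "XYZ_public", None]
})
df["sub_division"] = df["sub_division"].apply(converter)
print(df["sub_division"].tolist())  # [1, 2, 2, 0, 0]
```

Because every branch now returns an integer, the column no longer contains NaN and keeps an integer dtype.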

Answer 3

You can use:

df.loc[df['col'].str.startswith('ABC'), 'col'] = 1
df.loc[df['col'].str.contains(r'ROM|R&OM', na=False), 'col'] = 2
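
The two lines above overwrite col in place, so any value that matches neither rule stays a string next to the new integers. One possible variation, sketched here under the assumption that a separate output column (hypothetically named code) with a default of 0 is acceptable, computes both masks first and then assigns with df.loc:

```python
import pandas as pd

df = pd.DataFrame({
    "col": ["ABC_commercial", "ABC_Private", "Test ROM DIV", "ROM DIV", "TEST SEC R&OM"]
})

# Compute both masks from the untouched string column
is_abc = df["col"].str.startswith("ABC", na=False)
is_rom = df["col"].str.contains(r"ROM|R&OM", na=False)

# "code" is a hypothetical output column; 0 marks rows matching neither rule
df["code"] = 0
df.loc[is_rom, "code"] = 2
# Assigned last so the ABC rule takes precedence when both rules match
df.loc[is_abc, "code"] = 1
print(df["code"].tolist())  # [1, 1, 2, 2, 2]
```

The original strings remain available in col, which can help when checking which rule matched.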
