简体   繁体   中英

Create new columns in pandas data frame based on existing column

I have a Python dictionary as follows:

ref_dict = {
"Company1" :["C1_Dev1","C1_Dev2","C1_Dev3","C1_Dev4","C1_Dev5",],
"Company2" :["C2_Dev1","C2_Dev2","C2_Dev3","C2_Dev4","C2_Dev5",],
"Company3" :["C3_Dev1","C3_Dev2","C3_Dev3","C3_Dev4","C3_Dev5",],
 }

I have a Pandas data frame called df whose one of the columns looks like this:

    DESC_DETAIL
0   Probably task Company2 C2_Dev5
1   File system C3_Dev1
2   Weather subcutaneous Company2
3   Company1 Travesty C1_Dev3
4   Does not match anything 
...........

My goal is to add two extra columns to this data frame and name the columns, COMPANY and DEVICE . The value in each row of the COMPANY column will be either be the company key in the dictionary if it exists in the DESC_DETAIL column or if the corresponding device exists in the DESC_DETAIL column. The value in the DEVICE column will simply be the device string in the DESC_DETAIL column. If no match is found, the corresponding row is empty. Hence the final output will look like this:

     DESC_DETAIL                        COMPANY         DEVICE
 0   Probably task Company2 C2_Dev5     Company2        C2_Dev5
 1   File system C3_Dev1                Company3        C3_Dev1
 2   Weather subcutaneous Company2      Company2        NaN
 3   Company1 Travesty C1_Dev3          Company1        C1_Dev3
 4   Does not match anything            NaN             NaN

My attempt:

for key, value in ref_dict.items():
    df['COMPANY'] = df.apply(lambda row: key if row['DESC_DETAIL'].isin(key) else Nan, axis=1)

This is obviously just wrong and does not work. How do I make it work?

You can extract values with str.extract using a regex pattern:

import re

s = pd.Series(ref_dict).explode()

# extract company
df['COMPANY'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s.index.unique())})", flags=re.IGNORECASE)

# extract device
df['DEVICE'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s)})", flags=re.IGNORECASE)

# fill missing company values based on device
df['COMPANY'] = df['COMPANY'].fillna(
    df['DEVICE'].str.lower().map(dict(zip(s.str.lower(), s.index))))

df

Output:

                      DESC_DETAIL   COMPANY   DEVICE
0  Probably task Company2 C2_Dev5  Company2  C2_Dev5
1             File system C3_Dev1  Company3  C3_Dev1
2   Weather subcutaneous Company2  Company2      NaN
3       Company1 Travesty C1_Dev3  Company1  C1_Dev3
4         Does not match anything       NaN      NaN

You need a device to company dictionary as well and you can build it from the ref_dict easily as below:

dev_to_company_dict = {v:l[0] for l in zip(ref_dict.keys(), ref_dict.values()) for v in l[1]}

Then it is easy to do this:

df['COMPANY'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(ref_dict.keys())))
df['COMPANY'].replace('', np.nan, inplace=True)
df['DEVICE'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(dev_to_company_dict.keys())))
df['DEVICE'].replace('', np.nan, inplace=True)
df['COMPANY'] = df['COMPANY'].fillna(df['DEVICE'].map(dev_to_company_dict))

Output:

                       DESC_DETAIL   COMPANY     DEVICE
0   Probably task Company2 C2_Dev5  Company2    C2_Dev5
1   File system C3_Dev1             Company3    C3_Dev1
2   Weather subcutaneous Company2   Company2        NaN
3   Company1 Travesty C1_Dev3       Company1    C1_Dev3
4   Does not match anything              NaN        NaN

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM