I have a Python dictionary as follows:
ref_dict = {
"Company1" :["C1_Dev1","C1_Dev2","C1_Dev3","C1_Dev4","C1_Dev5",],
"Company2" :["C2_Dev1","C2_Dev2","C2_Dev3","C2_Dev4","C2_Dev5",],
"Company3" :["C3_Dev1","C3_Dev2","C3_Dev3","C3_Dev4","C3_Dev5",],
}
I have a Pandas data frame called df whose one of the columns looks like this:
DESC_DETAIL
0 Probably task Company2 C2_Dev5
1 File system C3_Dev1
2 Weather subcutaneous Company2
3 Company1 Travesty C1_Dev3
4 Does not match anything
...........
My goal is to add two extra columns to this data frame and name the columns, COMPANY and DEVICE . The value in each row of the COMPANY column will be either be the company key in the dictionary if it exists in the DESC_DETAIL column or if the corresponding device exists in the DESC_DETAIL column. The value in the DEVICE column will simply be the device string in the DESC_DETAIL column. If no match is found, the corresponding row is empty. Hence the final output will look like this:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
My attempt:
for key, value in ref_dict.items():
df['COMPANY'] = df.apply(lambda row: key if row['DESC_DETAIL'].isin(key) else Nan, axis=1)
This is obviously just wrong and does not work. How do I make it work?
You can extract values with str.extract
using a regex pattern:
import re
s = pd.Series(ref_dict).explode()
# extract company
df['COMPANY'] = df['DESC_DETAIL'].str.extract(
f"({'|'.join(s.index.unique())})", flags=re.IGNORECASE)
# extract device
df['DEVICE'] = df['DESC_DETAIL'].str.extract(
f"({'|'.join(s)})", flags=re.IGNORECASE)
# fill missing company values based on device
df['COMPANY'] = df['COMPANY'].fillna(
df['DEVICE'].str.lower().map(dict(zip(s.str.lower(), s.index))))
df
Output:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
You need a device to company dictionary as well and you can build it from the ref_dict
easily as below:
dev_to_company_dict = {v:l[0] for l in zip(ref_dict.keys(), ref_dict.values()) for v in l[1]}
Then it is easy to do this:
df['COMPANY'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(ref_dict.keys())))
df['COMPANY'].replace('', np.nan, inplace=True)
df['DEVICE'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(dev_to_company_dict.keys())))
df['DEVICE'].replace('', np.nan, inplace=True)
df['COMPANY'] = df['COMPANY'].fillna(df['DEVICE'].map(dev_to_company_dict))
Output:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.