Create new columns in pandas data frame based on existing column
I have a Python dictionary as follows:
ref_dict = {
    "Company1": ["C1_Dev1", "C1_Dev2", "C1_Dev3", "C1_Dev4", "C1_Dev5"],
    "Company2": ["C2_Dev1", "C2_Dev2", "C2_Dev3", "C2_Dev4", "C2_Dev5"],
    "Company3": ["C3_Dev1", "C3_Dev2", "C3_Dev3", "C3_Dev4", "C3_Dev5"],
}
I have a pandas data frame called df, one of whose columns looks like this:
DESC_DETAIL
0 Probably task Company2 C2_Dev5
1 File system C3_Dev1
2 Weather subcutaneous Company2
3 Company1 Travesty C1_Dev3
4 Does not match anything
...........
My goal is to add two extra columns to this data frame, named COMPANY and DEVICE. The value in each row of the COMPANY column should be the company key from the dictionary if either that key or one of its devices appears in the DESC_DETAIL column. The value in the DEVICE column is simply the device string found in the DESC_DETAIL column. If no match is found, the corresponding cell is empty. Hence the final output will look like this:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
My attempt:
for key, value in ref_dict.items():
    df['COMPANY'] = df.apply(lambda row: key if row['DESC_DETAIL'].isin(key) else Nan, axis=1)
This is obviously wrong and does not work. How do I make it work?
You can extract the values with str.extract and a regex pattern:
import re
import pandas as pd

# one row per device, indexed by company name
s = pd.Series(ref_dict).explode()

# extract company
df['COMPANY'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s.index.unique())})", flags=re.IGNORECASE)

# extract device
df['DEVICE'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s)})", flags=re.IGNORECASE)

# fill missing company values based on device
df['COMPANY'] = df['COMPANY'].fillna(
    df['DEVICE'].str.lower().map(dict(zip(s.str.lower(), s.index))))

df
Output:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
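One caveat with joining the raw strings into a pattern: a name containing regex metacharacters, or one name being a prefix of another (say a made-up C1_Dev1 next to C1_Dev10), could produce wrong matches. Below is a minimal sketch of a more defensive variant, assuming the sample data from the question. The helper build_pattern and the name device_to_company are illustrative and not part of pandas or of the answer above.

```python
import re
import pandas as pd

ref_dict = {
    "Company1": ["C1_Dev1", "C1_Dev2", "C1_Dev3", "C1_Dev4", "C1_Dev5"],
    "Company2": ["C2_Dev1", "C2_Dev2", "C2_Dev3", "C2_Dev4", "C2_Dev5"],
    "Company3": ["C3_Dev1", "C3_Dev2", "C3_Dev3", "C3_Dev4", "C3_Dev5"],
}

# Sample frame copied from the question.
df = pd.DataFrame({"DESC_DETAIL": [
    "Probably task Company2 C2_Dev5",
    "File system C3_Dev1",
    "Weather subcutaneous Company2",
    "Company1 Travesty C1_Dev3",
    "Does not match anything",
]})


def build_pattern(names):
    """Hypothetical helper: one capture group matching any name as a whole word."""
    # Longest names first, so a longer name is tried before a shorter prefix of it.
    ordered = sorted(names, key=len, reverse=True)
    return r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b"


# Reverse lookup: lower-cased device name -> company.
device_to_company = {
    dev.lower(): company for company, devs in ref_dict.items() for dev in devs
}

# expand=False returns a Series for a single capture group.
df["COMPANY"] = df["DESC_DETAIL"].str.extract(
    build_pattern(ref_dict), flags=re.IGNORECASE, expand=False)
df["DEVICE"] = df["DESC_DETAIL"].str.extract(
    build_pattern(device_to_company), flags=re.IGNORECASE, expand=False)

# Where no company name was found, fall back to the matched device's company.
df["COMPANY"] = df["COMPANY"].fillna(
    df["DEVICE"].str.lower().map(device_to_company))

print(df)
```

Escaping with re.escape and adding word boundaries only changes behaviour for inputs that would otherwise be ambiguous, so on the question's data the result should be the same as above.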
You also need a device-to-company dictionary, which you can easily build from ref_dict:
dev_to_company_dict = {dev: company for company, devs in ref_dict.items() for dev in devs}
Then it is easy to do this:
import re
import numpy as np
# keep the whitespace-separated tokens that are company (or device) names
df['COMPANY'] = df['DESC_DETAIL'].apply(lambda det: ''.join(set(re.split(r"\s+", det)).intersection(ref_dict.keys())))
df['COMPANY'] = df['COMPANY'].replace('', np.nan)
df['DEVICE'] = df['DESC_DETAIL'].apply(lambda det: ''.join(set(re.split(r"\s+", det)).intersection(dev_to_company_dict.keys())))
df['DEVICE'] = df['DEVICE'].replace('', np.nan)
# fill missing companies from the matched device
df['COMPANY'] = df['COMPANY'].fillna(df['DEVICE'].map(dev_to_company_dict))
Output:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
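Note that ''.join over a set concatenates every hit, so a description that names two companies would produce something like 'Company1Company3', in arbitrary order. A minimal sketch of a variant that keeps only the first matching token is shown below. It assumes the question's sample data plus one made-up row, and the helper first_match is illustrative. Like the approach above, it is case-sensitive and matches whole whitespace-separated tokens only.

```python
import numpy as np
import pandas as pd

ref_dict = {
    "Company1": ["C1_Dev1", "C1_Dev2", "C1_Dev3", "C1_Dev4", "C1_Dev5"],
    "Company2": ["C2_Dev1", "C2_Dev2", "C2_Dev3", "C2_Dev4", "C2_Dev5"],
    "Company3": ["C3_Dev1", "C3_Dev2", "C3_Dev3", "C3_Dev4", "C3_Dev5"],
}

# Reverse lookup: device -> company.
dev_to_company = {
    dev: company for company, devs in ref_dict.items() for dev in devs
}


def first_match(text, vocabulary):
    """Hypothetical helper: first whitespace-separated token found in vocabulary, else NaN."""
    return next((tok for tok in text.split() if tok in vocabulary), np.nan)


df = pd.DataFrame({"DESC_DETAIL": [
    "Probably task Company2 C2_Dev5",
    "File system C3_Dev1",
    "Weather subcutaneous Company2",
    "Company1 Travesty C1_Dev3",
    "Does not match anything",
    "Company1 versus Company3",  # made-up row naming two companies
]})

# Membership tests on a dict check its keys, so the dicts can be passed directly.
df["COMPANY"] = df["DESC_DETAIL"].apply(first_match, args=(ref_dict,))
df["DEVICE"] = df["DESC_DETAIL"].apply(first_match, args=(dev_to_company,))

# Where no company token was found, fall back to the matched device's company.
df["COMPANY"] = df["COMPANY"].fillna(df["DEVICE"].map(dev_to_company))

print(df)
```

Taking the first token is only one possible policy; returning a list of all matches would be another, depending on how multi-company rows should be treated.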