
Create new columns in pandas data frame based on existing column

I have a Python dictionary as follows:

ref_dict = {
"Company1" :["C1_Dev1","C1_Dev2","C1_Dev3","C1_Dev4","C1_Dev5",],
"Company2" :["C2_Dev1","C2_Dev2","C2_Dev3","C2_Dev4","C2_Dev5",],
"Company3" :["C3_Dev1","C3_Dev2","C3_Dev3","C3_Dev4","C3_Dev5",],
 }

I have a Pandas data frame called df, one of whose columns looks like this:

    DESC_DETAIL
0   Probably task Company2 C2_Dev5
1   File system C3_Dev1
2   Weather subcutaneous Company2
3   Company1 Travesty C1_Dev3
4   Does not match anything 
...........

My goal is to add two extra columns to this data frame, named COMPANY and DEVICE. The value in each row of the COMPANY column will be the company key from the dictionary if either that key or one of its devices appears in the DESC_DETAIL column. The value in the DEVICE column will simply be the device string found in the DESC_DETAIL column. If no match is found, the corresponding cell is empty. Hence the final output will look like this:

     DESC_DETAIL                        COMPANY         DEVICE
 0   Probably task Company2 C2_Dev5     Company2        C2_Dev5
 1   File system C3_Dev1                Company3        C3_Dev1
 2   Weather subcutaneous Company2      Company2        NaN
 3   Company1 Travesty C1_Dev3          Company1        C1_Dev3
 4   Does not match anything            NaN             NaN

My attempt:

for key, value in ref_dict.items():
    df['COMPANY'] = df.apply(lambda row: key if row['DESC_DETAIL'].isin(key) else Nan, axis=1)

This is obviously just wrong and does not work. How do I make it work?
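For reference, the attempt fails for a few reasons: `row['DESC_DETAIL']` is a plain string, which has no `.isin` method; `Nan` is not a defined name; and the loop reassigns `COMPANY` on every iteration, so earlier keys would be overwritten. A minimal row-wise sketch that avoids these problems is shown below. The helpers `find_company` and `find_device` are hypothetical names introduced here for illustration, and the sample frame contains only the five rows shown in the question.

```python
import numpy as np
import pandas as pd

# Same ref_dict as in the question, built programmatically for brevity
ref_dict = {f"Company{i}": [f"C{i}_Dev{j}" for j in range(1, 6)] for i in range(1, 4)}

df = pd.DataFrame({"DESC_DETAIL": [
    "Probably task Company2 C2_Dev5",
    "File system C3_Dev1",
    "Weather subcutaneous Company2",
    "Company1 Travesty C1_Dev3",
    "Does not match anything",
]})

def find_company(text):
    """Hypothetical helper: first company whose name or device occurs in text."""
    for company, devices in ref_dict.items():
        if company in text or any(dev in text for dev in devices):
            return company
    return np.nan

def find_device(text):
    """Hypothetical helper: first device name that occurs in text."""
    for devices in ref_dict.values():
        for dev in devices:
            if dev in text:
                return dev
    return np.nan

df["COMPANY"] = df["DESC_DETAIL"].apply(find_company)
df["DEVICE"] = df["DESC_DETAIL"].apply(find_device)
```

This uses plain substring checks, so it is case-sensitive and will be slow on large frames compared with a vectorized approach.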

You can extract values with str.extract using a regex pattern:

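import re

import pandas as pd

# one row per (company, device) pair: the index holds the company, the value the device
s = pd.Series(ref_dict).explode()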

# extract company
df['COMPANY'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s.index.unique())})", flags=re.IGNORECASE)

# extract device
df['DEVICE'] = df['DESC_DETAIL'].str.extract(
    f"({'|'.join(s)})", flags=re.IGNORECASE)

# fill missing company values based on device
df['COMPANY'] = df['COMPANY'].fillna(
    df['DEVICE'].str.lower().map(dict(zip(s.str.lower(), s.index))))

df

Output:

                      DESC_DETAIL   COMPANY   DEVICE
0  Probably task Company2 C2_Dev5  Company2  C2_Dev5
1             File system C3_Dev1  Company3  C3_Dev1
2   Weather subcutaneous Company2  Company2      NaN
3       Company1 Travesty C1_Dev3  Company1  C1_Dev3
4         Does not match anything       NaN      NaN
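A self-contained sketch of the same regex approach, run against the five sample rows from the question. Two details here are assumptions that go beyond the answer as posted: each name is passed through `re.escape` in case real company or device names contain regex metacharacters, and `expand=False` is used so that `str.extract` returns a Series rather than a one-column DataFrame. The names `company_pattern`, `device_pattern` and `device_to_company` are introduced only for illustration.

```python
import re

import pandas as pd

# Same ref_dict as in the question, built programmatically for brevity
ref_dict = {f"Company{i}": [f"C{i}_Dev{j}" for j in range(1, 6)] for i in range(1, 4)}

df = pd.DataFrame({"DESC_DETAIL": [
    "Probably task Company2 C2_Dev5",
    "File system C3_Dev1",
    "Weather subcutaneous Company2",
    "Company1 Travesty C1_Dev3",
    "Does not match anything",
]})

# One row per (company, device) pair: index = company, value = device
s = pd.Series(ref_dict).explode()

# Alternation patterns with a single capture group each (assumed escaping)
company_pattern = "(" + "|".join(re.escape(c) for c in s.index.unique()) + ")"
device_pattern = "(" + "|".join(re.escape(d) for d in s) + ")"

df["COMPANY"] = df["DESC_DETAIL"].str.extract(
    company_pattern, flags=re.IGNORECASE, expand=False)
df["DEVICE"] = df["DESC_DETAIL"].str.extract(
    device_pattern, flags=re.IGNORECASE, expand=False)

# Lower-cased device -> company lookup, used where no company name was found
device_to_company = dict(zip(s.str.lower(), s.index))
df["COMPANY"] = df["COMPANY"].fillna(
    df["DEVICE"].str.lower().map(device_to_company))
```

Note that `str.extract` returns the text as it appears in DESC_DETAIL, so with `re.IGNORECASE` a description containing `company2` would put `company2` in COMPANY rather than the dictionary key `Company2`.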

You need a device-to-company dictionary as well, and you can build it from the ref_dict easily as below:

dev_to_company_dict = {dev: company for company, devs in ref_dict.items() for dev in devs}
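To make the shape of that mapping concrete, here is a small sketch that inverts the question's `ref_dict` and looks up one device. It assumes each device name belongs to exactly one company; if a name were repeated, the later company would silently win.

```python
# Same ref_dict as in the question, built programmatically for brevity
ref_dict = {f"Company{i}": [f"C{i}_Dev{j}" for j in range(1, 6)] for i in range(1, 4)}

# Invert {company: [devices]} into {device: company}
dev_to_company_dict = {
    dev: company for company, devs in ref_dict.items() for dev in devs
}

print(dev_to_company_dict["C2_Dev5"])  # Company2
print(len(dev_to_company_dict))        # 15 (3 companies x 5 devices)
```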

Then it is easy to do this:

import re
import numpy as np
# split each description on whitespace and keep the tokens that are company names
df['COMPANY'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(ref_dict.keys())))
df['COMPANY'] = df['COMPANY'].replace('', np.nan)
df['DEVICE'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(dev_to_company_dict.keys())))
df['DEVICE'] = df['DEVICE'].replace('', np.nan)
# fall back to the matched device's company where no company name was found
df['COMPANY'] = df['COMPANY'].fillna(df['DEVICE'].map(dev_to_company_dict))

Output:

                       DESC_DETAIL   COMPANY     DEVICE
0   Probably task Company2 C2_Dev5  Company2    C2_Dev5
1   File system C3_Dev1             Company3    C3_Dev1
2   Weather subcutaneous Company2   Company2        NaN
3   Company1 Travesty C1_Dev3       Company1    C1_Dev3
4   Does not match anything              NaN        NaN
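One caveat with the token-set approach: if a description mentions two companies (or two devices), `''.join` concatenates both names in whatever order the set yields them. A possible variation, sketched below, keeps only the first matching token in reading order. The helper `first_match` and the two-row frame are hypothetical and exist only to illustrate that case.

```python
import numpy as np
import pandas as pd

# Same ref_dict as in the question, built programmatically for brevity
ref_dict = {f"Company{i}": [f"C{i}_Dev{j}" for j in range(1, 6)] for i in range(1, 4)}
dev_to_company_dict = {
    dev: company for company, devs in ref_dict.items() for dev in devs
}

def first_match(text, candidates):
    """Hypothetical helper: first whitespace-separated token found in candidates."""
    for token in text.split():
        if token in candidates:
            return token
    return np.nan

# Illustrative rows (not from the question): the first mentions two companies
df = pd.DataFrame({"DESC_DETAIL": [
    "Company1 and Company3 share C3_Dev2",
    "Does not match anything",
]})

companies = set(ref_dict)
devices = set(dev_to_company_dict)

df["COMPANY"] = df["DESC_DETAIL"].apply(lambda det: first_match(det, companies))
df["DEVICE"] = df["DESC_DETAIL"].apply(lambda det: first_match(det, devices))

# Fall back to the matched device's company where no company name was found
df["COMPANY"] = df["COMPANY"].fillna(df["DEVICE"].map(dev_to_company_dict))
```

Like the answer above, this matches whole tokens only and is case-sensitive, so `company1` or `Company1,` would not match.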
