How to create a new column with values from comparing two other columns?

I am working on a project that will audit employees with computer accounts. I want to print one data frame with the two new columns in it. This is different from the Comparing Columns in Dataframes question because I am working with strings. I will also need to do some fuzzy matching, but that is further down the line.

The data I receive is in Excel sheets. It comes from two sources that I don't control, so I format the names as [First Name, Last Name] and print them to the console to make sure the data I am working with is correct. I convert the .xls files to .csv, format the information, and can output the two lists of names in a single dataframe with two columns, but I have not been able to put the values I want in the last two columns. I have tried query (which returned True/False, not the names), diff and regex. I assume that I am just using the tools incorrectly.

    import pandas as pd

    nd = {'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
                   "Amelia H. Hayden", "Abraham Oliver"],
          'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
                   "DJ McMahon", "Amelia H. Hayden"]}
    info = pd.DataFrame(data=nd)

    for row in info:
        if info.col1.value not in info.col2:
            info["Need Account"] = info.col1.value

        if info.col2.value not in info.col1:
            info["Delete Account"] = info.col2.value

    print(info)

What I would like is a new dataframe with two columns, Need Account and Delete Account, filled with the appropriate values based on the other columns in the dataframe. In this case, I am getting an error that 'Series' has no attribute 'value'. Here is an example of my expected output:

    df_out: 
    Need Account       Delete Account
    Demetrius McMahon  Abe Oliver
    Abraham Oliver     Hillary Emerson
    Hilary Emerson     DJ McMahon

From this list I can see whose nickname showed up and pare the list down from there.

I'm taking a chance without seeing your expected output, but going by what you are attempting in your code. Let me know if this is what you are looking for.

import pandas as pd

nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"], 
      'Need Account':"", 
      'Delete Account':""
     }
info = pd.DataFrame(data=nd)

print(info)

               col1              col2 Need Account Delete Account
0     Abraham Hansen    Abraham Hansen                            
1  Demetrius McMahon        Abe Oliver                            
2     Hilary Emerson   Hillary Emerson                            
3   Amelia H. Hayden        DJ McMahon                            
4     Abraham Oliver  Amelia H. Hayden    

Don't use loops; use vectorized operations:

info.loc[info['col1'] != info['col2'], 'Need Account'] = info['col1']
info.loc[info['col2'] != info['col1'], 'Delete Account'] = info['col2']

print(info)

               col1              col2       Need Account    Delete Account
0     Abraham Hansen    Abraham Hansen                                     
1  Demetrius McMahon        Abe Oliver  Demetrius McMahon        Abe Oliver
2     Hilary Emerson   Hillary Emerson     Hilary Emerson   Hillary Emerson
3   Amelia H. Hayden        DJ McMahon   Amelia H. Hayden        DJ McMahon
4     Abraham Oliver  Amelia H. Hayden     Abraham Oliver  Amelia H. Hayden
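
As a side note on why the loop in the question fails, here is a minimal sketch (the names `labels` and `has_value_attr` are only illustrative): iterating over a DataFrame yields its column labels rather than its rows, and a Series has a `.values` attribute but no `.value`.

```python
import pandas as pd

info = pd.DataFrame({
    'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
             "Amelia H. Hayden", "Abraham Oliver"],
    'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
             "DJ McMahon", "Amelia H. Hayden"],
})

# "for row in info" walks over the column labels, not the rows.
labels = [label for label in info]
print(labels)  # ['col1', 'col2']

# A Series exposes .values (plural); .value does not exist,
# which is what triggers the AttributeError in the question.
has_value_attr = hasattr(info['col1'], 'value')
print(has_value_attr)  # False
```

So even with the attribute name corrected, the loop body would run once per column rather than once per name, which is why a vectorized comparison is the easier route here.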

You want to use isin and np.where to conditionally assign the new values:

import numpy as np

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
info['Need Account'] = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
info['Delete Account'] = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)

                col1              col2       Need Account   Delete Account
0     Abraham Hansen    Abraham Hansen                NaN              NaN
1  Demetrius McMahon        Abe Oliver  Demetrius McMahon       Abe Oliver
2     Hilary Emerson   Hillary Emerson     Hilary Emerson  Hillary Emerson
3   Amelia H. Hayden        DJ McMahon                NaN       DJ McMahon
4     Abraham Oliver  Amelia H. Hayden     Abraham Oliver              NaN

Or, if you want a new dataframe as you stated in your question:

need = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
delete = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)

newdf = pd.DataFrame({'Need Account':need,
                      'Delete Account':delete})

        Need Account   Delete Account
0                NaN              NaN
1  Demetrius McMahon       Abe Oliver
2     Hilary Emerson  Hillary Emerson
3                NaN       DJ McMahon
4     Abraham Oliver              NaN
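
If the NaN rows are unwanted, one possible variation is to filter each column with the same isin mask and line the results up side by side, which gives something closer to the compact table in the question. This is only a sketch under that assumption; `df_out` is borrowed from the question's expected output, and the row order follows the input rather than the order shown there.

```python
import pandas as pd

info = pd.DataFrame({
    'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson",
             "Amelia H. Hayden", "Abraham Oliver"],
    'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson",
             "DJ McMahon", "Amelia H. Hayden"],
})

# Keep only the names that are missing from the other column,
# then renumber so both results start at index 0.
need = info.loc[~info['col1'].isin(info['col2']), 'col1'].reset_index(drop=True)
delete = info.loc[~info['col2'].isin(info['col1']), 'col2'].reset_index(drop=True)

# concat aligns on the index; if the lengths differ, the shorter
# column is padded with NaN at the bottom.
df_out = pd.concat([need.rename('Need Account'),
                    delete.rename('Delete Account')], axis=1)
print(df_out)
```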

IIUC, there doesn't seem to be much 'structure' to maintain from your input dataframe, so you could use sets to compare membership in the groups directly.

import pandas as pd

nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
df = pd.DataFrame(data=nd)

col1 = set(df['col1'])
col2 = set(df['col2'])

need = col1 - col2
delete = col2 - col1

print('need = ', need)
print('delete =  ', delete)

yields

need =  {'Hilary Emerson', 'Demetrius McMahon', 'Abraham Oliver'}
delete =   {'Hillary Emerson', 'DJ McMahon', 'Abe Oliver'}

You could then place these in a new dataframe:

data = {'need':list(need), 'delete':list(delete)}
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()

(Edited to account for the possibility that need and delete are of unequal length.)
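
For the fuzzy comparison the question mentions as a later step, the standard library's difflib might be a reasonable starting point. The sketch below is not part of any answer above: the `CUTOFF` value of 0.8 and the `likely_same` name are assumptions chosen for illustration, and the two lists are written out by hand so the result does not depend on set ordering.

```python
import difflib

need = ['Demetrius McMahon', 'Hilary Emerson', 'Abraham Oliver']
delete = ['Abe Oliver', 'Hillary Emerson', 'DJ McMahon']

CUTOFF = 0.8  # assumed similarity threshold; tune it for your data

# For each name that seems to need an account, look for one
# near-identical spelling among the names flagged for deletion.
likely_same = {}
for name in need:
    matches = difflib.get_close_matches(name, delete, n=1, cutoff=CUTOFF)
    if matches:
        likely_same[name] = matches[0]

print(likely_same)  # {'Hilary Emerson': 'Hillary Emerson'}
```

At this threshold only the spelling variant is paired up. Nicknames such as Abe or DJ score lower and would still need a lower cutoff or a manual review.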
