How to create a new column with values from comparing two other columns?
I am working on a project that will perform an audit of employees with computer accounts. I want to print one data frame with the two new columns in it. This is different from the Comparing Columns in Dataframes question because I am working with strings. I will also need to do some fuzzy logic, but that is further down the line.

The data I receive is in Excel sheets. It comes from two sources that I don't have control over, so I format them to be [First Name, Last Name] and print them to the console to ensure the data I am working with is correct. I convert the .xls files to .csv, format the information, and am able to output the two lists of names in a single dataframe with two columns, but I have not been able to put the values I want in the last two columns. I have used query (which returned True/False, not the names), diff and regex. I assume that I am just using the tools incorrectly.
import pandas as pd
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
      'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)
for row in info:
    if info.col1.value not in info.col2:
        info["Need Account"] = info.col1.value
    if info.col2.value not in info.col1:
        info["Delete Account"] = info.col2.value
print(info)
What I would like is a new dataframe with 2 columns, Need Account and Delete Account, filled in with the appropriate values based on the other columns in the dataframe. In this case, I am getting an error that 'Series' has no attribute 'value'. Here is an example of my expected output:
df_out:
Need Account       Delete Account
Demetrius McMahon  Abe Oliver
Abraham Oliver     Hillary Emerson
Hilary Emerson     DJ McMahon
From this list I can look to see whose nickname showed up and pare the list down from there.
I'm taking a chance here, going by what you are attempting in your code rather than by your expected output. Let me know if this is what you are looking for:
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"],
'Need Account':"",
'Delete Account':""
}
info = pd.DataFrame(data=nd)
print(info)
   col1               col2              Need Account  Delete Account
0  Abraham Hansen     Abraham Hansen
1  Demetrius McMahon  Abe Oliver
2  Hilary Emerson     Hillary Emerson
3  Amelia H. Hayden   DJ McMahon
4  Abraham Oliver     Amelia H. Hayden
Don't use loops, use vectorized operations:
info.loc[info['col1'] != info['col2'], 'Need Account'] = info['col1']
info.loc[info['col2'] != info['col1'], 'Delete Account'] = info['col2']
print(info)
   col1               col2              Need Account       Delete Account
0  Abraham Hansen     Abraham Hansen
1  Demetrius McMahon  Abe Oliver        Demetrius McMahon  Abe Oliver
2  Hilary Emerson     Hillary Emerson   Hilary Emerson     Hillary Emerson
3  Amelia H. Hayden   DJ McMahon        Amelia H. Hayden   DJ McMahon
4  Abraham Oliver     Amelia H. Hayden  Abraham Oliver     Amelia H. Hayden
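Note that `!=` compares the two columns row by row, so a name that appears in both columns but on different rows (here Amelia H. Hayden) is still flagged. The following is a minimal sketch, assuming the same `nd` data as above, that contrasts the row-wise mask with a membership test using `Series.isin`. The names `rowwise`, `missing`, `rowwise_names` and `missing_names` are only illustrative.

```python
import pandas as pd

nd = {'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson", "Amelia H. Hayden", "Abraham Oliver"],
      'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson", "DJ McMahon", "Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)

# Row-wise: True wherever the two names on the same row differ.
rowwise = info['col1'] != info['col2']

# Membership: True wherever the col1 name appears nowhere in col2.
missing = ~info['col1'].isin(info['col2'])

rowwise_names = info.loc[rowwise, 'col1'].tolist()
missing_names = info.loc[missing, 'col1'].tolist()

print(rowwise_names)  # includes 'Amelia H. Hayden'
print(missing_names)  # does not include 'Amelia H. Hayden'
```

Which of the two masks is appropriate depends on whether the rows of the two sources are expected to line up; for an audit of two independent lists, the membership test is probably the closer match.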
You want to use isin and np.where to conditionally assign the new values:
import numpy as np
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
info['Need Account'] = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
info['Delete Account'] = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)
   col1               col2              Need Account       Delete Account
0  Abraham Hansen     Abraham Hansen    NaN                NaN
1  Demetrius McMahon  Abe Oliver        Demetrius McMahon  Abe Oliver
2  Hilary Emerson     Hillary Emerson   Hilary Emerson     Hillary Emerson
3  Amelia H. Hayden   DJ McMahon        NaN                DJ McMahon
4  Abraham Oliver     Amelia H. Hayden  Abraham Oliver     NaN
Or, if you want a new dataframe as you stated in your question:
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
need = np.where(~info['col1'].isin(info['col2']), info['col1'], np.nan)
delete = np.where(~info['col2'].isin(info['col1']), info['col2'], np.nan)
newdf = pd.DataFrame({'Need Account': need,
                      'Delete Account': delete})
   Need Account       Delete Account
0  NaN                NaN
1  Demetrius McMahon  Abe Oliver
2  Hilary Emerson     Hillary Emerson
3  NaN                DJ McMahon
4  Abraham Oliver     NaN
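The expected output in the question has no NaN rows, only the unmatched names packed together. One way to get that shape, sketched here under the assumption that row position does not matter, is to filter each column with the `isin` mask, reset the index, and then join the two results side by side. The name `df_out` is borrowed from the question; everything else is illustrative.

```python
import pandas as pd

nd = {'col1': ["Abraham Hansen", "Demetrius McMahon", "Hilary Emerson", "Amelia H. Hayden", "Abraham Oliver"],
      'col2': ["Abraham Hansen", "Abe Oliver", "Hillary Emerson", "DJ McMahon", "Amelia H. Hayden"]}
info = pd.DataFrame(data=nd)

# Keep only names that are absent from the other column, then renumber from 0.
need = info.loc[~info['col1'].isin(info['col2']), 'col1'].reset_index(drop=True)
delete = info.loc[~info['col2'].isin(info['col1']), 'col2'].reset_index(drop=True)

# concat aligns on the fresh index; if one column were shorter,
# its missing rows would be filled with NaN.
df_out = pd.concat([need.rename('Need Account'),
                    delete.rename('Delete Account')], axis=1)
print(df_out)
```

The rows keep the order in which the names appear in the input, so they will not necessarily match the order shown in the question's expected output.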
IIUC, it doesn't seem like there is much 'structure' to be maintained from your input dataframe, so you could use sets to compare membership in groups directly.
nd = {'col1': ["Abraham Hansen","Demetrius McMahon","Hilary Emerson","Amelia H. Hayden","Abraham Oliver"],
'col2': ["Abraham Hansen","Abe Oliver","Hillary Emerson","DJ McMahon","Amelia H. Hayden"]}
df = pd.DataFrame(data=nd)
col1 = set(df['col1'])
col2 = set(df['col2'])
need = col1 - col2
delete = col2 - col1
print('need = ', need)
print('delete = ', delete)
yields
need = {'Hilary Emerson', 'Demetrius McMahon', 'Abraham Oliver'}
delete = {'Hillary Emerson', 'DJ McMahon', 'Abe Oliver'}
You could then place the results in a new dataframe:
data = {'need':list(need), 'delete':list(delete)}
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()
(Edited to account for the possibility that need and delete are of unequal length.)
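To illustrate the unequal-length case, here is a small sketch using a hypothetical, shortened pair of name lists (`source_a` and `source_b` are made up for this example and are not the question's data). Because sets are unordered, the results are sorted before being placed in the frame so the row order is reproducible.

```python
import pandas as pd

# Hypothetical inputs of different lengths, for illustration only.
source_a = ["Abraham Hansen", "Demetrius McMahon", "Amelia H. Hayden"]
source_b = ["Abraham Hansen", "Abe Oliver", "DJ McMahon", "Amelia H. Hayden"]

need = set(source_a) - set(source_b)    # in a but not in b
delete = set(source_b) - set(source_a)  # in b but not in a

# sorted() gives a stable order; sets have none of their own.
data = {'need': sorted(need), 'delete': sorted(delete)}

# orient='index' builds one row per key and pads the shorter list with
# missing values; transposing turns those rows into columns.
new_df = pd.DataFrame.from_dict(data, orient='index').transpose()
print(new_df)
```

Passing `data` straight to `pd.DataFrame(data)` would raise an error here, since the two lists have different lengths.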