
Comparing Column Values in a Pandas Dataframe

I need to find where column values differ in a given Pandas dataframe.

I've assembled my dataframe using the techniques described here: compare two pandas data frame

Using this code, I can get the rows added and deleted between an old and a new dataset, where df1 is the old dataset and df2 is the newer one. They have the same schema.

m = df1.merge(df2, on=['ID', 'Name'], how='outer', suffixes=['', '_'])
adds = m.loc[m.GPA_.notnull() & m.GPA.isnull()]
deletes = m.loc[m.GPA_.isnull() & m.GPA.notnull()]

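What I want to do is filter out the adds and deletes from the merged dataframe, then compare the column values like this:

# compare only the non-key columns; the key columns have no "_" counterpart
for field in df1.columns.difference(['ID', 'Name']):
    m["diff_%s" % field] = m[field] != m["%s_" % field]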

This should result in adding multiple boolean columns that check for value changes. So my question is, how can I filter out the add and delete rows first before I apply this column logic?

Additional Information:

import pandas as pd

_data_orig = [
    [1, "Bob", 3.0],
    [2, "Sam", 2.0],
    [3, "Jane", 4.0]
]

_data_new = [
    [1, "Bob", 3.2],
    [3, "Jane", 3.9],
    [4, "John", 1.2],
    [5, "Lisa", 2.2]
]
_columns = ["ID", "Name", "GPA"]

df1 = pd.DataFrame(data=_data_orig, columns=_columns)
df2 = pd.DataFrame(data=_data_new, columns=_columns)

m = df1.merge(df2, on=['ID', 'Name'], how='outer', suffixes=['', '_'])
adds = m.loc[m.GPA_.notnull() & m.GPA.isnull()]
deletes = m.loc[m.GPA_.isnull() & m.GPA.notnull()]

# TODO: add code to remove adds/deletes here
# array should now be: [[1, "Bob", 3.2],
#        [3, "Jane", 3.9]]
# compare only the non-key columns; the key columns have no "_" counterpart
for field in df1.columns.difference(['ID', 'Name']):
    m["diff_%s" % field] = m[field] != m["%s_" % field]
# results in:
# dataframe with columns ['ID', 'Name', 'GPA', 'GPA_', 'diff_GPA']
# ... DO other stuff
# write to csv

You can use Index.union to combine both indexes and then drop the rows in idx:

idx = adds.index.union(deletes.index)
print (idx)
Int64Index([1, 3, 4], dtype='int64')

print (m.drop(idx))
   ID  Name  GPA  GPA_
0   1   Bob  3.0   3.2
2   3  Jane  4.0   3.9
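
To tie this back to the question, here is a minimal sketch that drops the adds and deletes and then builds the diff columns on what remains. The names keys, common and value_cols are introduced here for illustration only and are not part of the original code; the loop also assumes that every non-key column has a "_"-suffixed counterpart after the merge.

```python
import pandas as pd

df1 = pd.DataFrame([[1, "Bob", 3.0], [2, "Sam", 2.0], [3, "Jane", 4.0]],
                   columns=["ID", "Name", "GPA"])
df2 = pd.DataFrame([[1, "Bob", 3.2], [3, "Jane", 3.9], [4, "John", 1.2], [5, "Lisa", 2.2]],
                   columns=["ID", "Name", "GPA"])

keys = ["ID", "Name"]  # hypothetical helper name for the merge keys
m = df1.merge(df2, on=keys, how="outer", suffixes=("", "_"))
adds = m.loc[m.GPA_.notnull() & m.GPA.isnull()]
deletes = m.loc[m.GPA_.isnull() & m.GPA.notnull()]

# Keep only the rows that exist in both datasets.
common = m.drop(adds.index.union(deletes.index)).copy()

# Compare each non-key column with its suffixed counterpart.
value_cols = [c for c in df1.columns if c not in keys]
for field in value_cols:
    common["diff_%s" % field] = common[field] != common["%s_" % field]

print(common)
```

The .copy() is there so that the new diff_ columns are added to an independent dataframe rather than to a slice of m.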

Another solution uses boolean indexing:

mask = ~((m.GPA_.notnull() & m.GPA.isnull()) | ( m.GPA_.isnull() & m.GPA.notnull()))
print (mask)
0     True
1    False
2     True
3    False
4    False
dtype: bool

print (m[mask])
   ID  Name  GPA  GPA_
0   1   Bob  3.0   3.2
2   3  Jane  4.0   3.9
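
Both approaches above infer adds and deletes from nulls in GPA, so they would misclassify a row whose GPA is genuinely missing in one dataset. As a possible alternative that is not part of the original answer, merge accepts indicator=True, which adds a _merge column recording where each row came from. A sketch, assuming the same sample data:

```python
import pandas as pd

df1 = pd.DataFrame([[1, "Bob", 3.0], [2, "Sam", 2.0], [3, "Jane", 4.0]],
                   columns=["ID", "Name", "GPA"])
df2 = pd.DataFrame([[1, "Bob", 3.2], [3, "Jane", 3.9], [4, "John", 1.2], [5, "Lisa", 2.2]],
                   columns=["ID", "Name", "GPA"])

# indicator=True adds a "_merge" column with values
# "left_only", "right_only" or "both".
m = df1.merge(df2, on=["ID", "Name"], how="outer",
              suffixes=("", "_"), indicator=True)

adds = m[m["_merge"] == "right_only"]     # only in the new dataset
deletes = m[m["_merge"] == "left_only"]   # only in the old dataset
common = m[m["_merge"] == "both"].drop(columns="_merge")

print(common)
```

With this variant the classification depends only on whether the key pair was found on each side, not on the contents of any value column.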
