
How to merge rows of a data set based on some conditions

I have a table of data (sourced from a CSV file) that I want to read in and process with some logic used to consolidate rows. Here is an example of the data:

john,john@domain.com,50
john doe,john@domain.com,10
john doe,john.doe@domain.com,100
mary,mary@domain.com,500

This data represents a table with 3 columns and 4 rows. The columns are, in order: a name ("first" or "first last"), an email address, and the amount of money that person has.
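For reference, here is a minimal sketch of reading that layout with the standard csv module. The helper name read_rows is made up for illustration, and it assumes there is no header row and that the amounts are whole numbers.

```python
import csv
import io

# Sample data from the question; a real script would open the CSV file instead.
SAMPLE = """\
john,john@domain.com,50
john doe,john@domain.com,10
john doe,john.doe@domain.com,100
mary,mary@domain.com,500
"""

def read_rows(stream):
    """Return a list of (name, email, money) tuples from a headerless CSV stream.

    Assumption: the money column holds whole numbers, so int() is enough.
    """
    rows = []
    for name, email, money in csv.reader(stream):
        rows.append((name.strip(), email.strip(), int(money)))
    return rows

rows = read_rows(io.StringIO(SAMPLE))
```

With a real file, something like `with open(path, newline="") as f: rows = read_rows(f)` should work the same way.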

The goal of my program is to consolidate information for the same user. The challenge is determining which users are in fact the same person. For example, the first 3 rows are the same person: "john doe" appears under two different names and two email addresses. The logic for how I determine if someone is the "same" as another person is as follows:

  1. If two names with both a first and last name are identical, they are the same person. We ignore comparison of names with no last name because that is too ambiguous.
  2. If two rows have an identical email address, that is the same person. Doesn't matter if the names are different.
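The two rules above could be written as a small predicate. This is only a sketch: the function names are hypothetical, rows are assumed to be (name, email, money) tuples, "has a last name" is taken to mean the name has at least two whitespace-separated parts, and comparisons are exact (case-sensitive).

```python
def has_last_name(name):
    """True if the name has at least two whitespace-separated parts."""
    return len(name.split()) >= 2

def same_person(row_a, row_b):
    """Apply rule 1 (full names) first, then rule 2 (email addresses)."""
    name_a, email_a, _ = row_a
    name_b, email_b, _ = row_b
    # Rule 1: identical names count only when a last name is present.
    if has_last_name(name_a) and name_a == name_b:
        return True
    # Rule 2: an identical email address is enough, whatever the names are.
    return email_a == email_b
```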

The precedence of the comparisons should be:

  1. Match names first
  2. Match email addresses second

When we consolidate, we need to keep track of:

  1. Multiple names a person is known by
  2. Multiple email addresses a person is known by
  3. A sum of the amount of money they own
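One possible shape for a consolidated record that tracks those three things is sketched below. The class name Person and the method absorb are assumptions for illustration, not anything from an existing library.

```python
from dataclasses import dataclass, field

@dataclass
class Person:
    names: list = field(default_factory=list)   # every name seen, in first-seen order
    emails: list = field(default_factory=list)  # every email seen, in first-seen order
    money: int = 0                              # running total

    def absorb(self, other):
        """Fold another record into this one, skipping duplicate names and emails."""
        for name in other.names:
            if name not in self.names:
                self.names.append(name)
        for email in other.emails:
            if email not in self.emails:
                self.emails.append(email)
        self.money += other.money

person = Person(["john"], ["john@domain.com"], 50)
person.absorb(Person(["john doe"], ["john@domain.com"], 10))
```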

So if I work this out iteratively, based on the data set above, the first iteration (consolidate by name) yields this result:

Name      | Email(s)                             | Money
-----------------------------------------------------------
john      | john@domain.com                      | 50
john doe  | john@domain.com, john.doe@domain.com | 110
mary      | mary@domain.com                      | 500
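A rough sketch of this first pass, assuming rows are (name, email, money) tuples and that a name containing a space counts as "first last". The function name consolidate_by_name is hypothetical.

```python
rows = [
    ("john", "john@domain.com", 50),
    ("john doe", "john@domain.com", 10),
    ("john doe", "john.doe@domain.com", 100),
    ("mary", "mary@domain.com", 500),
]

def consolidate_by_name(rows):
    groups = []        # consolidated records, in first-seen order
    by_full_name = {}  # full name -> its record in groups
    for name, email, money in rows:
        is_full = " " in name
        record = by_full_name.get(name) if is_full else None
        if record is None:
            # First-name-only rows always start a record of their own.
            record = {"names": [name], "emails": [], "money": 0}
            groups.append(record)
            if is_full:
                by_full_name[name] = record
        if email not in record["emails"]:
            record["emails"].append(email)
        record["money"] += money
    return groups

first_pass = consolidate_by_name(rows)
```

On the sample data this gives three records, with the two "john doe" rows combined into one holding both email addresses and a total of 110.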

The second iteration, consolidation by email, yields this final result:

Name(s)         | Email(s)                             | Money
-----------------------------------------------------------
john doe, john  | john@domain.com, john.doe@domain.com | 160
mary            | mary@domain.com                      | 500
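The second pass might look something like this sketch. It takes the records from the first iteration (written out by hand here as dicts) and folds together any records that share an email address. The function name consolidate_by_email and the dict layout are assumptions.

```python
first_pass = [
    {"names": ["john"], "emails": ["john@domain.com"], "money": 50},
    {"names": ["john doe"],
     "emails": ["john@domain.com", "john.doe@domain.com"], "money": 110},
    {"names": ["mary"], "emails": ["mary@domain.com"], "money": 500},
]

def consolidate_by_email(records):
    merged = []  # records whose email sets are pairwise disjoint
    for rec in records:
        target = {"names": list(rec["names"]),
                  "emails": list(rec["emails"]),
                  "money": rec["money"]}
        keep = []
        for other in merged:
            if set(other["emails"]) & set(target["emails"]):
                # Shared email: fold the earlier record into the current one.
                for name in other["names"]:
                    if name not in target["names"]:
                        target["names"].append(name)
                for email in other["emails"]:
                    if email not in target["emails"]:
                        target["emails"].append(email)
                target["money"] += other["money"]
            else:
                keep.append(other)
        merged = keep + [target]
    return merged

final = consolidate_by_email(first_pass)
```

Because the records already in merged never share an email with each other, checking each of them once against the incoming record is enough, even when that record has several addresses.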

I'd like to write a Python 3 script that performs this type of consolidation of data. I've made various attempts at this, but it always gets really nasty: I end up with tons of nested loops or list comprehensions. I haven't gotten anything working yet, so I unfortunately don't have anything to share.

My gut feel is that there is a pythonic one- or two-liner somewhere to do this.

A simple way is to create a dictionary keyed by an id, where each entry holds the names, emails, and money for one person. For each row, search whether the name or the email is already in the dictionary. If yes, update that entry; otherwise add the name and email to the dictionary under a new id. The code will look like the following:

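# Rows as read from the CSV file (sample data from the question)
rows = [('john', 'john@domain.com', 50), ('john doe', 'john@domain.com', 10),
        ('john doe', 'john.doe@domain.com', 100), ('mary', 'mary@domain.com', 500)]

data_dict = {}  # id -> {'Names': [...], 'Emails': [...], 'Money': total}
for name, email, money in rows:
    for person in data_dict.values():
        # Names match only when they include a last name; emails always match
        if (' ' in name and name in person['Names']) or email in person['Emails']:
            break
    else:
        # No existing entry matched: add one under a new id
        person = data_dict[str(len(data_dict) + 1)] = {'Names': [], 'Emails': [], 'Money': 0}
    # Update the matched (or newly added) entry
    if name not in person['Names']:
        person['Names'].append(name)
    if email not in person['Emails']:
        person['Emails'].append(email)
    person['Money'] += money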
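One caveat with a single pass over the dictionary: if a later row links two entries that were already created separately, they stay separate. A different technique that handles this, not part of the answer above, is a disjoint-set (union-find) over the row indices. The sketch below assumes rows are (name, email, money) tuples and that a name containing a space is a full name. The function name consolidate is hypothetical.

```python
def consolidate(rows):
    parent = list(range(len(rows)))  # each row starts as its own group

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # earliest row stays the root

    # Link every row to the first row seen with the same full name or email.
    first_seen = {}
    for i, (name, email, _) in enumerate(rows):
        keys = [("email", email)]
        if " " in name:  # first-name-only rows are never matched by name
            keys.append(("name", name))
        for key in keys:
            union(i, first_seen.setdefault(key, i))

    # Collect names, emails and totals per group.
    groups = {}
    for i, (name, email, money) in enumerate(rows):
        g = groups.setdefault(find(i), {"names": [], "emails": [], "money": 0})
        if name not in g["names"]:
            g["names"].append(name)
        if email not in g["emails"]:
            g["emails"].append(email)
        g["money"] += money
    return list(groups.values())

sample = [
    ("john", "john@domain.com", 50),
    ("john doe", "john@domain.com", 10),
    ("john doe", "john.doe@domain.com", 100),
    ("mary", "mary@domain.com", 500),
]
result = consolidate(sample)
```

With this approach the name-then-email precedence no longer affects which rows end up grouped together, only the order in which names and emails are listed.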
