
Compare two CSV files with an unknown number of columns and column names

Thanks in advance for any advice. First-time poster here, so I'll do my best to put in all the required info. I am also quite a beginner with Python: I have been doing some online tutorials and some copy/paste coding from StackOverflow, so it's FrankenCoding... I'm probably approaching this wrong...

I need to compare two CSV files that will have a changing number of columns. There will only ever be 2 columns that match, one in each file (for example, email_address in one file and EMAIL in the other). Both files will have headers; however, the names of these headers may change. The file sizes may be anywhere from a few thousand lines up to 2,000,000+, with potentially 100+ columns (but more likely a handful).

Output goes to a third file, 'results.csv', containing all the info. It may be a merge (all unique entries), a subtract (remove entries present in one or the other) or an intersect (all entries present in both).

I have searched here and found a lot of good information, but all of the examples I saw had a fixed number of columns in the files. I've tried dict and DictReader, and I know the answer is in there somewhere, but right now I'm a bit confused. Since I haven't made any progress in several days, and I can only devote so much time to this, I'm hoping that I can get a nudge in the right direction.

Ideally, I want to learn how to do it myself, which means understanding how the data is 'moving around'.

Extracts of the CSV files are below. I didn't add more columns than (I think) necessary. The dataset I have now will match on originalid/UID or emailaddress/email, but this may not always be the case.

Original.csv

"originalid","emailaddress",""
"12345678","Bob@mail.com",""
"23456789","NORMA@EMAIL.COM",""
"34567890","HENRY@some-mail.com",""
"45678901","Analisa@sports.com",""
"56789012","greta@mail.org",""
"67890123","STEVEN@EMAIL.ORG",""

Compare.CSV

"email","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"Bob@mail.com",,,"true"
"NORMA@EMAIL.COM",,,"true"
"HENRY@some-mail.com",,,"true"
"Henrietta@AWESOME.CA",,,"true"
"NORMAN@sports.CA",,,"true"
"albertina@justemail.CA",,,"true"

Data in results.csv should be all columns from Original.CSV + all columns in Compare.csv, but not the matching one (email) :

"originalid","emailaddress","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"12345678","Bob@mail.com","",,,"true"
"23456789","NORMA@EMAIL.COM","",,,"true"
"34567890","HENRY@some-mail.com","",,,"true"

Here are my results as they are now:

email,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,,,true,"['12345678', 'Bob@mail.com', '']"
NORMA@EMAIL.COM,,,true,"['23456789', 'NORMA@EMAIL.COM', '']"
HENRY@some-mail.com,,,true,"['34567890', 'HENRY@some-mail.com', '']"

And here's where I'm at with the code. The print statement returns matching data from the files to the screen but not to the file, so I'm missing something in there.
Also, I'm not getting the headers from the Original.csv file, although its data is coming in.

import csv

def get_column_from_file(filename, column_name):
    f = open(filename, 'r')
    reader = csv.reader(f)
    headers = next(reader, None)
    i = 0
    max = (len(headers))
    while i < max:
        if headers[i] == column_name:
            column_header = i
 #       print(headers[i])
        i = i + 1
    return(column_header)

file_to_check = "Original.csv"
file_console = "Compare.csv"

column_to_read = get_column_from_file(file_console, 'email')
column_to_compare = get_column_from_file(file_to_check, 'emailaddress')

with open(file_console, 'r') as master:
    master_indices = dict((r[1], r) for i, r in enumerate(csv.reader(master)))

with open('Compare.csv', 'r') as hosts:
    with open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []))

        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                print (row +[master_indices.get(row[0])])
                writer.writerow(row +[master_indices.get(row[0])])

Thanks for your time!

Pat

Right now it looks like you only use writerow once for the header:

writer.writerow(next(reader, []))

As francisco pointed out, uncommenting that last line may fix your problem. You can do this by removing the "#" at the beginning of the line.
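For reference, here is a minimal sketch of the question's loop with two things changed, assuming the goal is the intersect case. First, the lookup dict is keyed on the matching column, found by header name, rather than on a fixed position. Second, the matched fields are joined with `row + extra`, not `row + [extra]`, because wrapping a list in another list is what produces the `"['12345678', 'Bob@mail.com', '']"` cell in the results above. The function name `intersect_csv` and its parameters are made up for this illustration, and it assumes every data row is at least as long as the position of its key column.

```python
import csv

def intersect_csv(original_path, compare_path, out_path, original_key, compare_key):
    """Write each row of original_path whose key also appears in compare_path,
    followed by the other columns of the matching compare row."""
    # Index the compare file once: key value -> that row without the key column.
    with open(compare_path, newline='') as f:
        reader = csv.reader(f)
        compare_headers = next(reader)
        c = compare_headers.index(compare_key)
        extra_headers = compare_headers[:c] + compare_headers[c + 1:]
        index = {row[c]: row[:c] + row[c + 1:] for row in reader if row}

    with open(original_path, newline='') as f, \
            open(out_path, 'w', newline='') as out:
        reader = csv.reader(f)
        writer = csv.writer(out)
        original_headers = next(reader)
        o = original_headers.index(original_key)
        # Headers from both files, minus the duplicated key column.
        writer.writerow(original_headers + extra_headers)
        for row in reader:
            if not row:
                continue  # skip blank lines
            extra = index.get(row[o])
            if extra is not None:
                # list + list concatenates the fields into one flat row
                writer.writerow(row + extra)
```

With the sample files this would be called as `intersect_csv('Original.csv', 'Compare.csv', 'results.csv', 'emailaddress', 'email')`. Only the compare file is held in memory; the original file is streamed row by row.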

I like that you want to do this yourself, and recognize a need to "understand how the data is moving around." This is exactly how you should be thinking of the problem: focusing on the movement of data rather than the result. Some people may disagree with me, but I think this is a good philosophy to follow as it will make future reuse easier.

You're not trying to build a tool that combines two CSVs; you're trying to organize data (that happens to come from a CSV) according to a common reference (email address) and output the result as a CSV. Because you are talking about potentially large data sets (2,000,000+ rows with potentially 100+ columns), it is important to pay attention to the asymptotic runtime. If you do not know what this is, I recommend you read up on Big-O notation and asymptotic algorithm analysis. You might be okay without this.
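To make the runtime point concrete, here is a small sketch with synthetic data (the addresses and variable names are invented for illustration). Testing membership in a list walks the list each time, so matching every row of one file against a list built from the other grows with the product of the two sizes. Building a dict once and looking keys up in it takes roughly constant time per lookup on average.

```python
# Synthetic stand-in for one file's email column.
emails = ['user{}@mail.com'.format(i) for i in range(1000)]
wanted = ['user999@mail.com', 'nobody@mail.com']

# Linear scan: each `in` test may walk the whole list.
found_by_scan = [w for w in wanted if w in emails]

# Hash lookup: build the index once, then each `in` test is O(1) on average.
index = {email: position for position, email in enumerate(emails)}
found_by_index = [w for w in wanted if w in index]

# Both approaches find the same matches; only the cost differs.
assert found_by_scan == found_by_index
```

This is why the approach below is built around dictionaries keyed on the email address.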

First, decide which column from each CSV is your key. You've already done this: 'email' for 'Compare.csv' and 'emailaddress' for 'Original.csv'. Now, build yourself a function that produces a dictionary from a CSV, keyed on that column.

import csv

def get_dict_from_csv(path_to_csv, key):
    """Map each row's value in the key column to the rest of that row."""
    with open(path_to_csv, 'r', newline='') as f:  # newline='' as the csv docs recommend
        reader = csv.reader(f)
        headers, *rest = reader  # requires python3; reads every row into memory
    key_index = headers.index(key)  # find index of key
    # dictionary comprehensions are your friend, just think about what you want the dict to look like
    d = {row[key_index]: row[:key_index] + row[key_index+1:]  # +1 to skip the email entry
         for row in rest}
    headers.remove(key)
    # add headers so you know what the information in the dict is
    # (this assumes no row has the literal key value 'HEADERS')
    d['HEADERS'] = headers
    return d

Now you can call this function on both of your CSVs.

file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')

Now you have two dicts which are keyed off the same information. Now we need a function to combine these into one dict.

def combine_dicts(*dicts):
    d, *rest = dicts  # requires python3
    # iteratively pull other dicts into the first one, d
    for r in rest:
        original_headers = d['HEADERS'][:]
        new_headers = r['HEADERS'][:]
        # copy headers
        d['HEADERS'].extend(new_headers)
        # find missing keys
        s = set(d.keys()) - set(r.keys())  # keys present in d but not in r
        for k in s:
            d[k].extend(['', ] * len(new_headers))
        del r['HEADERS']  # we don't want to copy this a second time in the loop below
        for k, v in r.items():
            # use setdefault in case the key didn't exist in the first dict
            d.setdefault(k, ['', ] * len(original_headers)).extend(v)
    return d

Now you have one dict which has all the information you want, all you need to do is write it back as a CSV.

def write_dict_to_csv(output_file, d, include_key=False):
    with open(output_file, 'w', newline='') as results:
        writer = csv.writer(results)
        # email isn't in your HEADERS, so you'll need to add it
        if include_key:
            headers = ['email',] + d['HEADERS']
        else:
            headers = d['HEADERS']
        writer.writerow(headers)
        # now remove it from the dict so we can iterate over it without including it twice
        del d['HEADERS']
        for k, v in d.items():
            if include_key:
                row = [k,] + v
            else:
                row = v
            writer.writerow(row)

And that should be it. Calling all of this is just

file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
results_dict = combine_dicts(file_to_check_dict, file_console_dict)
write_dict_to_csv('results.csv', results_dict)

And you can easily see how this can be extended to arbitrarily many dictionaries.

You said you didn't want the email to be in the final CSV. This is counter-intuitive to me, so I made it an option in write_dict_to_csv() in case you change your mind.
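The question also mentions subtract and intersect. One possible way to get them, sketched here under the assumption that the dicts have the shape produced by get_dict_from_csv() (key value mapped to the rest of the row, plus a 'HEADERS' entry), is to compare the key sets and filter each dict before combining or writing. The helper name `filter_keys` and the tiny literal dicts are made up for this illustration.

```python
def filter_keys(d, keep):
    """Return a new dict limited to the keys in `keep`, always keeping 'HEADERS'."""
    return {k: v for k, v in d.items() if k == 'HEADERS' or k in keep}

# Small stand-ins for the dicts built from Original.csv and Compare.csv.
original = {'HEADERS': ['originalid'],
            'Bob@mail.com': ['12345678'],
            'greta@mail.org': ['56789012']}
compare = {'HEADERS': ['EMAIL_USERS'],
           'Bob@mail.com': ['true'],
           'NORMAN@sports.CA': ['true']}

original_keys = set(original) - {'HEADERS'}
compare_keys = set(compare) - {'HEADERS'}

both = original_keys & compare_keys           # intersect: present in both files
only_original = original_keys - compare_keys  # subtract: present only in the original

intersect_original = filter_keys(original, both)
intersect_compare = filter_keys(compare, both)
subtract = filter_keys(original, only_original)
```

Passing `intersect_original` and `intersect_compare` to combine_dicts() should then give the intersect, and `subtract` can go straight to write_dict_to_csv(). Because filter_keys() returns new dicts, the originals still contain all their keys afterwards, although the row lists themselves are shared rather than copied.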

When I run all the above with include_key=True passed to write_dict_to_csv(), I get

email,originalid,,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,12345678,,,,true
NORMA@EMAIL.COM,23456789,,,,true
HENRY@some-mail.com,34567890,,,,true
Analisa@sports.com,45678901,,,,,
greta@mail.org,56789012,,,,,
STEVEN@EMAIL.ORG,67890123,,,,,
Henrietta@AWESOME.CA,,,,,true
NORMAN@sports.CA,,,,,true
albertina@justemail.CA,,,,,true
