
Compare 2 .CSV with unknown number of columns and names

Thanks in advance for any advice. First-time poster here, so I'll do my best to put in all required info. I am also quite a beginner with Python. I have been doing some online tutorials and some copy/paste coding from StackOverflow (it's FrankenCoding), so I'm probably approaching this wrong.

I need to compare two CSV files that will have a changing number of columns. There will only ever be 2 columns that match (for example, email_address in one file and EMAIL in the other). Both files will have headers, but the names of these headers may change. The files may be anywhere from a few thousand lines up to 2,000,000+, with potentially 100+ columns (but more likely a handful).

Output goes to a third file, 'results.csv', containing all the info. It may be a merge (all unique entries), a subtract (remove entries present in one or the other) or an intersect (all entries present in both).

I have searched here and found a lot of good information, but all of the examples I saw had a fixed number of columns in the files. I've tried dict and DictReader, and I know the answer is in there somewhere, but right now I'm a bit confused. Since I haven't made any progress in several days, and I can only devote so much time to this, I'm hoping to get a nudge in the right direction.

Ideally, I want to learn how to do it myself, which means understanding how the data is 'moving around'.

Extracts of the CSV files are below. I didn't add more columns than (I think) necessary. The dataset I have now will match on originalid/UID or emailaddress/email, but this may not always be the case.

Original.csv

"originalid","emailaddress",""
"12345678","Bob@mail.com",""
"23456789","NORMA@EMAIL.COM",""
"34567890","HENRY@some-mail.com",""
"45678901","Analisa@sports.com",""
"56789012","greta@mail.org",""
"67890123","STEVEN@EMAIL.ORG",""

Compare.csv

"email","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"Bob@mail.com",,,"true"
"NORMA@EMAIL.COM",,,"true"
"HENRY@some-mail.com",,,"true"
"Henrietta@AWESOME.CA",,,"true"
"NORMAN@sports.CA",,,"true"
"albertina@justemail.CA",,,"true"

Data in results.csv should be all columns from Original.csv plus all columns from Compare.csv, except the matching one (email):

"originalid","emailaddress","","DATEOFINVALIDATION_WITH_TIME","OPTOUTDATE_WITH_TIME","EMAIL_USERS"
"12345678","Bob@mail.com","",,,"true"
"23456789","NORMA@EMAIL.COM","",,,"true"
"34567890","HENRY@some-mail.com","",,,"true"

Here are my results as they are now:

email,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,,,true,"['12345678', 'Bob@mail.com', '']"
NORMA@EMAIL.COM,,,true,"['23456789', 'NORMA@EMAIL.COM', '']"
HENRY@some-mail.com,,,true,"['34567890', 'HENRY@some-mail.com', '']"

And here's where I'm at with the code. The print statement shows the matching data from the files on screen, but it does not end up in the file the way I expect, so I'm missing something in there.
Also, I'm not getting the headers from the Original.csv file, although its data is coming in.

import csv

def get_column_from_file(filename, column_name):
    f = open(filename, 'r')
    reader = csv.reader(f)
    headers = next(reader, None)
    i = 0
    max = (len(headers))
    while i < max:
        if headers[i] == column_name:
            column_header = i
 #       print(headers[i])
        i = i + 1
    return(column_header)

file_to_check = "Original.csv"
file_console = "Compare.csv"

column_to_read = get_column_from_file(file_console, 'email')
column_to_compare = get_column_from_file(file_to_check, 'emailaddress')

with open(file_console, 'r') as master:
    master_indices = dict((r[1], r) for i, r in enumerate(csv.reader(master)))

with open('Compare.csv', 'r') as hosts:
    with open('results.csv', 'w', newline='') as results:
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []))

        for row in reader:
            index = master_indices.get(row[0])
            if index is not None:
                print (row +[master_indices.get(row[0])])
                writer.writerow(row +[master_indices.get(row[0])])

Thanks for your time!

Pat

Right now it looks like you only use writerow once, for the header:

writer.writerow(next(reader, []))

As francisco pointed out, uncommenting that last line may fix your problem. You can do this by removing the "#" at the beginning of the line.
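A related thing to check, offered as a minimal sketch rather than a definitive fix: in the question's loop, `row + [master_indices.get(row[0])]` wraps the matched row in a list, so the whole row is written into a single cell. That would be consistent with the `"['12345678', 'Bob@mail.com', '']"` cells in the results shown in the question. Concatenating the two lists directly keeps one value per cell. The sample values are taken from the question; the variable names are illustrative.

```python
import csv
import io

compare_row = ["Bob@mail.com", "", "", "true"]
original_row = ["12345678", "Bob@mail.com", ""]

nested = compare_row + [original_row]  # last item is a list, written via str()
flat = compare_row + original_row      # every item is a plain string

# Write both rows to an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(nested)
writer.writerow(flat)

lines = buffer.getvalue().splitlines()
print(lines[0])  # the nested version: one quoted cell holding the list's repr
print(lines[1])  # the flat version: one value per cell
```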

I like that you want to do this yourself, and that you recognize a need to "understand how the data is moving around." This is exactly how you should be thinking about the problem: focusing on the movement of data rather than the result. Some people may disagree with me, but I think this is a good philosophy to follow, as it will make future reuse easier.

You're not trying to build a tool that combines two CSVs. You're trying to organize data (that happens to come from a CSV) according to a common reference (email address) and output the result as a CSV. Because you are talking about potentially large data sets (2,000,000+ rows with potentially 100+ columns), it is important to pay attention to the asymptotic runtime. If you do not know what this is, I recommend you read up on Big-O notation and asymptotic algorithm analysis. You might be okay without this.
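To make the runtime point concrete, here is a small sketch (the function names are made up for illustration) comparing two ways of finding emails present in both files. Checking membership in a list scans the list each time, so the work grows with the product of the two sizes. Checking membership in a set or dict is a constant-time lookup on average, so the work grows with the sum.

```python
def match_with_list(keys_a, keys_b):
    # "k in keys_b" scans the list, so this is O(len(keys_a) * len(keys_b)).
    return [k for k in keys_a if k in keys_b]


def match_with_set(keys_a, keys_b):
    lookup = set(keys_b)  # built once, O(len(keys_b))
    # Each "k in lookup" is O(1) on average, so the loop is O(len(keys_a)).
    return [k for k in keys_a if k in lookup]


original = ["Bob@mail.com", "NORMA@EMAIL.COM", "greta@mail.org"]
compare = ["Bob@mail.com", "NORMA@EMAIL.COM", "Henrietta@AWESOME.CA"]

# Both give the same answer; only the amount of work differs.
print(match_with_list(original, compare))
print(match_with_set(original, compare))
```

With a few rows the difference is invisible. With two files of 2,000,000 rows each, the list version could need on the order of 10^12 comparisons, which is why the approach below is built on dicts.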

First you decide what, from each CSV, is your key. You've already done this: 'email' for 'Compare.csv' and 'emailaddress' for 'Original.csv'. Now, build yourself a function to produce dictionaries from the CSV based on the key.

import csv

def get_dict_from_csv(path_to_csv, key):
    # newline='' is what the csv module docs recommend when opening a file for csv.reader
    with open(path_to_csv, 'r', newline='') as f:
        reader = csv.reader(f)
        headers, *rest = reader  # requires python3; reads the whole file into memory
    key_index = headers.index(key)  # find index of key
    # dictionary comprehensions are your friend, just think about what you want the dict to look like
    d = {row[key_index]: row[:key_index] + row[key_index+1:]  # +1 to skip the email entry
         for row in rest}
    headers.remove(key)
    # add headers so you know what the information in the dict is
    # (this assumes no row has the literal key 'HEADERS')
    d['HEADERS'] = headers
    return d

Now you can call this function on both of your CSVs.

file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')

Now you have two dicts which are keyed on the same information. Next we need a function to combine these into one dict.

def combine_dicts(*dicts):
    d, *rest = dicts  # requires python3
    # iteratively pull other dicts into the first one, d
    for r in rest:
        original_headers = d['HEADERS'][:]
        new_headers = r['HEADERS'][:]
        # copy headers
        d['HEADERS'].extend(new_headers)
        # find missing keys
        s = set(d.keys()) - set(r.keys())  # keys present in d but not in r
        for k in s:
            d[k].extend(['', ] * len(new_headers))
        del r['HEADERS']  # we don't want to copy this a second time in the loop below
        for k, v in r.items():
            # use setdefault in case the key didn't exist in the first dict
            d.setdefault(k, ['', ] * len(original_headers)).extend(v)
    return d
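If the `setdefault(...).extend(...)` line is hard to follow, this stripped-down sketch shows the same idea on toy data. The addresses, values, and column counts are made up for illustration. A key that already exists has its row extended. A key that is new first gets a row of blank padding, so its values still line up under the right headers.

```python
# Rows from the first file, keyed by email, each 2 columns wide.
first = {"a@x.com": ["1", ""]}
first_width = 2

# Rows from the second file, keyed by email, each 1 column wide.
incoming = {"a@x.com": ["true"], "b@x.com": ["false"]}

for key, values in incoming.items():
    # Existing key: setdefault returns the stored row, which is then extended.
    # New key: setdefault stores and returns fresh blank padding first.
    first.setdefault(key, [""] * first_width).extend(values)

print(first)
```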

Now you have one dict which has all the information you want. All you need to do is write it back out as a CSV.

def write_dict_to_csv(output_file, d, include_key=False):
    with open(output_file, 'w', newline='') as results:
        writer = csv.writer(results)
        # email isn't in your HEADERS, so you'll need to add it
        if include_key:
            headers = ['email',] + d['HEADERS']
        else:
            headers = d['HEADERS']
        writer.writerow(headers)
        # now remove it from the dict so we can iterate over it without including it twice
        del d['HEADERS']
        for k, v in d.items():
            if include_key:
                row = [k,] + v
            else:
                row = v
            writer.writerow(row)

And that should be it. Calling all of this is just

file_console_dict = get_dict_from_csv('Compare.csv', 'email')
file_to_check_dict = get_dict_from_csv('Original.csv', 'emailaddress')
results_dict = combine_dicts(file_to_check_dict, file_console_dict)
write_dict_to_csv('results.csv', results_dict)

And you can easily see how this can be extended to arbitrarily many dictionaries.

You said you didn't want the email to be in the final CSV. This is counter-intuitive to me, so I made it an option in write_dict_to_csv() in case you change your mind.

When I run all the above (with include_key=True passed to write_dict_to_csv(), which is why the email column appears), I get

email,originalid,,,DATEOFINVALIDATION_WITH_TIME,OPTOUTDATE_WITH_TIME,EMAIL_USERS
Bob@mail.com,12345678,,,,true
NORMA@EMAIL.COM,23456789,,,,true
HENRY@some-mail.com,34567890,,,,true
Analisa@sports.com,45678901,,,,,
greta@mail.org,56789012,,,,,
STEVEN@EMAIL.ORG,67890123,,,,,
Henrietta@AWESOME.CA,,,,,true
NORMAN@sports.CA,,,,,true
albertina@justemail.CA,,,,,true
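The output above is the merge case. The question also mentions subtract and intersect. One possible way to get those, sketched here under the assumption that both files have already been loaded into dicts keyed by email, is to pick the keys to keep with set operations on the dict key views. The `select_keys` helper and the mode names are hypothetical, not part of the code above, and "subtract" is read here as "in the first file but not in the second".

```python
def select_keys(left, right, mode):
    """Return the set of keys to keep for the given comparison mode."""
    if mode == "merge":      # every key found in either dict
        return left.keys() | right.keys()
    if mode == "intersect":  # only keys found in both dicts
        return left.keys() & right.keys()
    if mode == "subtract":   # keys in left that are missing from right
        return left.keys() - right.keys()
    raise ValueError(f"unknown mode: {mode!r}")


# Toy data shaped like the dicts built above (without the 'HEADERS' entry).
original = {"Bob@mail.com": ["12345678"], "greta@mail.org": ["56789012"]}
compare = {"Bob@mail.com": ["true"], "NORMAN@sports.CA": ["true"]}

for mode in ("merge", "intersect", "subtract"):
    print(mode, sorted(select_keys(original, compare, mode)))
```

The selected keys could then be used to filter the combined dict before writing it out. Note that the comparison is case-sensitive, so 'Bob@mail.com' and 'bob@mail.com' would count as different keys unless you normalize them first.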
