Comparing the first columns in two csv files using python and printing matches

I have two csv files, each of which contains ngrams that look like this:

drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8

Each line is a three-word phrase followed by a frequency count followed by a relative frequency.

I want to write a script that finds the ngrams that are in both csv files, divides their relative frequencies, and prints them to a new csv file. I want it to find a match whenever the three-word phrase matches a three-word phrase in the other file, and then divide the relative frequency of the phrase in the first csv file by the relative frequency of that same phrase in the second csv file. Then I want to print the phrase and the quotient of the two relative frequencies to a new csv file.

Below is as far as I've gotten. My script compares lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that this is because I'm finding the intersection between two entire sets, but I have no idea how to do this differently. Please forgive me; I'm new to coding. Any help you can give me to get a little closer would be such a big help.

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))

matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)

print matches
print len(matches)

This version does not dump res to a new file (tedious). The idea is that the first element of each row is the phrase and the other two are the frequencies. Use a dict instead of a set to do the matching and the mapping together.

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

res = {}
for k,v in f_dict.items():
    if k in s_dict:
        res[k] = float(v[1])/float(s_dict[k][1])

print(res)
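
The step skipped above, writing res to a new file, could look roughly like the sketch below. It assumes Python 3 (text-mode files opened with newline=''), whereas the code above targets Python 2. The helper name write_ratios, the temporary output location, and the sample values in res are made up for illustration.

```python
import csv
import os
import tempfile

def write_ratios(res, out_path):
    """Write each (phrase, ratio) pair in `res` as one CSV row."""
    # newline='' lets the csv module control line endings (Python 3).
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for phrase, ratio in res.items():
            writer.writerow([phrase, ratio])

# Illustrative values only, not real results.
res = {"and since that": 2.0, "the state face": 3.0}
out_path = os.path.join(tempfile.gettempdir(), "matchedngrams.csv")
write_ratios(res, out_path)
```

Opening the file with "w" rather than "a" (as the question does) avoids appending to results left over from an earlier run.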

My script is comparing lines but only finds a match when the entire line (including the frequencies and relative frequencies) matches exactly. I realize that that is because I'm finding the intersection between two entire sets but I have no idea how to do this differently.

This is exactly what dictionaries are for: when you have a separate key and value (or when only part of the value is the key). So:

a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}

Now, you can't directly use set methods on dictionaries. Python 3 offers some help here, but you're using 2.7. So, you have to write it explicitly:

matches = {key for key in a_dict if key in b_dict}

Or:

matches = set(a_dict) & set(b_dict)

But you really don't need the set; all you want to do here is iterate over the matching keys. So:

for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])
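
As a concrete version of the loop above, here is a small sketch in which do_stuff_with becomes the division the question asks for. It assumes Python 3 and uses in-memory strings in place of the two files; the rows and numbers are invented for illustration. The last lines show the Python 3 help mentioned earlier: key views support set operators directly.

```python
import csv
import io

# Stand-ins for ngrams.csv and ngramstest.csv (all rows are made up).
file_a = io.StringIO(
    "drinks while strutting,4,4e-8\n"
    "and since that,6,6e-8\n"
)
file_b = io.StringIO(
    "and since that,3,3e-8\n"
    "some other phrase,1,1e-8\n"
)

a_dict = {row[0]: row for row in csv.reader(file_a)}
b_dict = {row[0]: row for row in csv.reader(file_b)}

ratios = {}
for key in a_dict:
    if key in b_dict:
        # Column 2 holds the relative frequency as a string.
        ratios[key] = float(a_dict[key][2]) / float(b_dict[key][2])

# Python 3 only: dict key views behave like sets.
shared = a_dict.keys() & b_dict.keys()
print(ratios)
print(shared)
```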

As a side note, you really don't need to build up the lists in the first place just to turn them into sets or dicts. Just build up the sets or dicts directly:

a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row

Also, if you know about comprehensions, all three versions are crying out to be converted:

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now any ONE of these (the reader can only be iterated once)
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}
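
One caveat about the three alternatives: a csv.reader is an iterator, so only the first of them to run would see any rows. A minimal sketch of that behaviour, assuming Python 3 and an in-memory file with made-up rows:

```python
import csv
import io

data = io.StringIO("and since that,6,4.3e-8\nthe state face,3,2.1e-8\n")
reader = csv.reader(data)

a_list = list(reader)                          # consumes every row
empty_dict = {row[0]: row for row in reader}   # reader is already exhausted

print(len(a_list))   # number of rows read on the first pass
print(empty_dict)    # nothing left for the second pass

# Building the dict from the list works, because a list can be re-read.
a_dict = {row[0]: row for row in a_list}
print(sorted(a_dict))
```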

You could store the relative frequencies from the 1st file in a dictionary, then iterate over the 2nd file and, if the 1st column matches anything seen in the first file, write the result directly to the output file:

import csv

tmp = {}

# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, count, rel in csv.reader(fr):  # `count` avoids shadowing the builtin abs()
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)

    # the 2nd input file will be processed per 1 line to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, count, rel in csv.reader(fr):  # `count` avoids shadowing the builtin abs()
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))

Very small numbers can run into underflow problems (see "What are arithmetic underflow and overflow in C?"), and repeated arithmetic on them makes that more likely. Values around 1e-8 are still far inside the range of a Python float, so this is a precaution rather than a necessity, but a common habit is to preprocess the relative frequencies by taking their logarithms:

>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08

Now to the reading of the csv. Since you're only manipulating the relative frequencies, your 2nd column doesn't matter, so let's skip it and save the first column (i.e. the phrases) as key and the third column (i.e. the relative frequency) as value:

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))

Now for the tricky part: you want to divide the relative frequency of each phrase in ngramdict2 by that of the same phrase in ngramdict1 (the question asks for the first file divided by the second; swap the operands if you need that direction), i.e.:

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1

Since we kept the relative frequencies in logarithmic units, we don't have to divide; we simply subtract. The result is the logarithm of the ratio, and math.exp converts it back to a plain ratio, i.e.:

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
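
The subtraction relies on the identity log(a/b) = log(a) - log(b). A quick check with the "drinks while strutting" values from the two sample files, also showing math.exp turning the log difference back into a plain ratio (the variable names here are only for illustration):

```python
import math

relfreq1 = 1.435486010883783160220299732E-8  # "drinks while strutting" in file 1
relfreq2 = 7.1774300544189156e-09            # same phrase in file 2

logrelfreq = math.log(relfreq2) - math.log(relfreq1)
print(logrelfreq)             # close to math.log(0.5)
print(math.exp(logrelfreq))   # close to relfreq2 / relfreq1
```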

And to get the phrases that occur in both, you won't need to check the phrases one by one; simply cast dictionary.keys() into a set and then do set1.intersection(set2), see https://docs.python.org/2/tutorial/datastructures.html

phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

print overlap_phrases

[out]:

set(['drinks while strutting', 'the state face', 'and since that'])

So now let's write the phrases out with the (log) relative frequency ratios:

with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

The ngramcombined.csv looks like this:

drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056
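
Because overlap_phrases is a set, the row order in the output file is not guaranteed. Below is a sketch of a variant that sorts the phrases and writes the plain ratio instead of the log difference. It assumes Python 3, uses an in-memory buffer instead of a real file, and fills ngramdict1 and ngramdict2 with small hand-made values rather than the data above.

```python
import csv
import io
import math

# Hypothetical log relative frequencies standing in for ngramdict1/ngramdict2.
ngramdict1 = {"the state face": math.log(3e-8), "and since that": math.log(4e-8)}
ngramdict2 = {"and since that": math.log(2e-8), "the state face": math.log(1e-8),
              "some silly ngram": math.log(1e-9)}

overlap_phrases = set(ngramdict1) & set(ngramdict2)

out = io.StringIO()  # stands in for open('ngramcombined.csv', 'w', newline='')
writer = csv.writer(out)
for p in sorted(overlap_phrases):   # sorted() gives a reproducible row order
    ratio = math.exp(ngramdict2[p] - ngramdict1[p])
    writer.writerow([p, round(ratio, 6)])

print(out.getvalue())
```

Using csv.writer instead of ",".join also takes care of quoting if a phrase ever contains a comma.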

Here's the full code:

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))


# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

If you like SUPER UNREADABLE but short code (in number of lines):

import csv, math
# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])])+ '\n')
