Comparing the first columns in two csv files using python and printing matches

I have two csv files, each containing ngrams that look like this:

drinks while strutting,4,1.435486010883783160220299732E-8
and since that,6,4.306458032651349480660899195E-8
the state face,3,2.153229016325674740330449597E-8

Each line is a three-word phrase, followed by a frequency count, followed by a relative frequency.

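I want to write a script that finds the ngrams that appear in both csv files, divides their relative frequencies, and prints the results to a new csv file. A match should be found whenever a three-word phrase in one file matches a three-word phrase in the other file; the relative frequency of the phrase in the first csv file should then be divided by the relative frequency of that same phrase in the second csv file. Finally, I want to print the phrase and the quotient of the two relative frequencies to a new csv file.

Below is as far as I have gotten. My script compares lines, but it only finds a match when the entire line (including the frequency and the relative frequency) matches exactly. I realize that this is because I am finding the intersection of two entire sets, but I have no idea how to do this differently. Please forgive me; I'm new to coding. Any help you can give to get me a little closer would be a big help.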

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

first_set = set(map(tuple, alist))
secnd_set = set(map(tuple, blist))

matches = set(first_set).intersection(secnd_set)

c = csv.writer(open("matchedngrams.csv", "a"))
c.writerow(matches)

print matches
print len(matches)

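This does not dump res to a new file (tedious). The idea is that the first element is the phrase and the other two are the frequencies. It uses a dict instead of a set to do the matching and the mapping together.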

import csv
import io 

alist, blist = [], []

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        alist.append(row)
with open("ngramstest.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter=',')
    for row in reader:
        blist.append(row)

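# Map each phrase (first column) to its remaining columns:
# [frequency, relative frequency].
f_dict = {e[0]:e[1:] for e in alist}
s_dict = {e[0]:e[1:] for e in blist}

res = {}
for k,v in f_dict.items():
    if k in s_dict:
        # v[1] is the relative frequency; divide file 1's value by file 2's.
        res[k] = float(v[1])/float(s_dict[k][1])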

print(res)
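
The step this answer skips, dumping res to a new file, could be sketched roughly as follows. This is only an illustration under a few assumptions: it targets Python 3 (text mode with newline=""), it uses made-up ratios in place of the real res, and it writes into a temporary directory so that no real file is touched. The output file name is the one from the question.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the `res` dict built above: phrase -> ratio.
res = {"drinks while strutting": 2.0, "and since that": 0.5}

# Assumption: a temporary directory is used so the sketch does not
# touch real files; swap in your own path as needed.
out_path = os.path.join(tempfile.mkdtemp(), "matchedngrams.csv")

# In Python 3 the csv module expects text mode with newline="".
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    for phrase, ratio in sorted(res.items()):
        writer.writerow([phrase, ratio])

# Read the file back to check what was written.
with open(out_path, newline="") as f:
    written = list(csv.reader(f))
```

Sorting the items is optional; it only makes the output order predictable.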

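"My script compares lines, but it only finds a match when the entire line (including the frequency and the relative frequency) matches exactly. I realize that this is because I am finding the intersection of two entire sets, but I have no idea how to do this differently."

This is exactly what dictionaries are for: when you have a separate key and value (or when only part of the value is the key). So: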

a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}
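
To see why keying by the phrase changes the result, here is a small sketch with shortened, made-up rows modeled on the question's data. It compares the whole-row intersection from the question with a lookup on the first column only.

```python
# Rows as csv.reader would yield them (all fields are strings).
# The numbers are shortened, made-up samples.
alist = [["and since that", "6", "4.3e-8"], ["the state face", "3", "2.1e-8"]]
blist = [["and since that", "3", "2.1e-8"], ["some silly ngram", "99", "1.2e-9"]]

# Whole-row comparison: no two rows are identical, so nothing matches.
row_matches = set(map(tuple, alist)) & set(map(tuple, blist))

# Keyed by phrase: only the first column has to agree.
a_dict = {row[0]: row for row in alist}
b_dict = {row[0]: row for row in blist}
key_matches = [key for key in a_dict if key in b_dict]
```

Here row_matches is empty, while key_matches contains the one phrase that both files share.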

Now, you can't use set methods directly on a dictionary. Python 3 provides some help here, but you are using 2.7. So you have to write it out explicitly:

matches = {key for key in a_dict if key in b_dict}

Or:

matches = set(a_dict) & set(b_dict)
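
For comparison, the Python 3 help mentioned above looks roughly like this. The sketch assumes Python 3, where dict.keys() returns a set-like view, and uses made-up dictionaries.

```python
# Made-up dictionaries keyed by phrase, as built above.
a_dict = {"and since that": ["and since that", "6", "4.3e-8"],
          "the state face": ["the state face", "3", "2.1e-8"]}
b_dict = {"and since that": ["and since that", "3", "2.1e-8"]}

# Python 3 only: dict.keys() returns a set-like view, so & works directly
# and produces an ordinary set of the shared keys.
matches = a_dict.keys() & b_dict.keys()
```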

But you don't really need the set; all you want to do here is iterate over the matches. So:

for key in a_dict:
    if key in b_dict:
        a_values = a_dict[key]
        b_values = b_dict[key]
        do_stuff_with(a_values[2], b_values[2])

As a side note, you don't actually need to build the lists first just to turn them into sets or dicts. Just build the sets or dicts directly:

a_set = set()
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_set.add(tuple(row))

a_dict = {}
with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    for row in reader:
        a_dict[row[0]] = row

Also, if you know about comprehensions, all three versions are crying out to be converted:

with open("ngrams.csv", "rb") as fileA:
    reader = csv.reader(fileA, delimiter=',')
    # Now any ONE of these (the reader is exhausted after a single pass)
    a_list = list(reader)
    a_set = {tuple(row) for row in reader}
    a_dict = {row[0]: row for row in reader}
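
The three lines in that listing are alternatives, because a csv.reader is a one-shot iterator. A minimal sketch of what happens when two of them are run against the same reader, assuming Python 3 and an in-memory string standing in for ngrams.csv:

```python
import csv
import io

# Assumption: an in-memory file stands in for ngrams.csv.
data = "and since that,6,4.3e-8\nthe state face,3,2.1e-8\n"

reader = csv.reader(io.StringIO(data))
a_list = list(reader)                      # consumes every row
a_dict = {row[0]: row for row in reader}   # nothing left to read

# Building the dict from the list (or from a fresh reader) works.
a_dict_from_list = {row[0]: row for row in a_list}
```

The second comprehension silently produces an empty dict, which is easy to miss.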

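You could store the relative frequencies from the first file in a dictionary, then iterate over the second file and, whenever the first column matches something seen in the first file, write the result directly to the output file: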

import csv

tmp = {}

# if 1 file is much larger than the other, load the smaller one here
# make sure it will fit into the memory
with open("ngrams.csv", "rb") as fr:
    # using tuple unpacking to extract fixed number of columns from each row
    for txt, abs, rel in csv.reader(fr):
        # converting strings like "1.435486010883783160220299732E-8"
        # to float numbers
        tmp[txt] = float(rel)

with open("matchedngrams.csv", "wb") as fw:
    writer = csv.writer(fw)

    # the 2nd input file will be processed per 1 line to save memory
    # the order of items from this file will be preserved
    with open("ngramstest.csv", "rb") as fr:
        for txt, abs, rel in csv.reader(fr):
            if txt in tmp:
                # not sure what you want to do with absolute, I use 0 here:
                writer.writerow((txt, 0, tmp[txt] / float(rel)))
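
The listing above is written for Python 2 (binary modes "rb" and "wb"). A possible Python 3 adaptation is sketched below. The function name write_ratios and the demo data are made up for illustration, the placeholder column is dropped, and the files live in a temporary directory.

```python
import csv
import os
import tempfile

def write_ratios(first_path, second_path, out_path):
    """Write (phrase, rel1 / rel2) for every phrase found in both files."""
    first = {}
    with open(first_path, newline="") as fr:
        for txt, _count, rel in csv.reader(fr):
            first[txt] = float(rel)
    # The second file is streamed row by row, so only the first is held in memory.
    with open(out_path, "w", newline="") as fw, open(second_path, newline="") as fr:
        writer = csv.writer(fw)
        for txt, _count, rel in csv.reader(fr):
            if txt in first:
                writer.writerow((txt, first[txt] / float(rel)))

# Demo with small made-up files in a temporary directory.
tmpdir = tempfile.mkdtemp()
path_a = os.path.join(tmpdir, "ngrams.csv")
path_b = os.path.join(tmpdir, "ngramstest.csv")
path_out = os.path.join(tmpdir, "matchedngrams.csv")
with open(path_a, "w", newline="") as f:
    f.write("and since that,6,4e-8\nthe state face,3,2e-8\n")
with open(path_b, "w", newline="") as f:
    f.write("the state face,1,1e-8\nsome silly ngram,9,1e-9\n")

write_ratios(path_a, path_b, path_out)
with open(path_out, newline="") as f:
    rows = list(csv.reader(f))
```

As in the original, the output keeps the row order of the second file.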

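Avoid storing the small numbers as they are, because they can run into underflow problems (see "What are arithmetic underflow and overflow in C?"), and dividing one small number by another will give you even more underflow problems. So preprocess your relative frequencies like this: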

>>> import math
>>> num = 1.435486010883783160220299732E-8
>>> logged = math.log(num)
>>> logged
-18.0591772685384
>>> math.exp(logged)
1.4354860108837844e-08
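
As an illustration of the kind of underflow being referred to, here is a small sketch. It uses deliberately extreme values rather than the frequencies from the question, and it assumes standard double-precision floats.

```python
import math

# Deliberately extreme values, not the frequencies from the question.
tiny = 1e-200

# The true product, 1e-400, is too small for a double and collapses to 0.0.
product = tiny * tiny

# The same quantity in log space stays an ordinary, finite number.
log_product = math.log(tiny) + math.log(tiny)
```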

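Now, to read the csv. Since you are only manipulating the relative frequencies, your second column doesn't matter, so let's skip it and store the first column (i.e. the phrase) as the key and the third column (i.e. the relative frequency) as the value: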

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))

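Now for the tricky part: you want to divide the relative frequency of each phrase in ngramdict2 by the relative frequency of the same phrase in ngramdict1, i.e.: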

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  relfreq = relfreq_from_ngramdict2 / relfreq_from_ngramdict1

Since we kept the relative frequencies in log units, we don't have to divide; we simply subtract, i.e.:

if phrase_from_ngramdict1 == phrase_from_ngramdict2:
  logrelfreq = logrelfreq_from_ngramdict2 - logrelfreq_from_ngramdict1
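
The subtraction relies on the identity log(b / a) = log(b) - log(a). A quick sketch that checks it on the two sample relative frequencies of "and since that", and shows how math.exp recovers the plain quotient if that is what you need in the end:

```python
import math

# Relative frequencies of "and since that" in the two sample files.
rel1 = 4.306458032651349480660899195E-8
rel2 = 2.1532290163256747e-08

log_ratio = math.log(rel2) - math.log(rel1)   # subtraction in log space
ratio = math.exp(log_ratio)                   # back to a plain quotient
```

Here ratio agrees with rel2 / rel1 up to floating-point rounding.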

To get the phrases that occur in both, you don't need to check the phrases one by one; simply cast dictionary.keys() to a set and then call set1.intersection(set2), see https://docs.python.org/2/tutorial/datastructures.html

phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

print overlap_phrases

[OUT]:

set(['drinks while strutting', 'the state face', 'and since that'])

So now let's write them out together with the relative frequencies:

with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

ngramcombined.csv looks like this:

drinks while strutting,-0.69314718056
the state face,-1.09861228867
and since that,-0.69314718056
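
Joining fields with "," works for the sample data. Should a phrase ever contain a comma, csv.writer would quote it, whereas a plain join would not. A small sketch of the difference, assuming Python 3 and using a made-up phrase with a comma in it:

```python
import csv
import io

# Hypothetical rows; the second phrase contains a comma on purpose.
rows = [("the state face", -1.09861228867), ("yes, and then", -0.69314718056)]

# Variant 1: manual join, as in the listing above.
joined = io.StringIO()
for phrase, value in rows:
    joined.write(",".join([phrase, str(value)]) + "\n")

# Variant 2: csv.writer, which quotes fields that contain the delimiter.
quoted = io.StringIO()
writer = csv.writer(quoted, lineterminator="\n")
for phrase, value in rows:
    writer.writerow([phrase, value])

# Reading both back shows how many fields each line ends up with.
joined_fields = [len(r) for r in csv.reader(io.StringIO(joined.getvalue()))]
quoted_fields = [len(r) for r in csv.reader(io.StringIO(quoted.getvalue()))]
```

With the manual join, the second line is read back as three fields instead of two.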

Here is the full code:

import csv, math

# Writes a dummy csv file as example.
textfile = """drinks while strutting, 4, 1.435486010883783160220299732E-8
and since that, 6, 4.306458032651349480660899195E-8
the state face, 3, 2.153229016325674740330449597E-8"""

textfile2 = """and since that, 3, 2.1532290163256747e-08
the state face, 1, 7.1774300544189156e-09
drinks while strutting, 2, 7.1774300544189156e-09
some silly ngram, 99, 1.235492312e-09"""

with open('ngrams-1.csv', 'w') as fout:
    for line in textfile.split('\n'):
        fout.write(line + '\n')

with open('ngrams-2.csv', 'w') as fout:
    for line in textfile2.split('\n'):
        fout.write(line + '\n')


# Read and save the two files into a dict structure

ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {}
ngramdict2 = {}

with open(ngramfile1, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict1[phrase] = math.log(float(rel))

with open(ngramfile2, 'r') as fin:
    reader = csv.reader(fin, delimiter=',')
    for row in reader:
        phrase, raw, rel = row
        ngramdict2[phrase] = math.log(float(rel))


# Find the intersecting phrases.
phrases1 = set(ngramdict1.keys())
phrases2 = set(ngramdict2.keys())
overlap_phrases = phrases1.intersection(phrases2)

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        relfreq1 = ngramdict1[p]
        relfreq2 = ngramdict2[p]
        combined_relfreq = relfreq2 - relfreq1
        fout.write(",".join([p, str(combined_relfreq)])+ '\n')

If you like SUPER UNREADABLE but short code (in terms of number of lines):

import csv, math
# Read and save the two files into a dict structure
ngramfile1 = 'ngrams-1.csv'
ngramfile2 = 'ngrams-2.csv'

ngramdict1 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile1, 'r'), delimiter=',')}
ngramdict2 = {row[0]:math.log(float(row[2])) for row in csv.reader(open(ngramfile2, 'r'), delimiter=',')}

# Find the intersecting phrases.
overlap_phrases = set(ngramdict1.keys()).intersection(set(ngramdict2.keys()))

# Output to new file.
with open('ngramcombined.csv', 'w') as fout:
    for p in overlap_phrases:
        fout.write(",".join([p, str(ngramdict2[p] - ngramdict1[p])])+ '\n')
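
A possible middle ground between the full and the compressed version is to move the repeated loading step into a helper. The sketch below assumes Python 3; the name load_log_relfreqs is made up for illustration, and in-memory strings stand in for ngrams-1.csv and ngrams-2.csv. With real files you would pass in a file object opened in a with block, so that it is closed afterwards.

```python
import csv
import io
import math

def load_log_relfreqs(fileobj):
    """Map each phrase (column 1) to the log of its relative frequency (column 3)."""
    return {row[0]: math.log(float(row[2])) for row in csv.reader(fileobj)}

# In-memory stand-ins for ngrams-1.csv and ngrams-2.csv (made-up numbers).
file1 = io.StringIO("and since that, 6, 4e-8\nthe state face, 3, 2e-8\n")
file2 = io.StringIO("and since that, 3, 2e-8\nsome silly ngram, 99, 1e-9\n")

ngramdict1 = load_log_relfreqs(file1)
ngramdict2 = load_log_relfreqs(file2)

# Log of (relative frequency in file 2 / relative frequency in file 1)
# for every phrase that occurs in both files.
log_ratios = {p: ngramdict2[p] - ngramdict1[p]
              for p in ngramdict1.keys() & ngramdict2.keys()}
```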
