
Python - order columns and compare rows in csv

I have a .csv file from a network capture. In this file I need to identify the repeated messages.

Sender A,Receiver G,43,Info...
Sender H,Receiver R,43,Info...
Sender A,Receiver G,27,Info...
Sender N,Receiver Z,43,Info...
Sender A,Receiver G,1367,Info...
Sender R,Receiver P,43,Info...
Sender A,Receiver G,43,Info...
Sender H,Receiver R,111,Info...

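The repeated parameter is the identifier, but a repeated identifier does not necessarily mean a repeated message. In that case I also need to check the sender and the receiver. I thought about sorting the file by its third column and then looping from top to bottom, comparing the values in those columns. I managed to isolate the lines of the file with repeated numbers, but here is the problem: first, I didn't manage to sort it properly; second, I don't know how to read a column (or, in my case, two) and compare it with the value on the line below at the same time. I think the idea would involve something like (if row[2] == row[2, next line], then check whether row[0] and row[1] == row[0 and 1, next line]), but after thinking about it for a long time I haven't managed to write anything decent that does the comparison.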

The idea is to print or save the cases where these 3 conditions (basically the first three columns) are repeated at the same time.

Sender A,Receiver G,43,Info...
Sender A,Receiver G,43,Info...

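Maybe I'm making it too complicated and there is an easier or quicker way to do it. Anyway, I'm posting my code; I'd appreciate it if someone could help. Regards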

entries = []
duplicated = []

with open('file.csv', 'rt') as my_file:
    for line in my_file:
        columns = line.strip().split(',')
        if columns[2] not in entries:
            entries.append(columns[2])
        else:
            duplicated.append(columns[2]) 

#List with duplicated=null->no error
if duplicated==[]:
    print "\nNo duplicated\n"

#Other case, there might be duplicates
else:
    #Store error cases in New.csv
    with open('New.csv', 'w') as out_file:
        with open('file.csv', 'r') as my_file:
            for line in my_file:
                columns = line.strip().split(',')
                if columns[2] in duplicated:
                    out_file.write(line)

#TO SORT THE EXCEL FILE. CURRENTLY NOT WORKING PROPERLY
##    data = csv.reader(open('Other.csv'),delimiter=',')
##    sortedlist = sorted(data, key=operator.itemgetter(2), reverse=True)
##    with open('Other.csv', 'w') as out_file:
##        for item in sortedlist:
##            out_file.write(item)

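There is really no need to sort your file, but your sort was probably off because strings order differently from numbers; strings are sorted lexicographically, which means that '10' sorts before '2': '1' comes earlier in the character set than '2', so the '0' never comes into play.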
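
To make the string-versus-number point concrete, here is a minimal Python 3 sketch. It assumes the identifier is always a plain integer in the third column; the variable names `by_text` and `by_number` are only illustrative.

```python
import csv

# A few rows from the question; in a real run they would come from file.csv.
sample = [
    "Sender A,Receiver G,43,Info...",
    "Sender A,Receiver G,1367,Info...",
    "Sender H,Receiver R,111,Info...",
    "Sender A,Receiver G,27,Info...",
]
rows = list(csv.reader(sample))

# Sorting on the raw field compares text character by character.
by_text = [row[2] for row in sorted(rows, key=lambda row: row[2])]

# Converting the field to int first gives numeric order.
by_number = [row[2] for row in sorted(rows, key=lambda row: int(row[2]))]

print(by_text)    # ['111', '1367', '27', '43']
print(by_number)  # ['27', '43', '111', '1367']
```

If a sorted file is still wanted, passing an `int` conversion as the sort key is probably the smallest change to the commented-out `sorted()` call in the question.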

You can track repeated sequences by storing them in a dictionary. This lets you look up matches you have seen before. Using collections.defaultdict() is easiest:

import csv
from collections import defaultdict

seen = defaultdict(list)

with open('file.csv', 'rb') as my_file:
    reader = csv.reader(my_file)
    for row in reader:
        key = (row[0], row[1], row[2])  # sender, receiver, id
        seen[key].append(row)

with open('new.csv', 'wb') as outf:
    writer = csv.writer(outf)
    for collected in seen.values():
        if len(collected) > 1:
            writer.writerows(collected)

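This version groups the rows of the input CSV by (sender, receiver, id) triple, then writes all the rows out again, but only for triples that have more than one row.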
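
The `'rb'` and `'wb'` file modes above are what the Python 2 `csv` module expects. Under Python 3 the same grouping might look like the sketch below, which opens the files in text mode with `newline=''`. The helper names `find_duplicates` and `write_duplicates` are hypothetical, and skipping rows with fewer than three fields is an assumption about the data.

```python
import csv
from collections import defaultdict


def find_duplicates(lines):
    """Return the rows whose (sender, receiver, id) triple occurs more than once."""
    seen = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # assumption: blank or short lines are not messages
        seen[tuple(row[:3])].append(row)
    # Flatten every group that holds more than one row.
    return [row for group in seen.values() if len(group) > 1 for row in group]


def write_duplicates(in_path, out_path):
    """Copy the duplicated rows of in_path to out_path; return how many were written."""
    # Python 3's csv module wants text mode with newline='' instead of 'rb'/'wb'.
    with open(in_path, newline='') as src:
        duplicates = find_duplicates(src)
    with open(out_path, 'w', newline='') as dst:
        csv.writer(dst).writerows(duplicates)
    return len(duplicates)


sample = [
    "Sender A,Receiver G,43,Info...",
    "Sender H,Receiver R,43,Info...",
    "Sender A,Receiver G,27,Info...",
    "Sender A,Receiver G,43,Info...",
]
print(find_duplicates(sample))
```

Because `find_duplicates` accepts any iterable of lines, it can be tried on a list of strings as above before pointing `write_duplicates` at real files.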

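You could also just keep counts, tallying in a dictionary how often each triple has been seen; a collections.Counter() makes that easy and also gives you ordering by frequency afterwards: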

import csv
from collections import Counter

with open('file.csv', 'rb') as my_file:
    reader = csv.reader(my_file)
    counts = Counter((r[0], r[1], r[2]) for r in reader)

with open('new.csv', 'wb') as outf:
    writer = csv.writer(outf)
    for (sender, receiver, id), count in counts.most_common():
        writer.writerow([sender, receiver, id, count])

Demo using your sample data:

>>> import csv
>>> from collections import defaultdict
>>> sample = '''\
... Sender A,Receiver G,43,Info...
... Sender H,Receiver R,43,Info...
... Sender A,Receiver G,27,Info...
... Sender N,Receiver Z,43,Info...
... Sender A,Receiver G,1367,Info...
... Sender R,Receiver P,43,Info...
... Sender A,Receiver G,43,Info...
... Sender H,Receiver R,111,Info...
... '''.splitlines(True)
>>> seen = defaultdict(list)
>>> reader = csv.reader(sample)
>>> for row in reader:
...     key = (row[0], row[1], row[2])  # sender, receiver, id
...     seen[key].append(row)
... 
>>> import sys
>>> writer = csv.writer(sys.stdout)
>>> for collected in seen.values():
...     if len(collected) > 1:
...         writer.writerows(collected)
... 
Sender A,Receiver G,43,Info...
Sender A,Receiver G,43,Info...

The Counter approach:

>>> from collections import Counter
>>> reader = csv.reader(sample)
>>> counts = Counter((r[0], r[1], r[2]) for r in reader)
>>> writer = csv.writer(sys.stdout)
>>> for (sender, receiver, id), count in counts.most_common():
...     writer.writerow([sender, receiver, id, count])
... 
Sender A,Receiver G,43,2
Sender A,Receiver G,1367,1
Sender A,Receiver G,27,1
Sender N,Receiver Z,43,1
Sender H,Receiver R,111,1
Sender H,Receiver R,43,1
Sender R,Receiver P,43,1
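
The Counter output above lists every triple, including those seen only once. If only the repeated ones matter, the counts can be filtered afterwards. A minimal Python 3 sketch, where the name `repeated` is illustrative:

```python
import csv
from collections import Counter

sample = [
    "Sender A,Receiver G,43,Info...",
    "Sender H,Receiver R,43,Info...",
    "Sender A,Receiver G,27,Info...",
    "Sender N,Receiver Z,43,Info...",
    "Sender A,Receiver G,1367,Info...",
    "Sender R,Receiver P,43,Info...",
    "Sender A,Receiver G,43,Info...",
    "Sender H,Receiver R,111,Info...",
]

# Count each (sender, receiver, id) triple.
counts = Counter(tuple(row[:3]) for row in csv.reader(sample))

# Keep only the triples that were seen more than once.
repeated = {key: n for key, n in counts.items() if n > 1}

print(repeated)  # {('Sender A', 'Receiver G', '43'): 2}
```

Note that this keeps the triples and their counts but not the full rows; the defaultdict version is the one to use when the complete lines need to be written out.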

Martijn Pieters showed you a very good solution in "pure" Python.
I'll show you something different: an example with the pandas module.

(I use StringIO to simulate reading a file.)

data = """Sender A,Receiver G,43,Info...
Sender H,Receiver R,43,Info...
Sender A,Receiver G,27,Info...
Sender N,Receiver Z,43,Info...
Sender A,Receiver G,1367,Info...
Sender R,Receiver P,43,Info...
Sender A,Receiver G,43,Info...
Sender H,Receiver R,111,Info..."""

import pandas as pd
from StringIO import StringIO 

# read all file
df = pd.read_csv(StringIO(data), index_col=None, header=None)

print df

# group rows by values in columns 0, 1, 2
for name, group in df.groupby([0,1,2]):
    print '\n', '-'*40, '\n'
    print 'name:', name
    print 'len:', len(group)
    print
    print group

    if len(group) > 1:
        # append (`mode='a'`) data to `results.csv`
        group.to_csv('results.csv', mode='a', header=False, index=False)
        #group.to_csv('results.csv', mode='a', header=False)

I use pd.read_csv() to read the whole file.
(I assume the file has no header row, hence header=None,
and I don't use any column as the row index, hence index_col=None.)

Then I group the rows by the values in columns 0, 1, 2 (and print each group).
If a group has more than one element, I append it to the file 'results.csv'.
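
The code above is written for Python 2 (print statements and the `StringIO` module). As a hedged alternative for Python 3, the repeated rows can also be selected without an explicit loop by using `DataFrame.duplicated()` with `keep=False`, which flags every member of a repeated group. This swaps the `groupby` loop for a boolean mask, so it is a sketch of a different route to the same rows, not a transcription of the code above; the names `mask` and `duplicates` are illustrative.

```python
import io

import pandas as pd

data = """Sender A,Receiver G,43,Info...
Sender H,Receiver R,43,Info...
Sender A,Receiver G,27,Info...
Sender N,Receiver Z,43,Info...
Sender A,Receiver G,1367,Info...
Sender R,Receiver P,43,Info...
Sender A,Receiver G,43,Info...
Sender H,Receiver R,111,Info..."""

# io.StringIO stands in for a real file, as in the code above.
df = pd.read_csv(io.StringIO(data), header=None)

# keep=False marks every row of a repeated (0, 1, 2) triple, not only the later copies.
mask = df.duplicated(subset=[0, 1, 2], keep=False)
duplicates = df[mask]

print(duplicates.index.tolist())  # [0, 6]

# Writing once (instead of appending per group) avoids growing the file on re-runs:
# duplicates.to_csv('results.csv', header=False, index=False)
```

The selected rows stay in their original file order, whereas the `groupby` loop emits them ordered by group key.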

I get the file

Sender A,Receiver G,43,Info...
Sender A,Receiver G,43,Info...

or, if I don't use index=False in to_csv(), I also get the row numbers (the index)

0,Sender A,Receiver G,43,Info...
6,Sender A,Receiver G,43,Info...

This is what gets printed on screen

          0           1     2        3
0  Sender A  Receiver G    43  Info...
1  Sender H  Receiver R    43  Info...
2  Sender A  Receiver G    27  Info...
3  Sender N  Receiver Z    43  Info...
4  Sender A  Receiver G  1367  Info...
5  Sender R  Receiver P    43  Info...
6  Sender A  Receiver G    43  Info...
7  Sender H  Receiver R   111  Info...

---------------------------------------- 

name: ('Sender A', 'Receiver G', 27)
len: 1

          0           1   2        3
2  Sender A  Receiver G  27  Info...

---------------------------------------- 

name: ('Sender A', 'Receiver G', 43)
len: 2

          0           1   2        3
0  Sender A  Receiver G  43  Info...
6  Sender A  Receiver G  43  Info...

---------------------------------------- 

name: ('Sender A', 'Receiver G', 1367)
len: 1

          0           1     2        3
4  Sender A  Receiver G  1367  Info...

---------------------------------------- 

name: ('Sender H', 'Receiver R', 43)
len: 1

          0           1   2        3
1  Sender H  Receiver R  43  Info...

---------------------------------------- 

name: ('Sender H', 'Receiver R', 111)
len: 1

          0           1    2        3
7  Sender H  Receiver R  111  Info...

---------------------------------------- 

name: ('Sender N', 'Receiver Z', 43)
len: 1

          0           1   2        3
3  Sender N  Receiver Z  43  Info...

---------------------------------------- 

name: ('Sender R', 'Receiver P', 43)
len: 1

          0           1   2        3
5  Sender R  Receiver P  43  Info...

