Importing MongoDB into CSV - removing duplicates

Question

I am importing data from Mongo into a CSV file. The import consists of the "timestamp" and "text" for each JSON document.

The documents:

{ 
name: ..., 
size: ..., 
timestamp: ISODate("2013-01-09T21:04:12Z"), 
data: { text:..., place:...},
other: ...
}

The code:

with open(output, 'w') as fp:
   for r in db.hello.find(fields=['text', 'timestamp']):
       print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))

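I would like to remove duplicates (some Mongo documents have the same text), and I would like to keep the first instance (with regard to time) intact. Is it possible to remove these dupes as I import?

Thanks for your help!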

Answer 1

I would use a set to store hashes of the data and check it for duplicates. Something like this:

import hashlib

hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        # Hash the text so only a fixed-size digest is kept in memory.
        digest = hashlib.md5(r['text'].encode('utf-8')).digest()
        if digest in hashes:
            # It's a duplicate!
            continue
        hashes.add(digest)
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))

It's worth noting that you could use the text field directly, but for larger text fields, storing only the hash is much more memory efficient.
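As a minimal sketch of the hashing approach in Python 3, the snippet below uses plain dicts as stand-ins for the Mongo documents; the `docs` list and the `dedupe_by_hash` name are hypothetical and not part of the original code. It sorts by timestamp first, on the assumption that the instance to keep is the earliest one. With a real collection the same ordering could presumably be requested from the query instead.

```python
import hashlib
from datetime import datetime

def dedupe_by_hash(docs):
    """Yield documents whose text has not been seen before, earliest first."""
    seen = set()
    # Sort by timestamp so the first occurrence of each text is the oldest.
    for doc in sorted(docs, key=lambda d: d['timestamp']):
        digest = hashlib.md5(doc['text'].encode('utf-8')).digest()
        if digest in seen:
            continue  # duplicate text, skip it
        seen.add(digest)
        yield doc

# Hypothetical sample data standing in for db.hello.find(...)
docs = [
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 21, 4, 12)},
    {'text': 'world', 'timestamp': datetime(2013, 1, 9, 20, 0, 0)},
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 19, 30, 0)},
]

unique = list(dedupe_by_hash(docs))
rows = ['"%s","%s"' % (d['text'], d['timestamp'].strftime('%H:%M:%S'))
        for d in unique]
```

Here the later "hello" document is dropped and the earlier one survives. Only the 16-byte digests are held in the set, which is the memory saving the answer refers to.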

Answer 2

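You just need to maintain a map (dictionary) of (text, timestamp) pairs. The "text" is the key, so there won't be any duplicates. I will assume the read order is not guaranteed to return the oldest timestamp first. In that case you will have to make 2 passes: one for reading and a later one for writing.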

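textmap = {}

def insert(text, ts):
    # Keep the earliest timestamp seen for each text.
    if text in textmap:
        textmap[text] = min(ts, textmap[text])
    else:
        textmap[text] = ts

# First pass: read every document and record the earliest timestamp per text.
for r in db.hello.find(fields=['text', 'timestamp']):
    insert(r['text'], r['timestamp'])

# Second pass: write one line per unique text.
with open(output, 'w') as fp:
    for text in textmap:
        print >>fp, text, textmap[text]  # with whatever format desired.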

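Finally, you can also easily convert the dictionary into a list of tuples, in case you want to sort the results by timestamp before printing.
(See "Sort a Python dictionary by value".)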
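The two-pass idea, including the optional sort by timestamp, might look like the following in Python 3. The sample `docs` list, the `earliest_by_text` helper, and the use of `csv.writer` with an in-memory buffer are illustrative assumptions; the original answer writes to `fp` with `print`.

```python
import csv
import io
from datetime import datetime

def earliest_by_text(docs):
    """First pass: map each text to its earliest timestamp."""
    textmap = {}
    for doc in docs:
        text, ts = doc['text'], doc['timestamp']
        if text not in textmap or ts < textmap[text]:
            textmap[text] = ts
    return textmap

# Hypothetical sample data standing in for db.hello.find(...)
docs = [
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 21, 4, 12)},
    {'text': 'world', 'timestamp': datetime(2013, 1, 9, 20, 0, 0)},
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 19, 30, 0)},
]

textmap = earliest_by_text(docs)

# Convert the dict to a list of (text, timestamp) tuples sorted by timestamp.
pairs = sorted(textmap.items(), key=lambda item: item[1])

# Second pass: write one quoted CSV row per unique text.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator='\n')
for text, ts in pairs:
    writer.writerow([text, ts.strftime('%H:%M:%S')])
csv_text = buf.getvalue()
```

Using `csv.writer` rather than manual string formatting also handles text that itself contains quotes or commas, which the `'"%s","%s"'` pattern would not escape.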

Answer 1
3 2013-01-10 18:26:04

Answer 2
1 accepted 2013-01-10 19:32:45
