
Import MongoDB to CSV - removing duplicates

I am importing data from Mongo into a CSV file. The import consists of the "timestamp" and "text" of each JSON document.

The documents:

{ 
name: ..., 
size: ..., 
timestamp: ISODate("2013-01-09T21:04:12Z"), 
data: { text:..., place:...},
other: ...
}

The code:

with open(output, 'w') as fp:
   for r in db.hello.find(fields=['text', 'timestamp']):
       print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))

I would like to remove duplicates (some Mongo docs have the same text), keeping the first instance (the earliest by time) intact. Is it possible to remove these dupes as I import?

Thanks for your help!

I would use a set to store hashes of the text, and check each document against it for duplicates. Something like this:

import hashlib

hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        # Store a fixed-size digest rather than the full text.
        digest = hashlib.md5(r['text'].encode('utf-8')).digest()
        if digest in hashes:
            # It's a duplicate!
            continue
        hashes.add(digest)
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))

It's worth noting that you could store the text field itself in the set, but for larger text fields storing just the hash is much more memory efficient.
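Note that a hash set keeps whichever copy is read first, which is the earliest one only if the documents arrive in time order. Here is a minimal Python 3 sketch of the same idea that sorts by timestamp before filtering. The `sample` list is hypothetical data standing in for the Mongo cursor, and `dedupe_by_text` and `write_csv` are illustrative names, not part of any library.

```python
import csv
import hashlib
import io
from datetime import datetime


def dedupe_by_text(docs):
    """Yield the earliest document for each distinct 'text' value."""
    seen = set()  # 16-byte MD5 digests instead of the full text strings
    # Sorting by timestamp first means the earliest copy is the one kept.
    for doc in sorted(docs, key=lambda d: d["timestamp"]):
        digest = hashlib.md5(doc["text"].encode("utf-8")).digest()
        if digest in seen:
            continue  # duplicate text, skip it
        seen.add(digest)
        yield doc


def write_csv(docs, fp):
    """Write (text, HH:MM:SS) rows; csv.writer handles quoting and escaping."""
    writer = csv.writer(fp, quoting=csv.QUOTE_ALL)
    for doc in docs:
        writer.writerow([doc["text"], doc["timestamp"].strftime("%H:%M:%S")])


# Hypothetical sample data standing in for the documents read from Mongo.
sample = [
    {"text": "hello", "timestamp": datetime(2013, 1, 9, 21, 4, 12)},
    {"text": "world", "timestamp": datetime(2013, 1, 9, 21, 5, 0)},
    {"text": "hello", "timestamp": datetime(2013, 1, 9, 20, 0, 0)},
]

buffer = io.StringIO()
write_csv(dedupe_by_text(sample), buffer)
print(buffer.getvalue())
```

Sorting in memory like this assumes the result set fits in RAM. With a real cursor you would probably ask the server to sort instead (pymongo cursors have a `sort` method), and the rest of the loop stays the same.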

You just need a map (dictionary) of (text, timestamp) pairs. The 'text' is the key, so there won't be any duplicates. I will assume the read order is not guaranteed to return the oldest timestamp first. In that case you will have to make two passes: one for reading and a later one for writing.

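textmap = {}

def insert(text, ts):
    # Keep the earliest timestamp seen for each text.
    if text in textmap:
        textmap[text] = min(ts, textmap[text])
    else:
        textmap[text] = ts

# First pass: read every document and record the earliest timestamp per text.
for r in db.hello.find(fields=['text', 'timestamp']):
    insert(r['text'], r['timestamp'])

# Second pass: write one line per unique text.
with open(output, 'w') as fp:
    for text in textmap:
        print >>fp, text, textmap[text]  # with whatever format desired.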

At the end, you can also easily convert the dictionary into a list of tuples, in case you want to sort the results by timestamp before printing, for example.
(See Sort a Python dictionary by value)
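As a rough illustration of that last step, here is a Python 3 sketch that builds the dictionary, turns it into a list of tuples sorted by timestamp, and writes quoted CSV rows. The `pairs` list is made-up data standing in for the values read from Mongo, and `earliest_by_text` is just an illustrative helper name.

```python
import csv
import io
from datetime import datetime


def earliest_by_text(pairs):
    """Map each text to the smallest timestamp seen for it."""
    earliest = {}
    for text, ts in pairs:
        if text not in earliest or ts < earliest[text]:
            earliest[text] = ts
    return earliest


# Hypothetical (text, timestamp) pairs, deliberately out of time order.
pairs = [
    ("b", datetime(2013, 1, 9, 21, 5, 0)),
    ("a", datetime(2013, 1, 9, 21, 4, 12)),
    ("b", datetime(2013, 1, 9, 20, 0, 0)),
]

earliest = earliest_by_text(pairs)
# items() yields (text, timestamp) tuples; sort them by the timestamp value.
rows = sorted(earliest.items(), key=lambda item: item[1])

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_ALL)
for text, ts in rows:
    writer.writerow([text, ts.strftime("%H:%M:%S")])
print(out.getvalue())
```

Unlike the hash-set approach, this keeps every distinct text in memory, so it trades memory for not depending on the order in which documents are read.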


 