
Import MongoDB to CSV - removing duplicates

I am importing data from Mongo into a CSV file. The import consists of the "timestamp" and "text" fields of each JSON document.

The documents:

{ 
name: ..., 
size: ..., 
timestamp: ISODate("2013-01-09T21:04:12Z"), 
data: { text:..., place:...},
other: ...
}

The code:

with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))

I would like to remove duplicates (some Mongo docs have the same text), and I would like to keep the first instance (with regard to time) intact. Is it possible to remove these dupes as I import?

Thanks for your help!

I would use a set to store the hashes of the data, and check for duplicates. Something like this:

import hashlib

hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        # Store a fixed-size digest of the text rather than the text itself.
        digest = hashlib.md5(r['text'].encode('utf-8')).digest()
        if digest in hashes:
            # It's a duplicate!
            continue
        hashes.add(digest)
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))

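It's worth noting that you could store the text field itself in the set, but for larger text fields storing just the hash is much more memory efficient, since an MD5 digest is always 16 bytes. Also note that this keeps whichever duplicate the cursor returns first, so sort the query by timestamp if the earliest instance must be the one kept.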
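For reference, here is a minimal Python 3 sketch of the same hash-set idea. It uses hashlib and the csv module so that quotes and commas inside the text are escaped. The function name write_unique and the hard-coded records are assumptions made for illustration; in practice the records would come from the Mongo cursor, and this sketch assumes they arrive oldest first.

```python
import csv
import hashlib
import io
from datetime import datetime


def write_unique(records, fp):
    """Write (text, time) rows to fp, skipping texts already seen."""
    seen = set()  # MD5 digests of the texts written so far
    writer = csv.writer(fp, quoting=csv.QUOTE_ALL)
    for r in records:
        digest = hashlib.md5(r['text'].encode('utf-8')).digest()
        if digest in seen:
            continue  # duplicate text: only the first occurrence is kept
        seen.add(digest)
        writer.writerow([r['text'], r['timestamp'].strftime('%H:%M:%S')])


# Hypothetical records standing in for the documents returned by the cursor.
records = [
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 21, 4, 12)},
    {'text': 'world', 'timestamp': datetime(2013, 1, 9, 21, 5, 0)},
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 21, 6, 30)},
]
out = io.StringIO()
write_unique(records, out)
result = out.getvalue()
```

Here the third record is dropped because its text matches the first one. To write to a real file instead of the in-memory buffer, open it with newline='' as the csv module documentation recommends.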

You just need a map (dictionary) that holds (text, timestamp) pairs. The 'text' is the key, so there won't be any duplicates. I will assume the order of reading is not guaranteed to return the oldest timestamp first. In that case you will have to make two passes: one for reading and a later one for writing.

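textmap = {}

def insert(text, ts):
    # Keep the earliest timestamp seen for each text.
    if text in textmap:
        textmap[text] = min(ts, textmap[text])
    else:
        textmap[text] = ts

# First pass: read every document and record the earliest timestamp per text.
for r in db.hello.find(fields=['text', 'timestamp']):
    insert(r['text'], r['timestamp'])

# Second pass: write one line per unique text.
with open(output, 'w') as fp:
    for text in textmap:
        print >>fp, text, textmap[text]  # with whatever format desired.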

At the end, you can also easily convert the dictionary into a list of tuples, for example if you want to sort the results by timestamp before printing.
(See "Sort a Python dictionary by value".)
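As a rough illustration of that last step, the following Python 3 sketch builds the dictionary and then sorts its items by timestamp. The helper name earliest_by_text and the sample pairs are hypothetical; they stand in for the values read from the collection.

```python
from datetime import datetime


def earliest_by_text(pairs):
    """Map each text to the earliest timestamp seen for it."""
    textmap = {}
    for text, ts in pairs:
        if text not in textmap or ts < textmap[text]:
            textmap[text] = ts
    return textmap


# Hypothetical (text, timestamp) pairs, deliberately not in time order.
pairs = [
    ('b', datetime(2013, 1, 9, 21, 6, 30)),
    ('a', datetime(2013, 1, 9, 21, 5, 0)),
    ('b', datetime(2013, 1, 9, 21, 4, 12)),
]
textmap = earliest_by_text(pairs)

# dict.items() yields (text, timestamp) tuples; sort them by the timestamp.
rows = sorted(textmap.items(), key=lambda item: item[1])
```

Each entry of rows is a (text, timestamp) tuple, so it can be formatted and written one line at a time in chronological order.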
