I am importing data from MongoDB into a CSV file. The import consists of the "timestamp" and "text" fields of each JSON document.
The documents:
{
    name: ...,
    size: ...,
    timestamp: ISODate("2013-01-09T21:04:12Z"),
    data: { text:..., place:...},
    other: ...
}
The code:
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
I would like to remove duplicates (some Mongo docs have the same text) while keeping the first instance (the earliest by timestamp) intact. Is it possible to remove these dupes as I import?
Thanks for your help!
I would use a set to store the hashes of the data, and check for duplicates. Something like this:
import hashlib  # replaces the deprecated md5 module

hashes = set()
with open(output, 'w') as fp:
    for r in db.hello.find(fields=['text', 'timestamp']):
        digest = hashlib.md5(r['text']).digest()
        if digest in hashes:
            # It's a duplicate!
            continue
        else:
            hashes.add(digest)
            print >>fp, '"%s","%s"' % (r['text'], r['timestamp'].strftime('%H:%M:%S'))
It's worth noting that you could use the text field directly, but for larger text fields storing just the hash (an MD5 digest is 16 bytes) is much more memory efficient.
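The question asks to keep the earliest instance, and a "first seen wins" loop only does that if the documents arrive in timestamp order. Below is a minimal Python 3 sketch of the same hash-set idea with an explicit sort. It assumes the records are plain dicts with `text` and `timestamp` keys (an in-memory stand-in for the query results). The helper names `dedupe_earliest` and `write_csv` and the sample data are made up for illustration.

```python
import csv
import hashlib
import io
from datetime import datetime


def dedupe_earliest(records):
    """Yield one record per distinct text, keeping the earliest timestamp.

    `records` is assumed to be an iterable of dicts with 'text' and
    'timestamp' keys (a stand-in for the documents returned by the query).
    """
    seen = set()
    # Sorting by timestamp makes "first seen" mean "earliest".
    for rec in sorted(records, key=lambda r: r['timestamp']):
        digest = hashlib.md5(rec['text'].encode('utf-8')).digest()
        if digest in seen:
            continue  # duplicate text; an earlier copy was already emitted
        seen.add(digest)
        yield rec


def write_csv(records, fp):
    """Write text and HH:MM:SS columns, quoting every field."""
    writer = csv.writer(fp, quoting=csv.QUOTE_ALL, lineterminator='\n')
    for rec in dedupe_earliest(records):
        writer.writerow([rec['text'], rec['timestamp'].strftime('%H:%M:%S')])


# Made-up sample data, for illustration only.
sample = [
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 21, 4, 12)},
    {'text': 'world', 'timestamp': datetime(2013, 1, 9, 20, 0, 0)},
    {'text': 'hello', 'timestamp': datetime(2013, 1, 9, 19, 30, 0)},
]
buf = io.StringIO()
write_csv(sample, buf)
```

Using the `csv` module instead of a hand-built format string also escapes any double quotes inside the text. With a live pymongo cursor, the sort could probably be done on the server instead (for example by calling `.sort('timestamp', 1)` on the cursor), which avoids loading everything into memory first.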
You just need a map (dictionary) of (text, timestamp) pairs. The text is the key, so there won't be any duplicates. I will assume the read order is not guaranteed to return the oldest timestamp first. In that case you will have to make two passes: one for reading and a later one for writing.
textmap = {}

def insert(text, ts):
    # Keep the earliest timestamp seen for each text.
    if text in textmap:
        textmap[text] = min(ts, textmap[text])
    else:
        textmap[text] = ts

# First pass: read every document.
for r in db.hello.find(fields=['text', 'timestamp']):
    insert(r['text'], r['timestamp'])

# Second pass: write one line per unique text.
with open(output, 'w') as fp:
    for text in textmap:
        print >>fp, text, textmap[text]  # with whatever format desired.
At the end, you can also easily convert the dictionary into a list of tuples, for example if you want to sort the results by timestamp before printing.
(See "Sort a Python dictionary by value".)
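The two-pass idea and the final sort can be sketched in Python 3 roughly as follows. This assumes in-memory `(text, timestamp)` pairs rather than a live cursor. The function name `earliest_by_text` and the sample data are hypothetical.

```python
from datetime import datetime


def earliest_by_text(pairs):
    """Map each text to the earliest timestamp seen for it."""
    textmap = {}
    for text, ts in pairs:
        if text not in textmap or ts < textmap[text]:
            textmap[text] = ts
    return textmap


# Made-up sample data, for illustration only.
pairs = [
    ('b', datetime(2013, 1, 9, 21, 0, 0)),
    ('a', datetime(2013, 1, 9, 22, 0, 0)),
    ('b', datetime(2013, 1, 9, 20, 0, 0)),
]
textmap = earliest_by_text(pairs)

# Convert the dict to a list of (text, timestamp) tuples sorted by timestamp.
rows = sorted(textmap.items(), key=lambda item: item[1])
```

Here `rows` holds one tuple per unique text, ordered from oldest to newest, ready to be written out in whatever format is needed.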