
Storing MongoDB ObjectIDs in Pandas

Once I've retrieved data from MongoDB and loaded it into a Pandas dataframe, what is the recommended practice for storing hexadecimal ObjectIDs?

I presume that, stored as strings, they take a lot of memory, which can be limiting in very large datasets. Is it a good idea to convert them to integers (from hex to dec)? Wouldn't that decrease memory usage and speed up processing (merges, lookups...)?

And BTW, here's how I'm doing it. Is this the best way? It unfortunately fails with NaN.

tank_hist['id'] = pd.to_numeric(tank_hist['id'].apply(lambda x: int(str(x), base=16)))

First of all, I think the NaN problem arises because ObjectIDs are 12 bytes (96 bits) long, which is bigger than a 64-bit integer. Python's own integers can handle that, but the underlying pandas/NumPy integer types may not.
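To make the size argument concrete, here is a minimal sketch in plain Python. The ObjectID string is a made-up example value, not taken from the question's data.

```python
# A made-up 24-character hex ObjectID string (12 bytes = 96 bits).
oid_hex = "507f1f77bcf86cd799439011"

# Python integers have arbitrary precision, so this conversion succeeds.
as_int = int(oid_hex, 16)

# Largest value a signed 64-bit integer can hold.
int64_max = 2**63 - 1

print(as_int.bit_length())   # bits needed to represent the value
print(as_int > int64_max)    # whether it overflows a signed 64-bit integer
```

Separately, a missing value in the column would make `int(str(x), base=16)` raise a `ValueError`, because the string "nan" is not valid hexadecimal.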

I think you want to use a map to extract some useful fields that you could later do multi-level sorting on. I'm not sure you'll see the performance improvement you're expecting, though.

I'd start by adding new "oid_*" series to your frame and checking your results.

https://docs.mongodb.com/manual/reference/method/ObjectId/

breaks down the ObjectID into these components:

  • timestamp (4 bytes),
  • random (5 bytes - used to be host identifier), and
  • counter (3 bytes)

Each of these components fits comfortably in a 64-bit integer, so NumPy can deal with them.

tank_hist['oid_timestamp'] = tank_hist['id'].map(lambda x: int(str(x)[:8], 16))
tank_hist['oid_random'] = tank_hist['id'].map(lambda x: int(str(x)[8:18], 16))
tank_hist['oid_counter'] = tank_hist['id'].map(lambda x: int(str(x)[18:], 16))
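If the column may contain missing values, the same split can be sketched with a small helper that skips them. This is only an illustration: `split_oid` is a hypothetical helper name, the sample IDs are made up, and the nullable "Int64" dtype is one possible choice for keeping integers alongside missing entries.

```python
import pandas as pd

def split_oid(value):
    """Split a 24-char hex ObjectID into (timestamp, random, counter) ints.

    Missing values yield (pd.NA, pd.NA, pd.NA) instead of raising.
    """
    if pd.isna(value):
        return (pd.NA, pd.NA, pd.NA)
    h = str(value)  # also works for bson ObjectId objects, whose str() is the hex form
    return (int(h[:8], 16), int(h[8:18], 16), int(h[18:], 16))

# Made-up sample frame; the middle row has a missing id.
tank_hist = pd.DataFrame(
    {"id": ["507f1f77bcf86cd799439011", None, "507f191e810c19729de860ea"]}
)

# Build the three component columns in one pass over the series.
parts = pd.DataFrame(
    tank_hist["id"].map(split_oid).tolist(),
    columns=["oid_timestamp", "oid_random", "oid_counter"],
    index=tank_hist.index,
).astype("Int64")  # nullable integer dtype keeps ints and <NA> together

tank_hist = tank_hist.join(parts)
print(tank_hist)
```

The timestamp component is a count of seconds since the Unix epoch, so it can also be turned into a datetime column if that is more useful than a raw integer.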

This would allow you to sort primarily on the timestamp series, secondarily on some other series in the frame, and then thirdly on the counter.

Maps are a super helpful (though slow) way to poke every record in your series. Realize that you are adding compute time here in exchange for saving compute time later.
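A three-level sort like that could look roughly as follows. The frame and the "tank" column are hypothetical stand-ins for whatever other series you would sort on.

```python
import pandas as pd

# Made-up data: "tank" stands in for some other series in the frame.
df = pd.DataFrame({
    "oid_timestamp": [200, 100, 100, 100],
    "tank": ["b", "b", "a", "a"],
    "oid_counter": [1, 5, 9, 2],
})

# Sort by timestamp first, then by the other series, then by counter.
ordered = df.sort_values(["oid_timestamp", "tank", "oid_counter"])
print(ordered)
```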

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html
