
Tabulate deeply nested MongoDB collection using PyMongo

I am querying a collection using PyMongo:

import pymongo

client = pymongo.MongoClient('0.0.0.0', 27017)
db = client.documents
collection = db.collections
test_data = collection.find_one({'metadata.encodingStage.terms.data.line.data.account.shortDescription': {'$exists': True}}, 
{'metadata.encodingStage.terms.data.line.data.account.shortDescription': 1})

I am using find_one here for illustration, but in practice this is a find query across the whole collection.

This gives the following output:

{'_id': ObjectId('5a2fb9371de46756df51f37b'),
 'metadata': {'encodingStage': {'terms': {'data':
    {'line': [{'data': {'account': {'shortDescription': ['123456']}}},
              {'data': {'account': {'shortDescription': ['7890123']}}}]}}}}}

However, I would like the data in tabular format, as per SQL or Pandas:

                               _id    shortDescription
-------------------------------------------------------
ObjectId('5a2fb9371de46756df51f37b')            123456
ObjectId('5a2fb9371de46756df51f37b')           7890123

I understand how to do this in Python, looping over the results, but for computational efficiency, I would like more of the tabulation to happen in Mongo.
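For reference, the client-side looping approach mentioned above might look roughly like the sketch below. The function name flatten_short_descriptions is hypothetical, and the sample document uses a plain string in place of the ObjectId so that the snippet runs without a MongoDB connection.

```python
def flatten_short_descriptions(docs):
    """Yield one {'_id', 'shortDescription'} row per leaf value."""
    for doc in docs:
        # Walk down to the 'line' array, tolerating missing keys.
        lines = (doc.get('metadata', {})
                    .get('encodingStage', {})
                    .get('terms', {})
                    .get('data', {})
                    .get('line', []))
        for line in lines:
            descriptions = (line.get('data', {})
                                .get('account', {})
                                .get('shortDescription', []))
            for description in descriptions:
                yield {'_id': doc['_id'], 'shortDescription': description}


# Sample shaped like the find_one output; the string _id is a stand-in
# for the real ObjectId (assumption, to avoid needing bson here).
sample = {
    '_id': '5a2fb9371de46756df51f37b',
    'metadata': {'encodingStage': {'terms': {'data': {'line': [
        {'data': {'account': {'shortDescription': ['123456']}}},
        {'data': {'account': {'shortDescription': ['7890123']}}},
    ]}}}},
}

rows = list(flatten_short_descriptions([sample]))
```

With a real cursor, the same generator could be fed the result of collection.find(...), but every document is then transferred in full and flattened in Python, which is the cost I would like to avoid.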

Is there a simple way to use pymongo to output the results as {'_id': 'XXX', 'shortDescription': 'XXX'} pairs which can be efficiently tabulated?

Unwind aggregation?

I have attempted to do this as an $unwind aggregation:

unwind = collection.aggregate([{'$unwind': '$metadata.encodingStage.terms.data.line.data.account.shortDescription'}])

...but this returns no data.

Solution

The first problem with my logic was that line is itself an array, so it needs to be unwound first, before the shortDescription array nested inside each of its elements.

Combining that with two $project steps and a final $unwind on the leaf array flattens out the data to give (_id, shortDescription) pairs that can be quickly transformed to a pandas DataFrame:

# `collection` is the handle defined in the question (db.collections)
collection.aggregate([
    {"$project": {"line": "$metadata.encodingStage.terms.data.line"}},
    {"$unwind": "$line"},
    {"$project": {"shortDescription": "$line.data.account.shortDescription"}},
    {"$unwind": "$shortDescription"}
])

Output:

[{'_id': ObjectId('xxxxxxxxxxxxxxx123'), 'shortDescription': '12340000'},
 {'_id': ObjectId('xxxxxxxxxxxxxxx123'), 'shortDescription': '43210000'},
 {'_id': ObjectId('yyyyyyyyyyyyyyy789'), 'shortDescription': '56780000'},
 {'_id': ObjectId('yyyyyyyyyyyyyyy789'), 'shortDescription': '78920000'},
 {'_id': ObjectId('yyyyyyyyyyyyyyy789'), 'shortDescription': '55550000'}]
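To see why both $unwind stages are needed, the four stages can be mimicked in plain Python. This is only a simplified emulation for illustration, not how MongoDB implements them: the helpers project and unwind are hypothetical, project only follows nested dicts, and the sample _id is a plain string rather than an ObjectId.

```python
def project(docs, new_field, path):
    """Roughly mimic {'$project': {new_field: '$' + path}} for nested dicts."""
    out = []
    for doc in docs:
        value = doc
        for key in path.split('.'):
            value = value.get(key) if isinstance(value, dict) else None
        row = {'_id': doc['_id']}  # _id is kept by default
        if value is not None:
            row[new_field] = value
        out.append(row)
    return out


def unwind(docs, field):
    """Roughly mimic {'$unwind': '$' + field} on a top-level field."""
    out = []
    for doc in docs:
        value = doc.get(field)
        if value is None or value == []:
            continue  # missing, null, and empty arrays produce no output
        # A non-array value is treated as a single-element array.
        items = value if isinstance(value, list) else [value]
        for item in items:
            out.append({**doc, field: item})
    return out


sample = {
    '_id': '5a2fb9371de46756df51f37b',  # stand-in for the ObjectId
    'metadata': {'encodingStage': {'terms': {'data': {'line': [
        {'data': {'account': {'shortDescription': ['123456']}}},
        {'data': {'account': {'shortDescription': ['7890123']}}},
    ]}}}},
}

stage1 = project([sample], 'line', 'metadata.encodingStage.terms.data.line')
stage2 = unwind(stage1, 'line')            # one document per line element
stage3 = project(stage2, 'shortDescription',
                 'line.data.account.shortDescription')
stage4 = unwind(stage3, 'shortDescription')  # one document per leaf value
```

After the first unwind each document holds a single line element, so the second $project can reach shortDescription through plain nested fields, and the final unwind turns each one-element list into a scalar.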

This can be loaded into pandas with no additional transformations:

import pandas as pd

# `collection` is the handle defined in the question (db.collections)
results = collection.aggregate([
    {"$project": {"line": "$metadata.encodingStage.terms.data.line"}},
    {"$unwind": "$line"},
    {"$project": {"shortDescription": "$line.data.account.shortDescription"}},
    {"$unwind": "$shortDescription"}
])
# The cursor yields flat dicts, so it can be consumed directly
df = pd.DataFrame(list(results))

Output:

print(df)

                  _id shortDescription
0  xxxxxxxxxxxxxxx123         12340000
1  xxxxxxxxxxxxxxx123         43210000
2  yyyyyyyyyyyyyyy789         56780000
3  yyyyyyyyyyyyyyy789         78920000
4  yyyyyyyyyyyyyyy789         55550000
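When running this across the whole collection, it may be worth restoring the $exists filter from the original find query as a leading $match stage, so that documents without the field are excluded before any projection or unwinding. The sketch below only builds the pipeline as a list of dicts; the helper name build_flatten_pipeline and its parameters are hypothetical, and whether the extra stage helps in practice depends on the data and indexes.

```python
def build_flatten_pipeline(array_path, leaf_path, out_field):
    """Build the flatten pipeline with a leading $exists filter.

    array_path: dotted path to the outer array (here, 'line')
    leaf_path:  dotted path from each array element to the leaf array
    out_field:  name of the flattened output column
    """
    full_path = f'{array_path}.{leaf_path}'
    return [
        # Same filter as the find query in the question
        {'$match': {full_path: {'$exists': True}}},
        {'$project': {'line': f'${array_path}'}},
        {'$unwind': '$line'},
        {'$project': {out_field: f'$line.{leaf_path}'}},
        {'$unwind': f'${out_field}'},
    ]


pipeline = build_flatten_pipeline(
    'metadata.encodingStage.terms.data.line',
    'data.account.shortDescription',
    'shortDescription',
)
# With a live connection this would be passed to collection.aggregate(pipeline)
```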
