I am querying a collection using pymongo:
import pymongo
client = pymongo.MongoClient('0.0.0.0', 27017)
db = client.documents
collection = db.collections
test_data = collection.find_one({'metadata.encodingStage.terms.data.line.data.account.shortDescription': {'$exists': True}},
{'metadata.encodingStage.terms.data.line.data.account.shortDescription': 1})
I am using find_one here for illustration, but in practice this is a find query across the whole collection.
This gives the following output:
{'_id': ObjectId('5a2fb9371de46756df51f37b'),
'metadata': {'encodingStage': {'terms': {'data':
{'line': [{'data': {'account': {'shortDescription': ['123456']}}},
{'data': {'account': {'shortDescription': ['7890123']}}}]}}}}}
However, I would like the data in tabular format, as per SQL or Pandas:
_id                                   shortDescription
-------------------------------------------------------
ObjectId('5a2fb9371de46756df51f37b')  123456
ObjectId('5a2fb9371de46756df51f37b')  7890123
I understand how to do this in Python, looping over the results, but for computational efficiency, I would like more of the tabulation to happen in Mongo.
Is there a simple way to use pymongo to output the results as {'_id': 'XXX', 'shortDescription': 'XXX'} pairs which can be efficiently tabulated?
Unwind aggregation?
I have attempted to do this as an $unwind aggregation:
unwind = collection.aggregate([{'$unwind': '$metadata.encodingStage.terms.data.line.data.account.shortDescription'}])
...but this returns no data.
Solution
The first problem with my logic was that line is an array, so it needed to be unwound first. Combining that with two $project steps and a final $unwind on the leaf array flattens the data into (_id, shortDescription) pairs that can be quickly transformed into a pandas DataFrame:
collection.aggregate([
{"$project": {"line": "$metadata.encodingStage.terms.data.line"}},
{"$unwind": "$line"},
{"$project": {"shortDescription": "$line.data.account.shortDescription"}},
{"$unwind": "$shortDescription"}
])
Output:
[{'_id': ObjectId('xxxxxxxxxxxxxxx123'), 'shortDescription': '12340000'},
{'_id': ObjectId('xxxxxxxxxxxxxxx123'), 'shortDescription': '43210000'},
{'_id': ObjectId('yyyyyyyyyyyyyyy789'), 'shortDescription': '56780000'},
{'_id': ObjectId('yyyyyyyyyyyyyyy789'), 'shortDescription': '78920000'},
{'_id': ObjectId('yyyyyyyyyyyyyyy789'), 'shortDescription': '55550000'}]
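To make the role of each stage easier to see, here is a minimal plain-Python sketch that imitates the four stages on the sample document from the question. It is only an illustration of the expected shape, not a replacement for the aggregation: the flatten function is a hypothetical helper, and a plain string stands in for the real ObjectId.

```python
# Plain-Python imitation of the pipeline, run against the sample document
# from the question. A string stands in for the real ObjectId (assumption).
doc = {
    "_id": "5a2fb9371de46756df51f37b",
    "metadata": {"encodingStage": {"terms": {"data": {"line": [
        {"data": {"account": {"shortDescription": ["123456"]}}},
        {"data": {"account": {"shortDescription": ["7890123"]}}},
    ]}}}},
}


def flatten(document):
    """Yield one {'_id', 'shortDescription'} dict per leaf value."""
    # First $project: pull the 'line' array up to the top level.
    lines = document["metadata"]["encodingStage"]["terms"]["data"]["line"]
    # First $unwind: one iteration per element of 'line'.
    for line in lines:
        # Second $project: keep only the leaf array.
        descriptions = line["data"]["account"]["shortDescription"]
        # Second $unwind: one output row per element of the leaf array.
        for description in descriptions:
            yield {"_id": document["_id"], "shortDescription": description}


rows = list(flatten(doc))
print(rows)
```

Each nested loop corresponds to one $unwind, which is why the original single $unwind on the full dotted path was not enough: the path passes through the line array before it reaches shortDescription.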
This can be loaded into pandas with no additional transformation:
import pandas as pd

# `collection` is the pymongo collection defined in the question.
results = collection.aggregate([
{"$project": {"line": "$metadata.encodingStage.terms.data.line"}},
{"$unwind": "$line"},
{"$project": {"shortDescription": "$line.data.account.shortDescription"}},
{"$unwind": "$shortDescription"}
])
# aggregate() returns a cursor; materialise it into a list of dicts.
df = pd.DataFrame(list(results))
Output:
print(df)
                  _id shortDescription
0  xxxxxxxxxxxxxxx123         12340000
1  xxxxxxxxxxxxxxx123         43210000
2  yyyyyyyyyyyyyyy789         56780000
3  yyyyyyyyyyyyyyy789         78920000
4  yyyyyyyyyyyyyyy789         55550000
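Since the original find query filtered on $exists, the same filter could be placed in a $match stage at the front of the pipeline, which may let MongoDB discard non-matching documents before any unwinding happens. The sketch below only builds the pipeline as plain data; build_pipeline and LEAF_PATH are hypothetical names, and whether the extra stage helps in practice depends on the collection and its indexes.

```python
# Hypothetical helper that returns the aggregation pipeline as a plain list.
# LEAF_PATH is the dotted path used in the question's find_one filter.
LEAF_PATH = "metadata.encodingStage.terms.data.line.data.account.shortDescription"


def build_pipeline(match_existing=True):
    """Return the flattening pipeline, optionally prefixed by a $match."""
    pipeline = []
    if match_existing:
        # Same $exists filter as the question's find_one query.
        pipeline.append({"$match": {LEAF_PATH: {"$exists": True}}})
    pipeline += [
        {"$project": {"line": "$metadata.encodingStage.terms.data.line"}},
        {"$unwind": "$line"},
        {"$project": {"shortDescription": "$line.data.account.shortDescription"}},
        {"$unwind": "$shortDescription"},
    ]
    return pipeline


pipeline = build_pipeline()
# Usage, assuming `collection` and `pd` are defined as in the answer above:
# df = pd.DataFrame(list(collection.aggregate(pipeline)))
```

By default $unwind already drops documents where the field is missing, so the $match should not change the result; it is only an attempt to reduce the number of documents the later stages have to process.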