
Import data from MongoDB

When I import data from MongoDB in Jupyter, memory usage stays around 35% but CPU sits between 100% and 135%, and the import takes a very long time. I am not sure where the bottleneck is in this case. My code is below.

According to other users on SO, the collection is already indexed. What else can I do to speed up importing the data onto my PC?

{
    "_id" : ObjectId("5ad0ade0bef1fc2fba99489d"),
    "property_a" : 0.0,
    "property_b" : 0.0,
    "property_c" : 0.0,
    "property_d" : 0.0,
    "property_e" : 0.0,
.....

}

The code I use to import the data is as follows; I execute it in a Jupyter notebook. Please be clear on whether the changes should be made in the Jupyter notebook or in MongoDB (in my case I use Robo 3T).

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """A utility for making a connection to Mongo."""

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db]


def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """Read from Mongo and store the result in a DataFrame."""

    # Avoid a mutable default argument: a query={} default would be
    # shared between calls.
    if query is None:
        query = {}

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Drop the _id column if requested (guard against an empty result)
    if no_id and '_id' in df:
        del df['_id']

    return df
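One way to cut both transfer time and DataFrame construction cost is to ask the server for only the fields you need, using a projection as the second argument to find(). This also drops _id on the server side, so there is no column to delete afterwards. A minimal sketch, with made-up field values standing in for what a projected cursor would yield (in real use the documents would come from db[collection].find(query, projection)):

```python
import pandas as pd

# Hypothetical projection: with pymongo this would be
#   cursor = db[collection].find({}, {"_id": 0, "property_a": 1, "property_b": 1})
# Here we simulate the projected documents with plain dicts.
projected_docs = [
    {"property_a": 0.0, "property_b": 1.5},
    {"property_a": 2.0, "property_b": 3.5},
]

# Only the requested columns arrive; no _id column to delete.
df = pd.DataFrame(projected_docs)
print(df.shape)
```

Fetching fewer fields per document reduces the bytes MongoDB sends over the wire and the Python objects pandas has to convert, which is usually where the CPU time goes.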

You can use the db.currentOp() function: it lists the currently running operations with detailed information, including how long each has been running (secs_running).

Open the MongoDB shell, run this function, and then analyse the output:

db.currentOp({"secs_running": {$gte: 3}})

If your CPU load is 100%, you can try different threshold values to narrow the output and find the 'bad query'.

The most important keys to analyse:

- active: whether the operation is in the 'in progress' state
- secs_running: the operation's duration, in seconds
- ns: the namespace (database.collection) against which the operation runs
- query: the query body

So now we know how to find slow queries that can lead to high CPU load.
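To illustrate how those keys are read, here is a small sketch that applies the same secs_running filter to a currentOp-style result in Python (the inprog entries below are fabricated examples, not real server output):

```python
# Fabricated example of the "inprog" array that db.currentOp() returns;
# the field values are made up for illustration.
inprog = [
    {"active": True, "secs_running": 12, "ns": "mydb.readings",
     "query": {"find": "readings"}},
    {"active": True, "secs_running": 1, "ns": "mydb.other",
     "query": {"find": "other"}},
]

# Keep only operations running 3 seconds or longer -- the same filter
# as db.currentOp({"secs_running": {$gte: 3}}) applied server-side.
slow_ops = [op for op in inprog if op.get("secs_running", 0) >= 3]

for op in slow_ops:
    print(op["ns"], op["secs_running"])
```

Any operation that survives this filter is a candidate for the slow query driving the CPU load, and its ns and query fields tell you which collection and query body to optimise.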

This solution comes from this link: https://medium.com/quickblox-engineering/troubleshooting-mongodb-100-cpu-load-and-slow-queries-da622c6e1339 — you can find more information there.

