
Import data from MongoDB

When I import data from MongoDB in Jupyter, memory usage stays around 35% but CPU sits between 100% and 135%, and the import takes a very long time. I am not sure where the bottleneck is in this case. My code is below.

According to other users on SO, the collection is already indexed. What else can I do to speed up importing the data onto my PC?

{
    "_id" : ObjectId("5ad0ade0bef1fc2fba99489d"),
    "property_a" : 0.0,
    "property_b" : 0.0,
    "property_c" : 0.0,
    "property_d" : 0.0,
    "property_e" : 0.0,
.....

}

The code I use to import the data is as follows; I execute it in a Jupyter notebook. Please be clear on whether the changes should be made in the Jupyter notebook or in MongoDB (in my case I use Robo 3T).

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """A utility for making a connection to Mongo."""

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)

    return conn[db]


def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """Read from Mongo and store the result in a DataFrame."""

    # Avoid a mutable default argument: a query={} default would be
    # shared between calls.
    if query is None:
        query = {}

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Drop the _id column if requested (guard against an empty result)
    if no_id and '_id' in df:
        del df['_id']

    return df
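One way to cut both transfer time and DataFrame construction cost is to ask the server for only the fields you need, using a projection as the second argument to find(). This also drops _id on the server side, so there is no column to delete afterwards. A minimal sketch, with made-up field values standing in for what a projected cursor would yield (in real use the documents would come from db[collection].find(query, projection)):

```python
import pandas as pd

# Hypothetical projection: with pymongo this would be
#   cursor = db[collection].find({}, {"_id": 0, "property_a": 1, "property_b": 1})
# Here we simulate the projected documents with plain dicts.
projected_docs = [
    {"property_a": 0.0, "property_b": 1.5},
    {"property_a": 2.0, "property_b": 3.5},
]

# Only the requested columns arrive; no _id column to delete.
df = pd.DataFrame(projected_docs)
print(df.shape)
```

Fetching fewer fields per document reduces the bytes MongoDB sends over the wire and the Python objects pandas has to convert, which is usually where the CPU time goes.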

You can use the db.currentOp() function: it lists the currently running operations with detailed information, including how long each has been running (secs_running).

Open the MongoDB shell, run this function, and then analyse the output:

db.currentOp({"secs_running": {$gte: 3}})

If your CPU load is 100%, you can try different threshold values to narrow the output and find the 'bad query'.

The most important keys to analyse:

- active: whether the operation is in the 'in progress' state
- secs_running: the operation's duration, in seconds
- ns: the namespace (database.collection) against which the operation runs
- query: the query body

So now we know how to find slow queries that can lead to high CPU load.
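To illustrate how those keys are read, here is a small sketch that applies the same secs_running filter to a currentOp-style result in Python (the inprog entries below are fabricated examples, not real server output):

```python
# Fabricated example of the "inprog" array that db.currentOp() returns;
# the field values are made up for illustration.
inprog = [
    {"active": True, "secs_running": 12, "ns": "mydb.readings",
     "query": {"find": "readings"}},
    {"active": True, "secs_running": 1, "ns": "mydb.other",
     "query": {"find": "other"}},
]

# Keep only operations running 3 seconds or longer -- the same filter
# as db.currentOp({"secs_running": {$gte: 3}}) applied server-side.
slow_ops = [op for op in inprog if op.get("secs_running", 0) >= 3]

for op in slow_ops:
    print(op["ns"], op["secs_running"])
```

Any operation that survives this filter is a candidate for the slow query driving the CPU load, and its ns and query fields tell you which collection and query body to optimise.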

This solution comes from this link: https://medium.com/quickblox-engineering/troubleshooting-mongodb-100-cpu-load-and-slow-queries-da622c6e1339 — you can find more information there.

