When I import data from MongoDB using Jupyter, memory usage stays at about 35% during the import, but CPU usage is between 100% and 135%. The import takes a very long time, and I am not sure where the bottleneck is.
According to other users on SO, the collection is already indexed. What else can I do to speed up importing the data onto my PC? A sample document looks like this:
{
    "_id" : ObjectId("5ad0ade0bef1fc2fba99489d"),
    "property_a" : 0.0,
    "property_b" : 0.0,
    "property_c" : 0.0,
    "property_d" : 0.0,
    "property_e" : 0.0,
    .....
}
The code I use to import the data is as follows; I execute it in the Jupyter notebook. Kindly be clear on whether any changes should be made in the Jupyter notebook or in MongoDB (in my case I use Robo 3T).
import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)
    return conn[db]


def read_mongo(db, collection, query=None, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    # Make a query to the specific DB and Collection
    # (None instead of a mutable {} default; an empty filter matches every document)
    cursor = db[collection].find(query or {})
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    # Delete the _id (the column is absent when the query returns no documents)
    if no_id and '_id' in df.columns:
        del df['_id']
    return df
You can check db.currentOp(). This function lists the currently running queries with detailed information, including how long each one has been running (secs_running).
Run this function in the MongoDB console (the shell, not the Jupyter notebook) and then analyse the output:
db.currentOp({"secs_running": {$gte: 3}})
If your CPU load is 100%, you can try different values for the threshold to reduce the output and find the 'bad query'.
The most important keys to analyse:
- active: the query is in the 'in progress' state
- secs_running: the query's duration, in seconds
- ns: the name of the collection the query runs against
- query: the query body
So now we know how to find slow queries that can lead to high CPU load.
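Since the question runs everything from a Jupyter notebook, the same check can also be sketched on the notebook side. This is only a rough sketch under some assumptions: `summarize_ops` and `slow_operations` are hypothetical helper names, `client` is assumed to be a `pymongo.MongoClient` whose user is allowed to run the `currentOp` command on the `admin` database, and the fallback from `query` to `command` assumes that newer servers report the operation body under `command`.

```python
def summarize_ops(inprog):
    """Keep only the keys discussed above for each operation document."""
    summary = []
    for op in inprog:
        summary.append({
            "active": op.get("active"),
            "secs_running": op.get("secs_running"),
            "ns": op.get("ns"),
            # Assumption: the body is under "query" on older servers
            # and under "command" on newer ones.
            "query": op.get("query", op.get("command")),
        })
    # Longest-running operations first.
    return sorted(summary, key=lambda op: op["secs_running"] or 0, reverse=True)


def slow_operations(client, min_secs=3):
    """Ask the server for active operations running at least min_secs seconds."""
    # "currentOp" is the server command behind the shell's db.currentOp();
    # the filter fields are passed in the same command document.
    result = client.admin.command(
        {"currentOp": 1, "active": True, "secs_running": {"$gte": min_secs}}
    )
    return summarize_ops(result.get("inprog", []))
```

In the notebook this could be called as `slow_operations(MongoClient(host, port))` while the import is running in another session. If nothing slow shows up on the server, the time may be going into the client side instead, for example into building the list of documents and the DataFrame.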
This solution comes from the following article, where you can find more information: https://medium.com/quickblox-engineering/troubleshooting-mongodb-100-cpu-load-and-slow-queries-da622c6e1339