How can I improve the performance of importing data into MongoDB? I have 17,700 txt files, and to import them I first turn each one into a dictionary and then insert it into Mongo, but looping over the files this way is really slow. Any suggestions? Thank you. This is my code:
from bson.objectid import ObjectId

def txt_dict(x):
    d = {}
    with open(x, 'r') as inf:
        conta = 0
        for line in inf:
            if conta == 0:
                movie_id = line.replace(":", "")
                conta = conta + 1
            else:
                d['user_id'] = line.split(sep=',')[0]
                d['rating'] = int(line.split(sep=',')[1])
                d['date'] = line.split(sep=',')[2]
                d['_id'] = ObjectId()
                d['movie_id'] = movie_id
                collection.insert(d)
import os

directory = r"/Users/lorenzofamiglini/Desktop/Data_Science/training_set"
for filename in os.listdir(directory):
    if filename.endswith('.txt'):
        txt_dict(directory + "/" + filename)
        #print(str(directory + "/" + filename))
Two ways to improve performance.

First, batch your writes. `collection.insert(d)` sends one document per round trip (and is deprecated in modern PyMongo in favour of `insert_one`/`insert_many`). Any database is constrained by disk write speed on a single insert but is very efficient at batching multiple insert operations together, so accumulate documents in a list and write them with `insert_many`. Second, parallelize your loading: by parsing and inserting several files concurrently you can saturate the disk.
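A minimal sketch of the batching idea, assuming the file layout from the question (first line `movie_id:`, then `user_id,rating,date` rows); `parse_movie_file` and `batched` are hypothetical helper names, and the collection setup in the comments is an assumption:

```python
from itertools import islice

def parse_movie_file(path):
    """Parse one ratings file into a list of documents (no inserts here)."""
    docs = []
    with open(path) as inf:
        # First line is assumed to look like "123:"
        movie_id = next(inf).strip().rstrip(":")
        for line in inf:
            user_id, rating, date = line.strip().split(",")
            docs.append({
                "user_id": user_id,
                "rating": int(rating),
                "date": date,
                "movie_id": movie_id,
            })
    return docs

def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage -- requires a running MongoDB and pymongo:
# from pymongo import MongoClient
# collection = MongoClient()["netflix"]["ratings"]
# for path in paths:
#     for chunk in batched(parse_movie_file(path), 1000):
#         collection.insert_many(chunk, ordered=False)
```

`ordered=False` lets the server continue past an individual failed document instead of aborting the rest of the batch, which is usually what you want in a bulk load.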
In short, it will run faster. Beyond that you are into spreading your writes across multiple disk drives and using SSDs.
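The parallel loading could be sketched like this with a process pool, where each worker handles whole files independently. The database/collection names are assumptions, and each worker opens its own client because PyMongo's `MongoClient` is not fork-safe:

```python
import os
from multiprocessing import Pool

def list_txt_files(directory):
    """All .txt paths in `directory` (e.g. the 17,700 files in training_set)."""
    return [os.path.join(directory, f)
            for f in sorted(os.listdir(directory)) if f.endswith(".txt")]

def load_one(path):
    """Worker: parse one file and bulk-insert its documents.

    A fresh MongoClient per worker process; connection details here
    are placeholders.
    """
    from pymongo import MongoClient
    coll = MongoClient()["netflix"]["ratings"]
    docs = []
    with open(path) as inf:
        movie_id = next(inf).strip().rstrip(":")
        for line in inf:
            user_id, rating, date = line.strip().split(",")
            docs.append({"user_id": user_id, "rating": int(rating),
                         "date": date, "movie_id": movie_id})
    if docs:
        coll.insert_many(docs, ordered=False)

# Hypothetical usage -- requires a running MongoDB:
# if __name__ == "__main__":
#     with Pool(4) as pool:
#         pool.map(load_one, list_txt_files("training_set"))
```

Four workers is a starting point; tune the pool size until the disk (not the CPU) is the bottleneck.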
With MongoDB Atlas you can turn up the IOPS rate (Input/Output Operations Per Second) during data loads and dial it down afterwards. Always an option if you are in the cloud.