
Improve performance fetching data from MongoDB

earnings = self.collection.find({})  # returns 60k documents
----
data_dic = {'score': [], 'reading_time': []}
for earning in earnings:
    data_dic['reading_time'].append(earning['reading_time'])
    data_dic['score'].append(earning['score'])
----
df = pd.DataFrame()
df['reading_time'] = data_dic['reading_time']
df['score'] = data_dic['score']

The code between the ---- markers takes 4 seconds to complete. How can I improve this function?

The total time consists of several parts: MongoDB query time, data transfer time, network round trips, and Python list operations. You can optimize each of them.

First, reduce the amount of data transferred. Since you only need reading_time and score, you can fetch just those fields using a projection. If your average document size is large, this approach is very effective.

earnings = self.collection.find({}, {'reading_time': True, 'score': True})
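
Note that MongoDB includes the _id field in the results by default even when a projection is given; excluding it explicitly trims each document a little further:

earnings = self.collection.find({}, {'reading_time': True, 'score': True, '_id': False})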

Second, MongoDB transfers data in batches of limited size. Since the result contains about 60k documents, multiple round trips are needed. You can increase the cursor's batch size (batch_size in PyMongo, batchSize in the mongo shell) to reduce the round-trip count, as shown below.
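
A minimal sketch, assuming PyMongo; the value 10000 is an arbitrary example to tune against your document size:

# Request larger batches so 60k documents arrive in fewer round trips.
earnings = self.collection.find(
    {},
    {'reading_time': True, 'score': True, '_id': False},
    batch_size=10000,
)
# Equivalently, on an existing cursor: earnings.batch_size(10000)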

Third, increase network bandwidth if you can.

Fourth, you can accelerate the Python side by using NumPy arrays. A NumPy array is a C-style contiguous data structure, which is faster than a Python list. Pre-allocate a fixed-length array and assign values by index; this avoids the internal resizing that happens on each list.append call.

import numpy as np
import pandas as pd

# Cursor.count() was removed in PyMongo 4.x; count on the collection instead.
count = self.collection.count_documents({})
score = np.empty((count,), dtype=float)
reading_time = np.empty((count,), dtype='datetime64[us]')
for i, earning in enumerate(earnings):
    score[i] = earning['score']
    reading_time[i] = earning['reading_time']

df = pd.DataFrame()
df['reading_time'] = reading_time
df['score'] = score
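
Putting the points together, a minimal sketch of the whole fetch, assuming PyMongo 4.x and the same self.collection as in the question; batch_size=10000 is an arbitrary starting point:

import numpy as np
import pandas as pd

# Count first so the arrays can be pre-allocated.
count = self.collection.count_documents({})

# Projection cuts transfer size; batch_size cuts round trips.
earnings = self.collection.find(
    {},
    {'reading_time': True, 'score': True, '_id': False},
    batch_size=10000,
)

# Pre-allocated NumPy arrays avoid repeated list resizing.
score = np.empty((count,), dtype=float)
reading_time = np.empty((count,), dtype='datetime64[us]')
for i, earning in enumerate(earnings):
    score[i] = earning['score']
    reading_time[i] = earning['reading_time']

df = pd.DataFrame({'reading_time': reading_time, 'score': score})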
