
Improve performance fetching data from MongoDB

earnings = self.collection.find({})  # returns 60k documents
----
data_dic = {'score': [], 'reading_time': []}
for earning in earnings:
    data_dic['reading_time'].append(earning['reading_time'])
    data_dic['score'].append(earning['score'])
----
df = pd.DataFrame()
df['reading_time'] = data_dic['reading_time']
df['score'] = data_dic['score']

The code between the ---- markers takes 4 seconds to complete. How can I improve this function?

The total time consists of several parts: MongoDB query time, data transfer time, network round trips, and Python list operations. You can optimize each of them.

First, reduce the amount of data transferred. Since you only need reading_time and score, you can fetch just those fields using a projection. If your average document size is large, this approach is very effective.

earnings = self.collection.find({}, {'reading_time': True, 'score': True})
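
Note that MongoDB includes the _id field in the results by default even when a projection is given; excluding it explicitly trims each document a little further:

earnings = self.collection.find({}, {'reading_time': True, 'score': True, '_id': False})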

Second, MongoDB transfers data in batches of limited size. Since the result contains about 60k documents, multiple round trips are needed. You can increase the cursor's batch size (batch_size in PyMongo, batchSize in the mongo shell) to reduce the round-trip count, as shown below.
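
A minimal sketch, assuming PyMongo; the value 10000 is an arbitrary example to tune against your document size:

# Request larger batches so 60k documents arrive in fewer round trips.
earnings = self.collection.find(
    {},
    {'reading_time': True, 'score': True, '_id': False},
    batch_size=10000,
)
# Equivalently, on an existing cursor: earnings.batch_size(10000)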

Third, increase network bandwidth if you can.

Fourth, you can accelerate the Python side by using NumPy arrays. A NumPy array is a C-style contiguous data structure, which is faster than a Python list. Pre-allocate a fixed-length array and assign values by index; this avoids the internal resizing that happens on each list.append call.

import numpy as np
import pandas as pd

# Cursor.count() was removed in PyMongo 4.x; count on the collection instead.
count = self.collection.count_documents({})
score = np.empty((count,), dtype=float)
reading_time = np.empty((count,), dtype='datetime64[us]')
for i, earning in enumerate(earnings):
    score[i] = earning['score']
    reading_time[i] = earning['reading_time']

df = pd.DataFrame()
df['reading_time'] = reading_time
df['score'] = score
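
Putting the points together, a minimal sketch of the whole fetch, assuming PyMongo 4.x and the same self.collection as in the question; batch_size=10000 is an arbitrary starting point:

import numpy as np
import pandas as pd

# Count first so the arrays can be pre-allocated.
count = self.collection.count_documents({})

# Projection cuts transfer size; batch_size cuts round trips.
earnings = self.collection.find(
    {},
    {'reading_time': True, 'score': True, '_id': False},
    batch_size=10000,
)

# Pre-allocated NumPy arrays avoid repeated list resizing.
score = np.empty((count,), dtype=float)
reading_time = np.empty((count,), dtype='datetime64[us]')
for i, earning in enumerate(earnings):
    score[i] = earning['score']
    reading_time[i] = earning['reading_time']

df = pd.DataFrame({'reading_time': reading_time, 'score': score})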
