Slow data loading to MongoDB from pandas Dataframe
I have two large CSV files that need to be loaded into Mongo collections. First, I read the data into a pandas Dataframe, do some preprocessing, and then insert the resulting dicts into a Mongo collection. The problem is that performance is very slow: everything runs sequentially, and loading into the second collection can only start after the first collection is fully populated (its rows are updated with foreign keys). How can I speed up the loading process?
import pymongo
import config
import pandas as pd
import numpy as np
from datetime import datetime
from config import logger

client = pymongo.MongoClient(config.IP)
try:
    client.server_info()
except pymongo.errors.ServerSelectionTimeoutError as e:
    logger.error("Unable to connect to %s. Error: %s" % (config.IP, e))
    client = None

# connect to database (or create if not exists)
mydb = client[config.DB_NAME]

# connect to collections (or create if not exists)
movie_collection = mydb[config.DB_MOVIE_COLLECTION]
actors_collection = mydb[config.DB_ACTOR_COLLECTION]


def read_data(file):
    '''
    returns Dataframe with read csv data
    '''
    df = pd.read_csv(file, sep='\t')
    df.replace('\\N', np.nan, inplace=True)
    return df


def insert_to_collection(collection, data):
    collection.insert(data)


def fill_movie_data():
    '''
    iterates over movie Dataframe
    process values and creates dict structure
    with specific attributes to insert into MongoDB movie collection
    '''
    # load data to pandas Dataframe
    logger.info("Reading movie data to Dataframe")
    data = read_data('datasets/title.basics.tsv')
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['tconst']
        title = row['primaryTitle']
        # check value of movie year (if not NaN)
        if not pd.isnull(row['endYear']) and not pd.isnull(row['startYear']):
            year = list([row['startYear'], row['endYear']])
        elif not pd.isnull(row['startYear']):
            year = int(row['startYear'])
        else:
            year = None
        # check value of movie duration (if not NaN)
        if not pd.isnull(row['runtimeMinutes']):
            try:
                duration = int(row['runtimeMinutes'])
            except ValueError:
                duration = None
        else:
            duration = None
        # check value of genres (if not NaN)
        if not pd.isnull(row['genres']):
            genres = row['genres'].split(',')
        else:
            genres = None
        result_dict['_id'] = id_
        result_dict['primary_title'] = title
        # if both years have values
        if isinstance(year, list):
            result_dict['year_start'] = int(year[0])
            result_dict['year_end'] = int(year[1])
        # if start_year has value
        elif year:
            result_dict['year'] = year
        if duration:
            result_dict['duration'] = duration
        if genres:
            result_dict['genres'] = genres
        insert_to_collection(movie_collection, result_dict)


def fill_actors_data():
    '''
    iterates over actors Dataframe
    process values, creates dict structure
    with new fields to insert into MongoDB actors collection
    '''
    logger.info("Inserting data to actors collection")
    # load data to pandas Dataframe
    logger.info("Reading actors data to Dataframe")
    data = read_data('datasets/name.basics.tsv')
    logger.info("Inserting data to actors collection")
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['nconst']
        name = row['primaryName']
        # if no birth year and death year value
        if pd.isnull(row['birthYear']):
            yob = None
            alive = False
        # if both birth and death year have value
        elif not pd.isnull(row['birthYear']) and not pd.isnull(row['deathYear']):
            yob = int(row['birthYear'])
            death = int(row['deathYear'])
            age = death - yob
            alive = False
        # if only birth year has value
        else:
            yob = int(row['birthYear'])
            current_year = datetime.now().year
            age = current_year - yob
            alive = True
        if not pd.isnull(row['knownForTitles']):
            movies = row['knownForTitles'].split(',')
        result_dict['_id'] = id_
        result_dict['name'] = name
        result_dict['yob'] = yob
        result_dict['alive'] = alive
        result_dict['age'] = age
        result_dict['movies'] = movies
        insert_to_collection(actors_collection, result_dict)
        # update movie documents with list of actors ids
        movie_collection.update_many({"_id": {"$in": movies}}, {"$push": {"people": id_}})


# if collections are empty, fill it with data
if movie_collection.count() == 0:
    fill_movie_data()
if actors_collection.count() == 0:
    fill_actors_data()
Instead of inserting one record at a time, insert in bulk with insert_many.
At the moment you have:
def insert_to_collection(collection: pymongo.collection.Collection, data: dict):
    collection.insert(data)
You are using insert(), which is deprecated, by the way. What you want to have is:
def insert_to_collection(collection: pymongo.collection.Collection, data: list):
    collection.insert_many(data)
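One common way to feed insert_many is a small chunking helper. The sketch below is my own illustration, not from the original post; the helper name batched and the batch size of 500 are arbitrary:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Usage with a pymongo collection (hypothetical `docs` iterable):
#     for batch in batched(docs, 500):
#         collection.insert_many(batch)
```

Each insert_many call then carries one full batch instead of a single document, cutting the number of round trips to the server.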
So in your two functions, fill_movie_data and fill_actors_data, instead of calling insert_to_collection() on every iteration of the loop, you can call it once in a while and insert in bulk.
Below is the code you posted, with a few modifications:
Add a max_bulk_size: the larger it is, the faster the load, just make sure it doesn't exceed your RAM.
max_bulk_size = 500
Add a results_list and append each result_dict to it. Once the size of the list reaches max_bulk_size, insert it and empty the list.
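The accumulate-and-flush pattern can be exercised in isolation before the full rewrite. In the standalone sketch below a plain list stands in for the MongoDB collection, so it runs without a server; note the final flush after the loop, without which the last partial batch would never be inserted:

```python
max_bulk_size = 3        # tiny batch size, just for illustration
inserted_batches = []    # stands in for the MongoDB collection

def insert_to_collection(sink, data):
    # a real implementation would call collection.insert_many(data)
    sink.append(list(data))

results_list = []
for doc_id in range(8):  # 8 documents, batch size 3
    results_list.append({"_id": doc_id})
    if len(results_list) >= max_bulk_size:
        insert_to_collection(inserted_batches, results_list)
        results_list = []
# flush whatever is left over after the loop
if results_list:
    insert_to_collection(inserted_batches, results_list)
```

With 8 documents and a batch size of 3, this produces batches of 3, 3, and 2 documents; dropping the final flush would silently lose the last two.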
def fill_movie_data():
    '''
    iterates over movie Dataframe
    process values and creates dict structure
    with specific attributes to insert into MongoDB movie collection
    '''
    # load data to pandas Dataframe
    logger.info("Reading movie data to Dataframe")
    data = read_data('datasets/title.basics.tsv')
    results_list = []
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['tconst']
        title = row['primaryTitle']
        # check value of movie year (if not NaN)
        if not pd.isnull(row['endYear']) and not pd.isnull(row['startYear']):
            year = list([row['startYear'], row['endYear']])
        elif not pd.isnull(row['startYear']):
            year = int(row['startYear'])
        else:
            year = None
        # check value of movie duration (if not NaN)
        if not pd.isnull(row['runtimeMinutes']):
            try:
                duration = int(row['runtimeMinutes'])
            except ValueError:
                duration = None
        else:
            duration = None
        # check value of genres (if not NaN)
        if not pd.isnull(row['genres']):
            genres = row['genres'].split(',')
        else:
            genres = None
        result_dict['_id'] = id_
        result_dict['primary_title'] = title
        # if both years have values
        if isinstance(year, list):
            result_dict['year_start'] = int(year[0])
            result_dict['year_end'] = int(year[1])
        # if start_year has value
        elif year:
            result_dict['year'] = year
        if duration:
            result_dict['duration'] = duration
        if genres:
            result_dict['genres'] = genres
        results_list.append(result_dict)
        if len(results_list) >= max_bulk_size:
            insert_to_collection(movie_collection, results_list)
            results_list = []
    # insert whatever is left over after the loop
    if results_list:
        insert_to_collection(movie_collection, results_list)
Same with your other loop.
def fill_actors_data():
    '''
    iterates over actors Dataframe
    process values, creates dict structure
    with new fields to insert into MongoDB actors collection
    '''
    logger.info("Inserting data to actors collection")
    # load data to pandas Dataframe
    logger.info("Reading actors data to Dataframe")
    data = read_data('datasets/name.basics.tsv')
    logger.info("Inserting data to actors collection")
    results_list = []
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['nconst']
        name = row['primaryName']
        # if no birth year and death year value
        if pd.isnull(row['birthYear']):
            yob = None
            alive = False
        # if both birth and death year have value
        elif not pd.isnull(row['birthYear']) and not pd.isnull(row['deathYear']):
            yob = int(row['birthYear'])
            death = int(row['deathYear'])
            age = death - yob
            alive = False
        # if only birth year has value
        else:
            yob = int(row['birthYear'])
            current_year = datetime.now().year
            age = current_year - yob
            alive = True
        if not pd.isnull(row['knownForTitles']):
            movies = row['knownForTitles'].split(',')
        result_dict['_id'] = id_
        result_dict['name'] = name
        result_dict['yob'] = yob
        result_dict['alive'] = alive
        result_dict['age'] = age
        result_dict['movies'] = movies
        results_list.append(result_dict)
        if len(results_list) >= max_bulk_size:
            insert_to_collection(actors_collection, results_list)
            results_list = []
        # update movie documents with list of actors ids
        movie_collection.update_many({"_id": {"$in": movies}}, {"$push": {"people": id_}})
    # insert whatever is left over after the loop
    if results_list:
        insert_to_collection(actors_collection, results_list)
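A further possible speedup, beyond what the answer above covers: iterrows() is itself slow because it materializes a pandas Series for every row. Where the per-row logic allows it, DataFrame.to_dict("records") produces plain dicts in a single call, ready to be chunked into insert_many. A minimal sketch, with column names taken from the question's dataset (renaming tconst to _id is my own assumption to match the posted schema):

```python
import pandas as pd

# Tiny stand-in for the title.basics.tsv Dataframe from the question
df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens"],
})

# One call converts every row to a plain dict
docs = df.to_dict("records")

# Rename the key so MongoDB uses it as the document id
for d in docs:
    d["_id"] = d.pop("tconst")

# docs can now be split into batches and passed to insert_many
```

This avoids the per-row Series construction entirely, though any value cleaning (the NaN checks, int conversions, and splits in the original functions) would then need to be done either vectorized on the Dataframe or on the dicts afterwards.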