Slow data loading to MongoDB from pandas Dataframe
I have two large CSV files that need to be loaded into Mongo collections. First, I read the data into a pandas Dataframe, do some preprocessing, and then insert the resulting dicts into a Mongo collection. The problem is that performance is very slow: everything runs sequentially, and loading into the second collection can only start after the first collection is fully populated (its rows are updated with foreign keys). How can I speed up the loading process?
import pymongo
import config
import pandas as pd
import numpy as np
from datetime import datetime
from config import logger

client = pymongo.MongoClient(config.IP)
try:
    client.server_info()
except pymongo.errors.ServerSelectionTimeoutError as e:
    logger.error("Unable to connect to %s. Error: %s" % (config.IP, e))
    client = None

# connect to database (or create if not exists)
mydb = client[config.DB_NAME]

# connect to collections (or create if not exists)
movie_collection = mydb[config.DB_MOVIE_COLLECTION]
actors_collection = mydb[config.DB_ACTOR_COLLECTION]


def read_data(file):
    '''
    returns Dataframe with read csv data
    '''
    df = pd.read_csv(file, sep='\t')
    df.replace('\\N', np.nan, inplace=True)
    return df


def insert_to_collection(collection, data):
    collection.insert(data)


def fill_movie_data():
    '''
    iterates over movie Dataframe
    process values and creates dict structure
    with specific attributes to insert into MongoDB movie collection
    '''
    # load data to pandas Dataframe
    logger.info("Reading movie data to Dataframe")
    data = read_data('datasets/title.basics.tsv')
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['tconst']
        title = row['primaryTitle']
        # check value of movie year (if not NaN)
        if not pd.isnull(row['endYear']) and not pd.isnull(row['startYear']):
            year = list([row['startYear'], row['endYear']])
        elif not pd.isnull(row['startYear']):
            year = int(row['startYear'])
        else:
            year = None
        # check value of movie duration (if not NaN)
        if not pd.isnull(row['runtimeMinutes']):
            try:
                duration = int(row['runtimeMinutes'])
            except ValueError:
                duration = None
        else:
            duration = None
        # check value of genres (if not NaN)
        if not pd.isnull(row['genres']):
            genres = row['genres'].split(',')
        else:
            genres = None
        result_dict['_id'] = id_
        result_dict['primary_title'] = title
        # if both years have values
        if isinstance(year, list):
            result_dict['year_start'] = int(year[0])
            result_dict['year_end'] = int(year[1])
        # if start_year has value
        elif year:
            result_dict['year'] = year
        if duration:
            result_dict['duration'] = duration
        if genres:
            result_dict['genres'] = genres
        insert_to_collection(movie_collection, result_dict)


def fill_actors_data():
    '''
    iterates over actors Dataframe
    process values, creates dict structure
    with new fields to insert into MongoDB actors collection
    '''
    logger.info("Inserting data to actors collection")
    # load data to pandas Dataframe
    logger.info("Reading actors data to Dataframe")
    data = read_data('datasets/name.basics.tsv')
    logger.info("Inserting data to actors collection")
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['nconst']
        name = row['primaryName']
        # if no birth year and death year value
        if pd.isnull(row['birthYear']):
            yob = None
            alive = False
        # if both birth and death year have value
        elif not pd.isnull(row['birthYear']) and not pd.isnull(row['deathYear']):
            yob = int(row['birthYear'])
            death = int(row['deathYear'])
            age = death - yob
            alive = False
        # if only birth year has value
        else:
            yob = int(row['birthYear'])
            current_year = datetime.now().year
            age = current_year - yob
            alive = True
        if not pd.isnull(row['knownForTitles']):
            movies = row['knownForTitles'].split(',')
        result_dict['_id'] = id_
        result_dict['name'] = name
        result_dict['yob'] = yob
        result_dict['alive'] = alive
        result_dict['age'] = age
        result_dict['movies'] = movies
        insert_to_collection(actors_collection, result_dict)
        # update movie documents with list of actors ids
        movie_collection.update_many({"_id": {"$in": movies}}, {"$push": {"people": id_}})


# if collections are empty, fill it with data
if movie_collection.count() == 0:
    fill_movie_data()
if actors_collection.count() == 0:
    fill_actors_data()
Instead of inserting one record at a time, insert in bulk with insert_many.
At the moment you have:
def insert_to_collection(collection: pymongo.collection.Collection, data: dict):
    collection.insert(data)
You are using insert(), which is deprecated, by the way. What you want to have is:
def insert_to_collection(collection: pymongo.collection.Collection, data: list):
    collection.insert_many(data)
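One common way to feed insert_many is a small chunking helper. The sketch below is my own illustration, not from the original post; the helper name batched and the batch size of 500 are arbitrary:

```python
from itertools import islice

def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Usage with a pymongo collection (hypothetical `docs` iterable):
#     for batch in batched(docs, 500):
#         collection.insert_many(batch)
```

Each insert_many call then carries one full batch instead of a single document, cutting the number of round trips to the server.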
So in your two functions, fill_movie_data and fill_actors_data, instead of calling insert_to_collection() on every iteration of the loop, you can call it once in a while and insert in bulk.
Below is the code you posted, with a few modifications:
Add a max_bulk_size: the larger it is, the faster the load, just make sure it doesn't exceed your RAM.
max_bulk_size = 500
Add a results_list and append each result_dict to it. Once the size of the list reaches max_bulk_size, insert it and empty the list.
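The accumulate-and-flush pattern can be exercised in isolation before the full rewrite. In the standalone sketch below a plain list stands in for the MongoDB collection, so it runs without a server; note the final flush after the loop, without which the last partial batch would never be inserted:

```python
max_bulk_size = 3        # tiny batch size, just for illustration
inserted_batches = []    # stands in for the MongoDB collection

def insert_to_collection(sink, data):
    # a real implementation would call collection.insert_many(data)
    sink.append(list(data))

results_list = []
for doc_id in range(8):  # 8 documents, batch size 3
    results_list.append({"_id": doc_id})
    if len(results_list) >= max_bulk_size:
        insert_to_collection(inserted_batches, results_list)
        results_list = []
# flush whatever is left over after the loop
if results_list:
    insert_to_collection(inserted_batches, results_list)
```

With 8 documents and a batch size of 3, this produces batches of 3, 3, and 2 documents; dropping the final flush would silently lose the last two.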
def fill_movie_data():
    '''
    iterates over movie Dataframe
    process values and creates dict structure
    with specific attributes to insert into MongoDB movie collection
    '''
    # load data to pandas Dataframe
    logger.info("Reading movie data to Dataframe")
    data = read_data('datasets/title.basics.tsv')
    results_list = []
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['tconst']
        title = row['primaryTitle']
        # check value of movie year (if not NaN)
        if not pd.isnull(row['endYear']) and not pd.isnull(row['startYear']):
            year = list([row['startYear'], row['endYear']])
        elif not pd.isnull(row['startYear']):
            year = int(row['startYear'])
        else:
            year = None
        # check value of movie duration (if not NaN)
        if not pd.isnull(row['runtimeMinutes']):
            try:
                duration = int(row['runtimeMinutes'])
            except ValueError:
                duration = None
        else:
            duration = None
        # check value of genres (if not NaN)
        if not pd.isnull(row['genres']):
            genres = row['genres'].split(',')
        else:
            genres = None
        result_dict['_id'] = id_
        result_dict['primary_title'] = title
        # if both years have values
        if isinstance(year, list):
            result_dict['year_start'] = int(year[0])
            result_dict['year_end'] = int(year[1])
        # if start_year has value
        elif year:
            result_dict['year'] = year
        if duration:
            result_dict['duration'] = duration
        if genres:
            result_dict['genres'] = genres
        results_list.append(result_dict)
        if len(results_list) >= max_bulk_size:
            insert_to_collection(movie_collection, results_list)
            results_list = []
    # insert whatever is left over after the loop
    if results_list:
        insert_to_collection(movie_collection, results_list)
Same with your other loop.
def fill_actors_data():
    '''
    iterates over actors Dataframe
    process values, creates dict structure
    with new fields to insert into MongoDB actors collection
    '''
    logger.info("Inserting data to actors collection")
    # load data to pandas Dataframe
    logger.info("Reading actors data to Dataframe")
    data = read_data('datasets/name.basics.tsv')
    logger.info("Inserting data to actors collection")
    results_list = []
    for index, row in data.iterrows():
        result_dict = {}
        id_ = row['nconst']
        name = row['primaryName']
        # if no birth year and death year value
        if pd.isnull(row['birthYear']):
            yob = None
            alive = False
        # if both birth and death year have value
        elif not pd.isnull(row['birthYear']) and not pd.isnull(row['deathYear']):
            yob = int(row['birthYear'])
            death = int(row['deathYear'])
            age = death - yob
            alive = False
        # if only birth year has value
        else:
            yob = int(row['birthYear'])
            current_year = datetime.now().year
            age = current_year - yob
            alive = True
        if not pd.isnull(row['knownForTitles']):
            movies = row['knownForTitles'].split(',')
        result_dict['_id'] = id_
        result_dict['name'] = name
        result_dict['yob'] = yob
        result_dict['alive'] = alive
        result_dict['age'] = age
        result_dict['movies'] = movies
        results_list.append(result_dict)
        if len(results_list) >= max_bulk_size:
            insert_to_collection(actors_collection, results_list)
            results_list = []
        # update movie documents with list of actors ids
        movie_collection.update_many({"_id": {"$in": movies}}, {"$push": {"people": id_}})
    # insert whatever is left over after the loop
    if results_list:
        insert_to_collection(actors_collection, results_list)
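A further possible speedup, beyond what the answer above covers: iterrows() is itself slow because it materializes a pandas Series for every row. Where the per-row logic allows it, DataFrame.to_dict("records") produces plain dicts in a single call, ready to be chunked into insert_many. A minimal sketch, with column names taken from the question's dataset (renaming tconst to _id is my own assumption to match the posted schema):

```python
import pandas as pd

# Tiny stand-in for the title.basics.tsv Dataframe from the question
df = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens"],
})

# One call converts every row to a plain dict
docs = df.to_dict("records")

# Rename the key so MongoDB uses it as the document id
for d in docs:
    d["_id"] = d.pop("tconst")

# docs can now be split into batches and passed to insert_many
```

This avoids the per-row Series construction entirely, though any value cleaning (the NaN checks, int conversions, and splits in the original functions) would then need to be done either vectorized on the Dataframe or on the dicts afterwards.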