Pandas iterrows() is too slow in Docker

Question

I am iterating over the rows of a CSV file stored in my Docker container. The same script finishes in 6 minutes on my local machine (without Docker), but inside Docker, reading 20 rows takes a minute or two (there are 1.3M rows). The CSV file being read is 837 MB.

The code is as follows:

## added a script in the process just for test
import datetime
import sys

import pandas as pd

cleanup_consent_column = "rwJIedeRwS"
omc_master_header = [u'PPAC District Code', u'State Name', u'District Name', u'Distributor Code', u'OMC Name', u'Distributor Contact No', u'Distributor Name', u'Distributor Address', u'SO Name', u'SO Contact', u'SALES AREA CODE', u'Email', u'DNO Name', u'DNO Contact', u'Lat_Mixed', u'Long_Mixed']

#OMC_DISTRIBUTOR_MASTER = "/mnt/data/NFS/TeamData/Multiple/external/mopng/5Feb18_master_ujjwala_latlong_dist_dno_so_v7.csv"
#PPAC_MASTER = "/mnt/data/NFS/TeamData/Multiple/external/mopng/ppac_master_v3_mmi_enriched_with_sanity_check.csv"

def clean(input_filepath, OMC_DISTRIBUTOR_MASTER, PPAC_MASTER, output_filepath):
    print("Taylor Swift's clean.")
    df = pd.read_csv(input_filepath, encoding='utf-8', dtype=object)
    print ('length of input - {0} - num cols - {1}'.format(len(df), len(df.columns.tolist())))
    ## cleanup consent column
    for x in df.columns.tolist():
        if x.startswith(cleanup_consent_column):
            del df[x]
            break
    ## strip ppac code from the baseline
    df['consumer_id_name_ppac_code'] = df['consumer_id_name_ppac_code'].str.strip()

    ## merge with entity to get entity_ids
    omc_distributor_master = pd.read_csv(OMC_DISTRIBUTOR_MASTER, dtype=object, usecols=omc_master_header)
    omc_distributor_master = omc_distributor_master.add_prefix("omc_dist_master_")
    df = pd.merge(
        df, omc_distributor_master, how='left',
        left_on=['consumer_id_name_distributor_code', 'consumer_id_name_omc_name'],
        right_on=['omc_dist_master_Distributor Code', 'omc_dist_master_OMC Name']
    )

    ## log if anything not found
    print ('responses without distributor enrichment - {0}'.format(len(df[df['omc_dist_master_Distributor Code'].isnull()])))
    print ('num distributors without enrichment - {0}'.format(
        len(pd.unique(df[df['omc_dist_master_Distributor Code'].isnull()]['consumer_id_name_distributor_code']))
    ))

    ## converting date column
    df['consumer_id_name_sv_date'] = pd.to_datetime(df['consumer_id_name_sv_date'], format="%d/%m/%Y")
    df['consumer_id_name_sv_date'] = df['consumer_id_name_sv_date'].dt.strftime("%Y-%m-%d")

    ## add eventual_ppac_code
    print ("generating eventual ppac code column")
    count_de_rows = 0
    start_time = datetime.datetime.now()
    for i, row in df.iterrows():
        count_de_rows += 1
        if count_de_rows % 10000 == 0:
            print(count_de_rows)
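        ## if not found in master - use baseline data else go with omc master
        ## (NaN is the only value not equal to itself, so this test is True when the left merge found no master row)
        ## .loc replaces the original .ix indexer, which was removed in pandas 1.0
        if row['omc_dist_master_PPAC District Code'] != row['omc_dist_master_PPAC District Code']:
            df.loc[i, 'eventual_ppac_code'] = row['consumer_id_name_ppac_code']
        else:
            df.loc[i, 'eventual_ppac_code'] = row['omc_dist_master_PPAC District Code']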
    print(datetime.datetime.now() - start_time)
    print("I guess it's all alright!")


if __name__ == '__main__':
    print("The main function has been called!")
    clean(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])

Answer 1

Why loop over the rows in the first place? This looks like it can be vectorized:

df["eventual_ppac_code"] = df["omc_dist_master_PPAC District Code"]
df.loc[df["omc_dist_master_PPAC District Code"] != df["omc_dist_master_PPAC District Code"], "eventual_ppac_code"] = df["consumer_id_name_ppac_code"]

Having said that, when exactly do you expect omc_dist_master_PPAC District Code to not equal omc_dist_master_PPAC District Code? Isn't it the exact same column?
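One possible reading of that self-comparison, offered as a sketch rather than a statement of the asker's intent: in pandas, NaN is the only value that compares unequal to itself, so `col != col` behaves like `col.isna()` and flags the rows where the left merge found no master record. Under that assumption the whole loop reduces to a single `fillna`. The toy values below are made up; only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names mirror the question, the values are invented.
df = pd.DataFrame({
    "omc_dist_master_PPAC District Code": ["A1", np.nan, "C3"],
    "consumer_id_name_ppac_code": ["X1", "X2", "X3"],
})

master = df["omc_dist_master_PPAC District Code"]

# NaN != NaN, so this mask marks rows with no match in the master file.
missing = master != master

# Use the master code where present, otherwise fall back to the baseline code.
df["eventual_ppac_code"] = master.fillna(df["consumer_id_name_ppac_code"])
print(df["eventual_ppac_code"].tolist())
```

Writing the mask as `master.isna()` states the intent more directly than the self-comparison and avoids the confusion raised above.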

Answer 2

My base premise that Docker == an Ubuntu system was a logical fallacy on my part. Yes, it is right to optimize code as much as possible, but the same code showed different timings on the two systems, with Docker being the slow one. Having said that, I started working with chunksize to reduce the memory burden. Switching between reads and writes with such large data was what made Docker slow (especially the writes). Note that memory was not the issue: writing large data to persistent storage via Docker is slower than on our own systems.
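The answer does not show its code, so the following is only a minimal sketch of what chunked processing with `chunksize` might look like. The function name `process_in_chunks` and the column names `baseline_code` and `master_code` are hypothetical stand-ins, and the in-memory buffers take the place of the real input and output files.

```python
import io

import pandas as pd


def process_in_chunks(source, sink, chunksize=100_000):
    """Read `source` chunk by chunk, derive a column, and append each chunk to `sink`."""
    total = 0
    reader = pd.read_csv(source, dtype=object, chunksize=chunksize)
    for n, chunk in enumerate(reader):
        # Hypothetical columns: prefer the master code, fall back to the baseline code.
        chunk["eventual_ppac_code"] = chunk["master_code"].fillna(chunk["baseline_code"])
        # Write the header only once, with the first chunk.
        chunk.to_csv(sink, index=False, header=(n == 0))
        total += len(chunk)
    return total


# Tiny in-memory demo instead of the 837 MB file.
demo_in = io.StringIO("baseline_code,master_code\nX1,A1\nX2,\nX3,C3\n")
demo_out = io.StringIO()
rows = process_in_chunks(demo_in, demo_out, chunksize=2)
print(rows)
print(demo_out.getvalue())
```

With real files, `source` and `sink` would be a path and an open file handle. Each chunk is written once in a single call, which keeps the number of separate writes to the container's storage small.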

Answer 1
0 2018-12-21 08:25:25

Answer 2
0 Accepted 2019-01-17 10:26:09
