
pandas inner join performance issue

I have two CSV files that I load into pandas DataFrames. One file is large: about 10M rows, 20 columns (all string type), around 1 GB on disk. The other file is small: about 5k rows, 5 columns, around 1 MB. I want to do an inner join on a single common column between the two DataFrames.

This is how I join:

mergedDataSet = pd.merge(smallDataFrame, largeDataFrame, on='uid', how='inner')

If I sample 1% of the big data set, the program runs smoothly without any issues and completes within 5 seconds, so I have verified that the code itself works.

But if I join the full large data set, the program is terminated after about 20-30 seconds; the error message is Process finished with exit code 137 (interrupted by signal 9: SIGKILL). I am using Python 2.7 with Miniconda on Mac OS X, running from PyCharm. My machine has 16 GB of memory, well above the size of the 1 GB file.
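For what it is worth, the in-memory footprint of all-string columns is usually several times the on-disk size, and pandas can report it directly. A minimal check (the file name here is a placeholder for my large CSV):

import pandas as pd

largeDataFrame = pd.read_csv('large.csv')  # placeholder path
# deep=True counts the actual Python string objects, not just the pointers
mem_gib = largeDataFrame.memory_usage(deep=True).sum() / 1024.0 ** 3
print('in-memory size: %.2f GiB' % mem_gib)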

Does anyone have thoughts on tuning the performance of a DataFrame join in pandas, or any other quick solution for an inner join?

Another thing that confuses me: why is the program KILLed? By whom, and for what reason?

Edit 1: errors captured in /var/log/system.log while doing the inner join:

Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:18 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 1 Memory pressure state: 0
Aug 27 11:00:33 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=1 isKey=1 isVisible=1 delegate=0x7fb3659d3960>>: 0.02136099338531494
Aug 27 11:00:41 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01138699054718018
Aug 27 11:00:46 foo-laptop kernel[0]: low swap: killing pid 92118 (python2.7)
Aug 27 11:00:46 foo-laptop kernel[0]: memorystatus_thread: idle exiting pid 789 [CallHistoryPlugi]
Aug 27 11:00:56 foo-laptop iTerm2[43018]: Time to encode state for window <PseudoTerminal: 0x7fb3659d3960 tabs=1 window=<PTYWindow: 0x7fb3637c0c80 frame=NSRect: {{0, 0}, {1280, 800}} title=5. tail alpha=1.000000 isMain=0 isKey=0 isVisible=1 delegate=0x7fb3659d3960>>: 0.01823097467422485
Aug 27 11:00:58 foo-laptop kernel[0]: process WeChat[85077] caught causing excessive wakeups. Observed wakeups rate (per sec): 184; Maximum permitted wakeups rate (per sec): 150; Observation period: 300 seconds; Task lifetime number of wakeups: 2193951
Aug 27 11:00:58 foo-laptop com.apple.xpc.launchd[1] (com.apple.ReportCrash[92123]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): com.apple.ReportCrash
Aug 27 11:00:58 foo-laptop ReportCrash[92123]: Invoking spindump for pid=85077 wakeups_rate=184 duration=245 because of excessive wakeups
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[702]: Thermal pressure state: 0 Memory pressure state: 0
Aug 27 11:01:03 foo-laptop com.apple.CDScheduler[47]: Thermal pressure state: 0 Memory pressure state: 0

regards, Lin

Check the cardinality of the 'uid' column on both sides. Most probably your join is multiplying the data manyfold. For example, if uid value 1 appears in 100 records of dataframe1 and in 10 records of dataframe2, your join would yield 1,000 records for that value alone (see the sketch below).
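A toy demonstration of that multiplication (made-up dataframes, not your data):

import pandas as pd

# 100 rows with uid=1 on one side, 10 rows with uid=1 on the other
df1 = pd.DataFrame({'uid': [1] * 100, 'a': range(100)})
df2 = pd.DataFrame({'uid': [1] * 10, 'b': range(10)})

merged = pd.merge(df1, df2, on='uid', how='inner')
print(len(merged))  # 1000: every matching pair of rows is emitted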

To check the cardinality, I would do the following:

# For each uid present in both frames, count how often it repeats on each side:
df1[df1.uid.isin(df2.uid.unique())]['uid'].value_counts()
df2[df2.uid.isin(df1.uid.unique())]['uid'].value_counts()

This code shows, for the 'uid' values that are present in the other frame, how many times each value is duplicated.
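If the counts above show heavy duplication, one possible fix is to deduplicate on the join key (or aggregate first) before merging, and to carry only the columns actually needed. A sketch under those assumptions ('col1' and 'col2' are illustrative column names):

# Keep one row per uid on the small side; which row to keep depends on your data
smallDedup = smallDataFrame.drop_duplicates(subset='uid')
# Carry only the columns needed downstream to cut the memory footprint
largeSlim = largeDataFrame[['uid', 'col1', 'col2']]
mergedDataSet = pd.merge(smallDedup, largeSlim, on='uid', how='inner')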
