
Python / Pandas - Memory issue while applying a function on a huge dataframe

I have a dataframe with 350 million rows and 3 columns.

Requirement:

I want to split the DESCRIPTION column into a list on the pipe symbol, using as little memory as possible.

input_df.head():

    startTime   DESCRIPTION                                                                                                                                     Response_Time
    1504212340  Business Transaction Performance|Business Transactions|Hexa|mBanking Confirmation.(Confirmation.aspx).no|Average Response Time (ms)_value       6
    1504212340  Business Transaction Performance|Business Transactions|Hexa|mBanking Frontpage.ci|Average Response Time (ms)_value                              4
    1504202341  Business Transaction Performance|Business Transactions|Hexa|mBanking Fonto KTList GenericNS.(GenericNS).dk|Average Response Time (ms)_value     5
    1504202341  Business Transaction Performance|Business Transactions|Hexa|mBanking Transaction Overview.co|Average Response Time (ms)_value                   5
    1504202342  Business Transaction Performance|Business Transactions|Hexa|mBanking Logon.(BidError.aspx).no|Average Response Time (ms)_value                  8

desired_output:

    startTime   list_Description                                                                                                                                             Response_Time
    1504212340  ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Confirmation.(Confirmation.aspx).no', 'Average Response Time (ms)_value']    6
    1504212340  ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Frontpage.ci', 'Average Response Time (ms)_value']                           4
    1504202341  ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Fonto KTList GenericNS.(GenericNS).dk', 'Average Response Time (ms)_value']  5
    1504202341  ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Transaction Overview.co', 'Average Response Time (ms)_value']                5
    1504202342  ['Business Transaction Performance', 'Business Transactions', 'Hexa', 'mBanking Logon.(BidError.aspx).no', 'Average Response Time (ms)_value']               8

My code:

    import pandas as pd
    import glob

    path = r'C:/Users/IBM_ADMIN/Desktop/Delete/Source/app_dynamics/*'    #500 csv files in this location
    all_files = glob.glob(path) 

    #Get the input files and concatenate   
    generator  = (pd.read_csv(f, delimiter='\t', dtype=float) for f in all_files)   #Using parentheses returns a generator instead of a list, mentioning 'dtype=float' helps to use less memory
    input_df   = pd.concat(generator , ignore_index=True)   #results in 350 million rows , 3 columns
    input_df['list_Description'] = input_df['DESCRIPTION'].str.split('|')  #Splitting the string into list

My code's drawbacks

The above code works well for a small number of rows. But if I apply it to 350 million rows, memory usage instantly hits 98% and the system hangs.

A csv might have helped... BUT

If I had 'input_df' in a csv file, I could process it in chunks (by the way, in this case I don't want to write 'input_df' out to a csv :-) ). Since 'input_df' is already a dataframe, I don't know where to start. It would be good if there were a way to use chunksize directly on a dataframe.
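
For reference, a minimal sketch of what I mean by chunking an in-memory dataframe with iloc (chunk_size is an arbitrary example value, and this only saves memory if each chunk is consumed or written out before the next one is built):

    chunk_size = 1_000_000   # arbitrary example value

    for start in range(0, len(input_df), chunk_size):
        chunk = input_df.iloc[start:start + chunk_size]
        split_col = chunk['DESCRIPTION'].str.split('|')   # per-chunk split
        # ... consume or reduce split_col here before the next chunk is built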

Can someone give a better idea to avoid the memory issue, please?

I can't guarantee this will work, since I don't have the same data to test it on, but could you apply your splitting function to the chunks as you read them, so that you don't have to hold that massive column in memory twice?

Amending your code, you could try the following:

    import pandas as pd
    import glob

    path = r'C:/Users/IBM_ADMIN/Desktop/Delete/Source/app_dynamics/*'    #500 csv files in this location
    all_files = glob.glob(path)

    def read_and_split(f):
        chunk = pd.read_csv(f, delimiter='\t', dtype=float)
        chunk['list_Description'] = chunk['DESCRIPTION'].str.split('|')
        return chunk.drop('DESCRIPTION', axis=1)

    #Get the input files and concatenate
    generator  = (read_and_split(f) for f in all_files)   #Using parentheses returns a generator instead of a list, mentioning 'dtype=float' helps to use less memory
    input_df   = pd.concat(generator, ignore_index=True)   #results in 350 million rows , 3 columns

If this still ends up not working, you may check out Dask, which allows you to store very large DataFrames in a distributed fashion.
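
A minimal sketch of that route, assuming the same tab-delimited files and the dask.dataframe API (the Parquet output path is just an example; any per-partition reduction or write would equally well trigger the computation):

    import dask.dataframe as dd

    # Read all 500 files lazily as one partitioned dataframe
    ddf = dd.read_csv(r'C:/Users/IBM_ADMIN/Desktop/Delete/Source/app_dynamics/*', sep='\t')

    # The split is also lazy; nothing is materialized yet
    ddf['list_Description'] = ddf['DESCRIPTION'].str.split('|')
    ddf = ddf.drop('DESCRIPTION', axis=1)

    # Execute partition by partition, e.g. by writing to Parquet, so the full
    # 350 million rows never have to sit in memory at once
    ddf.to_parquet(r'C:/Users/IBM_ADMIN/Desktop/Delete/output_parquet')   # example path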

Since it appears that the DESCRIPTION column contains a lot of duplicate values, you could build a lookup table like so:

    # one row per unique DESCRIPTION, keyed by an integer description_id
    lookup = input_df.DESCRIPTION.drop_duplicates().reset_index(drop=True)
    lookup = lookup.reset_index().rename(columns={'index': 'description_id'})
    # replace the long strings in input_df with the small integer id
    input_df = input_df.merge(lookup, on='DESCRIPTION')
    # split each unique DESCRIPTION only once, into separate columns
    lookup = pd.concat([lookup, lookup.DESCRIPTION.str.split('|', expand=True)],
                       axis=1)

At this point you can get rid of the DESCRIPTION column in both lookup and input_df, since all the necessary information is contained within the columns of the lookup data frame.

    input_df.drop('DESCRIPTION', axis=1, inplace=True)
    lookup.drop('DESCRIPTION', axis=1, inplace=True)

The input_df will now have a description_id column that tells you which row of the lookup data frame contains the info extracted from DESCRIPTION.
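
For illustration, whenever the split pieces are needed again for some rows, the ids can be joined back to the lookup table on demand:

    # re-attach the split columns for a small subset of rows
    sample = input_df.head(5).merge(lookup, on='description_id', how='left')
    print(sample)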
