
What's the best way to merge multiple large dataframes in python?

I'm working with the Google Analytics API, pulling all of the dimensions and metrics I need and sorting them into dataframes. My code has nine dataframes in total.

When I try to merge the dataframes I keep getting a "Killed: 9" error message. I know my code is inefficient and is probably taking up a ton of memory as it churns through merge after merge, but I don't know how to fix it.

Here's a sample of the merges...

MergeThree = pd.merge(MergeTwo, dfFour, how = 'outer', on = ['A', 'B', 'C', 'D']).fillna(0)
MergeThree = MergeThree[[
#dimensions
'A', 'B', 'C', 'D', 'E', 'F',
'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
'P',
#metrics
'Q', 'R', 'S', 'T', 'U',
'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC'
]]

MergeFour = pd.merge(MergeThree, dfFive, how = 'outer', on = ['A', 'B', 'C', 'D']).fillna(0)
MergeFour = MergeFour[[
#dimensions
'A', 'B', 'C', 'D', 'E', 'F',
'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
'P', 'AD',
#metrics
'Q', 'R', 'S', 'T', 'U',
'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC'
]]

MergeFive = pd.merge(MergeFour, dfSix, how = 'outer', on = ['A', 'B', 'C', 'D']).fillna(0)
MergeFive = MergeFive[[
#dimensions
'A', 'B', 'C', 'D', 'E', 'F',
'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
'P', 'AD', 'AE',
#metrics
'Q', 'R', 'S', 'T', 'U',
'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC'
]]

etc.

I've tried many different versions of the merges, and the only one I can sort of get to work looks like this:

def MergeProcessThree(x):
    MergeThree = pd.merge(x, dfFourX, how = 'outer', on = ['A', 'B', 'C', 'D']).fillna(0)
    MergeThree = MergeThree[[
    #dimensions
    'A', 'B', 'C', 'D', 'E', 'F',
    'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
    'P',
    #metrics
    'Q', 'R', 'S', 'T', 'U',
    'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC'
    ]]
    MergeThree.to_csv('MergeThree.csv.gz', mode='a', index=False, compression='gzip')

MergeTwoX = pd.read_csv('MergeTwo.csv.gz', chunksize=100, compression='gzip')

for i in MergeTwoX:
    MergeProcessThree(i)

print('Merge Three Complete')

def MergeProcessFour(x):
    MergeFour = pd.merge(x, dfFiveX, how = 'outer', on = ['A', 'B', 'C', 'D']).fillna(0)
    MergeFour = MergeFour[[
    #dimensions
    'A', 'B', 'C', 'D', 'E', 'F',
    'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
    'P', 'AD',
    #metrics
    'Q', 'R', 'S', 'T', 'U',
    'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB', 'AC'
    ]]
    MergeFour.to_csv('MergeFour.csv.gz', mode='a', index=False, compression='gzip')

MergeThreeX = pd.read_csv('MergeThree.csv.gz', chunksize=100, compression='gzip')

for i in MergeThreeX:
    MergeProcessFour(i)

print('Merge Four Complete')

etc.

But the data doesn't look right. It looks like it's essentially being doubled, and rows that appear in the normal merges are missing from the versions broken out by chunk.

I know there has to be a better way to do this and get the results I'm looking for.

Any help on this would be greatly appreciated!

As Chaos mentioned, there isn't a fixed way to do the compression; sometimes you can gain a lot from it and other times it may not help much.

The general idea is that you can use less precision to represent a number if doing so doesn't change the original value, or the change stays within a tolerated threshold. For example, if a column is guaranteed to hold only the binary values {0, 1}, you can use np.int8 instead of the usual np.int32 or np.int64, simply with df[binary_column_name] = df[binary_column_name].astype(np.int8). As another example, np.float16(1.23456789) comes out as 1.234, which is fine if that truncation is acceptable for your application.
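A minimal sketch of that manual downcasting; the column names (is_mobile, pageviews, bounce_rate) are illustrative placeholders, not columns from the question:

import numpy as np
import pandas as pd

# Toy frame with made-up columns, just to show explicit downcasting.
df = pd.DataFrame({
    "is_mobile": [0, 1, 1, 0],                # strictly {0, 1} -> fits in int8
    "pageviews": [12, 340, 7, 58],            # small non-negative ints -> fits in int16
    "bounce_rate": [0.5, 0.25, 0.75, 0.125],  # floats
})

df["is_mobile"] = df["is_mobile"].astype(np.int8)
df["pageviews"] = df["pageviews"].astype(np.int16)
df["bounce_rate"] = df["bounce_rate"].astype(np.float16)  # only if the lost precision is acceptable

print(df.dtypes)
print(df.memory_usage(deep=True))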

You can write a function that does this somewhat automatically; a sketch follows the list below.

  • First, check whether the column is an integer type
    • Then check whether it contains negative values
      • Positive only: check the value range it falls into, e.g. if the max is less than 2^8 = 256 you can represent it with np.uint8, and if it's less than 2^16 with np.uint16
      • Negative values present: similar to the positive case, but check against the signed ranges, e.g. np.iinfo(np.int8) -> min=-128, max=127
  • Float: similar to the above; check the value range and the precision you need
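A rough sketch of such a helper, assuming plain NumPy dtypes with no missing values and treating the observed min/max of each column as a safe guide (downcast_frame and its defaults are hypothetical choices, not a pandas API):

import numpy as np
import pandas as pd

def downcast_frame(df, float_dtype=np.float32):
    """Shrink each numeric column to the narrowest dtype its value range allows."""
    out = df.copy()

    # Integer columns: unsigned if there are no negatives, then the narrowest width that fits.
    for col in out.select_dtypes(include=np.integer).columns:
        col_min, col_max = out[col].min(), out[col].max()
        candidates = (
            [np.uint8, np.uint16, np.uint32, np.uint64] if col_min >= 0
            else [np.int8, np.int16, np.int32, np.int64]
        )
        for dtype in candidates:
            info = np.iinfo(dtype)
            if info.min <= col_min and col_max <= info.max:
                out[col] = out[col].astype(dtype)
                break

    # Float columns: drop to one target precision; only do this if the truncation is acceptable.
    for col in out.select_dtypes(include=np.floating).columns:
        out[col] = out[col].astype(float_dtype)

    return out

If you'd rather not write the range checks yourself, pandas' pd.to_numeric(series, downcast='integer'), downcast='unsigned', or downcast='float' does a similar narrowing one column at a time.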

You can look at system info, or at pandas.DataFrame.memory_usage, to compare how much memory you save after doing the steps above.
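For example, a quick before/after check on a toy column (deep=True also counts object columns, which matters if your dimensions are strings):

import numpy as np
import pandas as pd

df = pd.DataFrame({"pageviews": np.arange(1_000_000, dtype=np.int64)})  # placeholder column

print(df.memory_usage(deep=True).sum())                   # roughly 8 MB at int64
print(df.astype(np.int32).memory_usage(deep=True).sum())  # roughly 4 MB after downcasting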

Also note that some systems don't support certain dtypes, so you may need to convert to an accepted dtype after the merge (e.g. if you want to save a df to Feather, it doesn't accept float16 AFAIK).
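A hedged example of that last step, assuming pyarrow is installed; the column and file names are placeholders:

import numpy as np
import pandas as pd

# Stand-in for the merged result; the float16 column is just for illustration.
merged = pd.DataFrame({"Q": np.array([1.5, 2.5, 3.5], dtype=np.float16)})

# Cast half-precision columns back up before writing, in case the target format rejects float16.
f16_cols = merged.select_dtypes(include=np.float16).columns
merged[f16_cols] = merged[f16_cols].astype(np.float32)
merged.to_feather("merged.feather")  # requires pyarrow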
