Dask high memory consumption when loading multiple Pandas dataframes in a dictionary

Question

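I have a folder (7.7GB) containing multiple pandas dataframes stored in parquet file format. I need to load all these dataframes into a python dictionary, but since I only have 32GB of RAM, I use the .loc method to load just the data I need.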
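
As a small illustration of the slicing step, here is a sketch of partial-string indexing with .loc on a DatetimeIndex. The frame and its `price` column are made up for the example. Note that in `pd.read_parquet(path).loc[start:end]` the whole file is read into memory first and only then sliced, so the slice limits what is kept, not what is read.

```python
import pandas as pd

# Hypothetical daily data spanning parts of three calendar years.
idx = pd.date_range("2014-12-30", "2016-01-02", freq="D")
df = pd.DataFrame({"price": range(len(idx))}, index=idx)

# Partial-string slicing: both endpoints are inclusive, so this keeps
# every row whose timestamp falls in calendar year 2015.
subset = df.loc["2015":"2015"]

print(subset.index.min(), subset.index.max(), len(subset))
```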

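Once all the dataframes are loaded in memory in a python dictionary, I create a common index from the indexes of all the data, then I reindex all the dataframes with the new index.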
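
The common-index step described above can be sketched on toy data as follows. The two frames and the `close` column are hypothetical. One detail worth noting is that `Index.union` returns a new index rather than modifying the existing one, so its result has to be assigned.

```python
import pandas as pd

# Two hypothetical symbols with partly different timestamps.
a = pd.DataFrame(
    {"close": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2015-01-02 09:30", "2015-01-02 09:31", "2015-01-02 09:33"]),
)
b = pd.DataFrame(
    {"close": [10.0, 20.0]},
    index=pd.to_datetime(["2015-01-02 09:31", "2015-01-02 09:32"]),
)
data = {"A": a, "B": b}

# Index.union returns a NEW index; the result has to be assigned.
common = None
for frame in data.values():
    common = frame.index if common is None else common.union(frame.index)

# method="pad" forward-fills; rows before a frame's first timestamp stay NaN.
aligned = {name: frame.reindex(index=common, method="pad") for name, frame in data.items()}

print(aligned["B"])
```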

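I developed two scripts to do this: the first works in the classic sequential way, the second uses Dask to get some performance improvement from all the cores of my Threadripper 1920x.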

Sequential code:

# Standard library imports
import os
import pathlib
import time

# Third party imports
import pandas as pd

# Local application imports


class DataProvider:

    def __init__(self):

        self.data = dict()

    def load_parquet(self, source_dir: str, timeframe_start: str, timeframe_end: str) -> None:

        t = time.perf_counter()

        symbol_list = list(file for file in os.listdir(source_dir) if file.endswith('.parquet'))

        # updating containers
        for symbol in symbol_list:

            path = pathlib.Path.joinpath(pathlib.Path(source_dir), symbol)
            name = symbol.replace('.parquet', '')

            self.data[name] = pd.read_parquet(path).loc[timeframe_start:timeframe_end]

        print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')

        t = time.perf_counter()

        # building index
        index = None

        for symbol in self.data:

            if index is not None:
                # Index.union returns a new index, so the result must be assigned
                index = index.union(self.data[symbol].index)
            else:
                index = self.data[symbol].index

        print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')

        t = time.perf_counter()

        # reindexing data
        for symbol in self.data:

            self.data[symbol] = self.data[symbol].reindex(index=index, method='pad').itertuples()

        print(f'Indexed data in {round(time.perf_counter() - t, 3)} seconds.')


if __name__ == '__main__' or __name__ == 'builtins':

    source = r'WindowsPath'

    x = DataProvider()
    x.load_parquet(source_dir=source, timeframe_start='2015', timeframe_end='2015')

Dask code:

# Standard library imports
import os
import pathlib
import time

# Third party imports
from dask.distributed import Client
import pandas as pd

# Local application imports


def __load_parquet__(directory, timeframe_start, timeframe_end):
    return pd.read_parquet(directory).loc[timeframe_start:timeframe_end]


def __reindex__(new_index, df):
    return df.reindex(index=new_index, method='pad').itertuples()


if __name__ == '__main__' or __name__ == 'builtins':

    client = Client()

    source = r'WindowsPath'
    start = '2015'
    end = '2015'

    t = time.perf_counter()

    file_list = [file for file in os.listdir(source) if file.endswith('.parquet')]

    # build data
    data = dict()
    for file in file_list:

        path = pathlib.Path.joinpath(pathlib.Path(source), file)
        symbol = file.replace('.parquet', '')

        data[symbol] = client.submit(__load_parquet__, path, start, end)

    print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')

    t = time.perf_counter()

    # build index
    index = None
    for symbol in data:
        if index is not None:
            # Index.union returns a new index, so the result must be assigned
            index = index.union(data[symbol].result().index)
        else:
            index = data[symbol].result().index

    print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')

    t = time.perf_counter()

    # reindex
    for symbol in data:
        data[symbol] = client.submit(__reindex__, index, data[symbol].result())

    print(f'Indexed data in {round(time.perf_counter() - t, 3)} seconds.')

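I found the results weird.

Sequential code:

Maximum memory consumption during computation: 30.2GB
Memory consumption at the end of computation: 15.6GB
Total memory consumption (excluding Windows and others): 11.6GB
Loaded data in 54.289 seconds.
Built index in 0.428 seconds.
Indexed data in 9.666 seconds.

Dask code:

Maximum memory consumption during computation: 25.2GB
Memory consumption at the end of computation: 22.6GB
Total memory consumption (excluding Windows and others): 18.9GB
Loaded data in 0.638 seconds.
Built index in 27.541 seconds.
Indexed data in 30.179 seconds.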
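
The figures above look like system-wide readings. To see how much memory the DataFrames themselves account for, one option is to sum `DataFrame.memory_usage` over the dictionary. The helper name `frames_memory_bytes` and the toy frames below are hypothetical, and the measurement only applies while the dictionary values are still DataFrames (that is, before `.itertuples()` is called on them).

```python
import pandas as pd


def frames_memory_bytes(frames: dict) -> int:
    """Sum the in-memory size of every DataFrame in a dict (hypothetical helper)."""
    # deep=True also counts the contents of object (e.g. string) columns.
    return sum(int(df.memory_usage(index=True, deep=True).sum()) for df in frames.values())


# Toy stand-ins for the loaded symbols.
frames = {
    "A": pd.DataFrame({"x": range(1000)}),
    "B": pd.DataFrame({"x": range(2000)}),
}
total = frames_memory_bytes(frames)
print(f"{total / 1e6:.3f} MB held in DataFrames")
```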

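My questions:

Why is the memory consumption at the end of the computation so much higher with Dask?
Why does it take Dask so much time to build the common index and reindex all the dataframes?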
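
One thing that may explain part of the timing shift: `client.submit` returns a future immediately, so the 0.638 s "load" figure measures only how long it takes to schedule the tasks. The reading itself happens in the background and is paid for when `.result()` is called in the index loop. The sketch below shows the same submit/result pattern with the standard library's `concurrent.futures` as a stand-in (it is not Dask itself), and `slow_load` is a hypothetical placeholder for reading a file.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_load(name: str) -> str:
    """Hypothetical stand-in for an expensive read."""
    time.sleep(0.2)
    return name.upper()


with ThreadPoolExecutor(max_workers=4) as pool:
    t = time.perf_counter()
    futures = {name: pool.submit(slow_load, name) for name in ["a", "b", "c", "d"]}
    submit_seconds = time.perf_counter() - t  # scheduling only

    t = time.perf_counter()
    # .result() blocks until each task has actually finished.
    results = {name: fut.result() for name, fut in futures.items()}
    wait_seconds = time.perf_counter() - t

print(f"submit: {submit_seconds:.3f}s, result: {wait_seconds:.3f}s")
```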

Also, when using the Dask code, the console prints the following error.

C:\Users\edit\Anaconda3\envs\edit\lib\site-packages\distributed\worker.py:901: UserWarning: Large object of size 5.41 MB detected in task graph:
(DatetimeIndex(['2015-01-02 09:30:00', '2015-01-02 ... s x 5 columns])
Consider scattering large objects ahead of time with client.scatter to reduce  scheduler burden and keep data on workers
future = client.submit(func, big_data)    # bad
big_future = client.scatter(big_data)     # good
future = client.submit(func, big_future)  # good
% (format_bytes(len(b)), s))

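Even though the suggestions in the error are really good, I don't understand what is wrong with my code. Why does it say to keep data on the workers? I thought that with the submit method I was sending all the data to my client, so the workers would have easy access to all of it. Thank you all for your help.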
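
A sketch of what the warning's suggestion could look like for this workflow. In the code above, `data[symbol].result()` pulls each frame into the client process, and passing that concrete frame (and the index) to `client.submit` embeds it in the task graph, which is likely what triggers the warning and some of the extra memory. The version below passes futures between tasks and scatters the common index once. It assumes `client` is a `dask.distributed.Client`; the function names and the `paths` mapping are hypothetical, and it keeps the results as DataFrames rather than calling `.itertuples()`, for simplicity.

```python
import pandas as pd


def load_frame(path, start, end):
    """Read one parquet file and keep only the requested time range."""
    return pd.read_parquet(path).loc[start:end]


def reindex_frame(new_index, df):
    """Align a frame to the common index, forward-filling gaps."""
    return df.reindex(index=new_index, method="pad")


def build_common_index(indexes):
    """Union an iterable of pandas indexes into one."""
    common = None
    for idx in indexes:
        common = idx if common is None else common.union(idx)
    return common


def run(client, paths, start, end):
    """Sketch only: `client` is a dask.distributed.Client, `paths` maps symbol -> file path."""
    # Each load runs on a worker; the dict values are futures, not DataFrames.
    loaded = {sym: client.submit(load_frame, p, start, end) for sym, p in paths.items()}
    # Bring back only the indexes rather than whole frames.
    index_futures = [client.submit(getattr, fut, "index") for fut in loaded.values()]
    common = build_common_index(client.gather(index_futures))
    # Ship the common index to the workers once, then pass futures around.
    index_future = client.scatter(common, broadcast=True)
    return {sym: client.submit(reindex_frame, index_future, fut) for sym, fut in loaded.items()}
```

With a running cluster this would be called as `run(Client(), paths, '2015', '2015')` inside the `if __name__ == '__main__':` guard; the returned dictionary again holds futures, so the aligned frames stay on the workers until they are explicitly gathered.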

Answer 1

I'm not an expert at all, just trying to help. You might want to try without time.perf_counter and see if anything changes.

-1 2020-12-25 22:26:12
