Multiprocessing pandas pd.read_sas

Question

I have an extremely large dataset from SAS and want to load it into Python using multiprocessing (if possible). My current code is:

import pandas as pd
from multiprocessing import Pool

sas_file = pd.read_sas('path',
                       encoding='ISO-8859-1',
                       chunksize=100000,
                       iterator=True)


def process_sas(chunk):
    dfs.append(chunk)


if __name__ == '__main__':
    pool = Pool()
    pool.map(process_sas, sas_file)

However, dfs is not defined when I use this method. Is there any way to multiprocess the SAS data set? Separating the data into chunks is not a requirement.

Thanks,

Answer 1

It depends on where you defined dfs.
A quick but not ideal fix: define dfs in the global scope,
and then declare global dfs inside the function and/or method that uses it.

dfs = []  # defined at module (global) scope


def process_sas(chunk):
    global dfs
    dfs.append(chunk)
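
One caveat with the global approach: under multiprocessing each worker process gets its own copy of module globals, so chunks appended to dfs inside a worker do not show up in the parent's list. A minimal sketch of an alternative is to return each processed chunk from the worker and combine the results in the parent. The names process_chunk and load_sas are hypothetical, and the dropna call is only a placeholder for whatever per-chunk work is needed. The file is still read sequentially in the parent and each chunk is pickled to a worker, so this probably only helps when the per-chunk work is expensive.

```python
import pandas as pd
from multiprocessing import Pool


def process_chunk(chunk):
    # Placeholder per-chunk work (assumption): drop rows that are entirely empty.
    # The result is returned rather than appended to a global list.
    return chunk.dropna(how="all")


def load_sas(path, processes=None):
    # With chunksize set, pd.read_sas returns an iterator of DataFrames.
    reader = pd.read_sas(path, encoding="ISO-8859-1", chunksize=100000)
    with Pool(processes) as pool:
        # imap hands chunks to the workers and yields results in order;
        # concat consumes them while the pool is still open.
        return pd.concat(pool.imap(process_chunk, reader), ignore_index=True)
```

Call load_sas('path') from inside an if __name__ == '__main__': guard so that worker processes do not re-run it when they import the module.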

Solution 1, 2021-09-29 14:30:20