
pd.read_sav and pyreadstat are so slow. How can I speed up pandas for big data if I have to use the SAV/SPSS file format?

I've been transitioning away from SPSS for syntax writing/data management at work to Python and pandas for higher levels of functionality and programming. The issue is, reading SPSS files into pandas is SO slow. I work with bigger datasets (1 million or more rows, often with 100+ columns). It seems that there are some pretty cool plugins out there to speed up processing CSV files, such as Dask and Modin, but I don't think these work with SPSS files. I'd like to continue using pandas, but I have to stick with the SPSS file format (it's what everyone else where I work uses).

Are there any tips on how to accomplish faster data processing, outside of computer upgrades and/or file chunking?

You can try to parallelize reading your file:

As an example, I have a file "big.sav" which is 294000 rows x 666 columns. Reading the file with pyreadstat.read_sav (which is what pd.read_spss uses in the background) takes 115 seconds. By parallelizing it I get 29 seconds.
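For reference, the 115-second baseline is just a plain single-process read; a minimal sketch of how it can be timed, assuming the same "big.sav":

from time import time

import pyreadstat

t0 = time()
df, meta = pyreadstat.read_sav("big.sav")  # single-process read of the whole file
print(time() - t0)  # ~115 seconds here for 294000 rows x 666 columns

The parallel version that gets this down to 29 seconds is below.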

First I create a file worker.py:

def worker(inpt):
    import pyreadstat
    offset, chunksize, path = inpt
    df, meta = pyreadstat.read_sav(path, row_offset=offset, row_limit=chunksize)
    return df

and then in the main script I have this:

import multiprocessing as mp
from time import time

import pandas as pd
import pyreadstat

from worker import worker

# calculate the number of rows in the file
_, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
numrows = meta.number_rows
# calculate the number of cores in the machine; this could also be set manually to some number, e.g. 8
numcores = mp.cpu_count()
# calculate the chunksize and offsets
divs = [numrows // numcores + (1 if x < numrows % numcores else 0) for x in range(numcores)]
chunksize = divs[0]
offsets = [indx*chunksize for indx in range(numcores)] 
# pack the data for the jobs
jobs = [(x, chunksize, "big.sav") for x in offsets]

pool = mp.Pool(processes=numcores)
# let's go!
t0 = time()
chunks = pool.map(worker, jobs)
t1 = time()
print(t1 - t0) # this prints 29 seconds
# chunks is a list of dataframes in the right order
# you can concatenate all the chunks into a single big dataframe if you like
final = pd.concat(chunks, axis=0, ignore_index=True)
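One caveat, as a sketch rather than part of the measured code above: on platforms where multiprocessing uses the spawn start method (Windows in particular), the pool setup has to sit under an if __name__ == "__main__": guard, otherwise every worker re-imports the script and tries to create its own pool. Something along these lines, assuming the same worker.py and "big.sav":

import multiprocessing as mp

import pandas as pd
import pyreadstat

from worker import worker

def main():
    # same chunking logic as above, just wrapped so spawn-based platforms can import the script safely
    _, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
    numrows = meta.number_rows
    numcores = mp.cpu_count()
    chunksize = numrows // numcores + (1 if numrows % numcores else 0)
    jobs = [(i * chunksize, chunksize, "big.sav") for i in range(numcores)]
    with mp.Pool(processes=numcores) as pool:
        chunks = pool.map(worker, jobs)
    return pd.concat(chunks, axis=0, ignore_index=True)

if __name__ == "__main__":
    final = main()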

EDIT:

  1. pyreadstat version 1.0.3 has had a big improvement in performance of about 5x.
  2. In addition, a new function "read_file_multiprocessing" has been added that is a wrapper around the previous code shared in this answer. It can give up to another 3x improvement, making (up to) a 15x improvement compared to the previous version!

You can use the function like this:

import pyreadstat

fpath = "path/to/file.sav" 
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath) 
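If you want to control how many worker processes are used, read_file_multiprocessing also takes a num_processes argument (it defaults to the number of cores, if I read the documentation correctly):

import pyreadstat

fpath = "path/to/file.sav"
# cap the number of worker processes instead of using every core
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath, num_processes=4)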
