
pd.read_sav and pyreadstat are so slow. How can I speed up pandas for big data if I have to use the SAV/SPSS file format?

I've been transitioning away from SPSS for syntax writing/data management at work to Python and pandas for higher levels of functionality and programming. The issue is, reading SPSS files into pandas is SO slow. I work with bigger datasets (1 million or more rows, often with 100+ columns). It seems that there are some pretty cool plugins out there to speed up processing CSV files, such as Dask and Modin, but I don't think these work with SPSS files. I'd like to continue using pandas, but I have to stick with the SPSS file format (it's what everyone else where I work uses).

Are there any tips on how to accomplish faster data processing, outside of computer upgrades and/or file chunking?

You can try to parallelize reading your file:

As an example, I have a file "big.sav" which is 294000 rows x 666 columns. Reading the file with pyreadstat.read_sav (which is what pd.read_spss uses in the background) takes 115 seconds. By parallelizing it I get 29 seconds.
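For reference, the 115-second baseline is just a plain single-process read; a minimal sketch of how it can be timed, assuming the same "big.sav":

from time import time

import pyreadstat

t0 = time()
df, meta = pyreadstat.read_sav("big.sav")  # single-process read of the whole file
print(time() - t0)  # ~115 seconds here for 294000 rows x 666 columns

The parallel version that gets this down to 29 seconds is below.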

First I create a file worker.py:

def worker(inpt):
    import pyreadstat
    offset, chunksize, path = inpt
    df, meta = pyreadstat.read_sav(path, row_offset=offset, row_limit=chunksize)
    return df

and then in the main script I have this:

import multiprocessing as mp
from time import time

import pandas as pd
import pyreadstat

from worker import worker

# calculate the number of rows in the file
_, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
numrows = meta.number_rows
# calculate the number of cores in the machine; this could also be set manually to some number, e.g. 8
numcores = mp.cpu_count()
# calculate the chunksize and offsets
divs = [numrows // numcores + (1 if x < numrows % numcores else 0) for x in range(numcores)]
chunksize = divs[0]
offsets = [indx*chunksize for indx in range(numcores)] 
# pack the data for the jobs
jobs = [(x, chunksize, "big.sav") for x in offsets]

pool = mp.Pool(processes=numcores)
# let's go!
t0 = time()
chunks = pool.map(worker, jobs)
t1 = time()
print(t1 - t0) # this prints 29 seconds
# chunks is a list of dataframes in the right order
# you can concatenate all the chunks into a single big dataframe if you like
final = pd.concat(chunks, axis=0, ignore_index=True)
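One caveat, as a sketch rather than part of the measured code above: on platforms where multiprocessing uses the spawn start method (Windows in particular), the pool setup has to sit under an if __name__ == "__main__": guard, otherwise every worker re-imports the script and tries to create its own pool. Something along these lines, assuming the same worker.py and "big.sav":

import multiprocessing as mp

import pandas as pd
import pyreadstat

from worker import worker

def main():
    # same chunking logic as above, just wrapped so spawn-based platforms can import the script safely
    _, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
    numrows = meta.number_rows
    numcores = mp.cpu_count()
    chunksize = numrows // numcores + (1 if numrows % numcores else 0)
    jobs = [(i * chunksize, chunksize, "big.sav") for i in range(numcores)]
    with mp.Pool(processes=numcores) as pool:
        chunks = pool.map(worker, jobs)
    return pd.concat(chunks, axis=0, ignore_index=True)

if __name__ == "__main__":
    final = main()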

EDIT:

  1. pyreadstat version 1.0.3 has had a big improvement in performance of about 5x.
  2. In addition, a new function "read_file_multiprocessing" has been added that is a wrapper around the previous code shared in this answer. It can give up to another 3x improvement, making (up to) a 15x improvement compared to the previous version!

You can use the function like this:

import pyreadstat

fpath = "path/to/file.sav" 
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath) 
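If you want to control how many worker processes are used, read_file_multiprocessing also takes a num_processes argument (it defaults to the number of cores, if I read the documentation correctly):

import pyreadstat

fpath = "path/to/file.sav"
# cap the number of worker processes instead of using every core
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath, num_processes=4)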
