pd.read_sav and pyreadstat are so slow. How can I speed up pandas for big data if I have to use the SAV/SPSS file format?
Where I work, I've been transitioning away from SPSS for syntax writing/data management to Python and pandas for a higher level of functionality and programming. The issue is that reading SPSS files into pandas is SO slow. I work with fairly big datasets (1 million or more rows, often with 100+ columns). There seem to be some pretty cool plugins out there to speed up processing CSV files, such as Dask and Modin, but I don't think these work with SPSS files. I'd like to continue using pandas, but I have to stick with the SPSS file format (it's what everyone else where I work uses).

Are there any tips on how to accomplish faster data processing, short of computer upgrades and/or file chunking?
You can try to parallelize reading your file.

As an example, I have a file "big.sav" which is 294000 rows x 666 columns. Reading the file with pyreadstat.read_sav (which is what pd.read_spss uses in the background) takes 115 seconds. By parallelizing it I get it down to 29 seconds.

First I create a file worker.py (keeping the worker in its own module means the child processes can import it, which matters on platforms where multiprocessing spawns fresh interpreters, e.g. Windows):
def worker(inpt):
    # each worker reads only its own slice of the file:
    # row_offset skips ahead to the start of the slice, row_limit caps the number of rows read
    import pyreadstat
    offset, chunksize, path = inpt
    df, meta = pyreadstat.read_sav(path, row_offset=offset, row_limit=chunksize)
    return df
And then in the main script I have this:
import multiprocessing as mp
from time import time
import pandas as pd
import pyreadstat
from worker import worker
# calculate the number of rows in the file
_, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
numrows = meta.number_rows
# number of cores in the machine; this could also be set manually to some number, e.g. 8
numcores = mp.cpu_count()
# calculate the chunksize and offsets: divs spreads the remainder rows over the first processes,
# and the largest chunk (divs[0]) is used as the common chunksize
divs = [numrows // numcores + (1 if x < numrows % numcores else 0) for x in range(numcores)]
chunksize = divs[0]
offsets = [indx*chunksize for indx in range(numcores)]
# pack the data for the jobs
jobs = [(x, chunksize, "big.sav") for x in offsets]
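# note: on Windows/macOS (where multiprocessing uses the "spawn" start method),
# the Pool creation and the map call below should sit under an
# `if __name__ == "__main__":` guard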
pool = mp.Pool(processes=numcores)
# let's go!
t0=time()
chunks = pool.map(worker, jobs)
t1=time()
print(t1-t0) # this prints 29 seconds
# chunks is a list of dataframes in the right order
# you can concatenate all the chunks into a single big dataframe if you like
final = pd.concat(chunks, axis=0, ignore_index=True)
EDIT:

This approach is now built into pyreadstat as read_file_multiprocessing. You can use the function like this:
import pyreadstat
fpath = "path/to/file.sav"
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath)
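If you want to control how many worker processes are used, newer pyreadstat versions also accept a num_processes argument; check the documentation of your installed version. A minimal sketch:

import pyreadstat

fpath = "path/to/file.sav"
# read the file with 8 worker processes instead of the default
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath, num_processes=8)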