Easiest way to read csv files with multiprocessing in Pandas

Here is my question. With a bunch of .csv files (or other files), Pandas is an easy way to read them and save them in DataFrame format. But when the number of files is huge, I want to read the files with multiprocessing to save some time.

My early attempt

I manually divide the files into different paths and run the following on each one separately:

import os
import pandas as pd

os.chdir("./task_1")
files = sorted(os.listdir('.'))
for file in files:
    filename, extname = os.path.splitext(file)
    if extname == '.csv':
        f = pd.read_csv(file)
        # .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
        df = f.VALUE.to_numpy().reshape(75, 90)

And then combine them.

How can I run them with a pool to solve my problem?
Any advice would be appreciated!

Using Pool:

import os
import pandas as pd 
from multiprocessing import Pool

# wrap your csv importer in a function that can be mapped
def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)


def main():

    # get a list of file names
    files = os.listdir('.')
    file_list = [filename for filename in files if filename.endswith('.csv')]

    # set up your pool
    with Pool(processes=8) as pool: # or whatever your hardware can support

        # have your pool map the file names to dataframes
        df_list = pool.map(read_csv, file_list)

        # reduce the list of dataframes to a single dataframe
        combined_df = pd.concat(df_list, ignore_index=True)

if __name__ == '__main__':
    main()
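
An equivalent pattern is available in the standard-library concurrent.futures module, which manages the pool for you. This is not part of the original answer; a minimal sketch, with hypothetical file names and the same one-line read_csv worker as above:

from concurrent.futures import ProcessPoolExecutor
import pandas as pd

def read_csv(filename):
    'converts a filename to a pandas dataframe'
    return pd.read_csv(filename)

if __name__ == '__main__':
    file_list = ['a.csv', 'b.csv']  # hypothetical file names
    with ProcessPoolExecutor(max_workers=8) as executor:
        # executor.map returns the results in the same order as file_list
        df_list = list(executor.map(read_csv, file_list))
    combined_df = pd.concat(df_list, ignore_index=True)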

The dask library is not only designed to solve your problem, it will certainly solve it.
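
dask.dataframe can read a whole set of csv files in parallel from a glob pattern; a minimal sketch, assuming the files live under ./task_1/ as in the question:

import dask.dataframe as dd

# reads and parses the matching files in parallel, lazily
ddf = dd.read_csv('./task_1/*.csv')
# .compute() materializes the result as a regular pandas DataFrame
df = ddf.compute()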

If you aren't against using another library, you could use GraphLab's SFrame. This creates an object similar to a DataFrame which can read data very quickly if performance is a big issue.
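
A minimal sketch, assuming the open-source turicreate package (the successor to GraphLab Create) and a hypothetical file path; the exact API may differ between GraphLab Create versions:

import turicreate as tc

# read the csv into an SFrame (parsing is parallelized internally)
sf = tc.SFrame.read_csv('./task_1/example.csv')  # hypothetical path
# convert back to a pandas DataFrame if the rest of the code expects one
df = sf.to_dataframe()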

I could not get map/map_async to work, but managed to get it working with apply_async.

Two possible ways (I have no idea which one is better):

  • A) Concat at the end
  • B) Concat during

I find glob easy to list and filter files from a directory:

from glob import glob
import pandas as pd
from multiprocessing import Pool

folder = "./task_1/" # note the "/" at the end
file_list = glob(folder + '*.csv')

def my_read(filename):
    f = pd.read_csv(filename)
    # return a DataFrame so the results can be concatenated below
    return pd.DataFrame(f.VALUE.to_numpy().reshape(75, 90))

#DF_LIST = [] # A) end
DF = pd.DataFrame() # B) during

def DF_LIST_append(result):
    # the callback runs in the parent process as each worker finishes
    #DF_LIST.append(result) # A) end
    global DF # B) during
    DF = pd.concat([DF, result], ignore_index=True) # B) during

pool = Pool(processes=8)

for file in file_list:
    pool.apply_async(my_read, args = (file,), callback = DF_LIST_append)

pool.close()
pool.join()

#DF = pd.concat(DF_LIST, ignore_index=True) # A) end

print(DF.shape)
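
As a design note, option A (collecting the results in a list and concatenating once at the end) is usually preferable: calling pd.concat inside the callback copies the growing DataFrame for every file, so option B becomes increasingly expensive as the number of files grows.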
