How can I write each process from multiprocessing to a separate csv file using pandas?

I have several large txt files. Call them mytext01.txt, mytext02.txt, mytext03.txt (in reality there are many more than three). I want to create a separate dataframe for each file that counts occurrences of certain keywords and then write each dataframe to its own csv file. I'd like each txt file to be handled in one process using the multiprocessing library.

I have written code that I thought would do what I wanted, but the csv files never appeared. The code doesn't seem to be doing much of anything: the whole script finishes faster than it normally takes just to load a single file. Here is a simplified version of what I tried:

import pandas as pd
from multiprocessing import Pool

keywords = ['dog', 'cat', 'fish']
path = ''  # directory prefix for the input files; '' (current directory) is an assumption

def count_words(file_number):
    file = path + 'mytext{}.txt'.format(file_number)
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.split(' ')
    # Start every keyword's count at zero
    words_dict = dict(zip(keywords, [0 for i in keywords]))
    for word in words_dict.keys():
        words_dict[word] = text.count(word)
    # One row per keyword, then one csv per input file
    words_df = pd.DataFrame.from_dict(words_dict, orient='index')
    words_df.to_csv('word_counts{}.csv'.format(file_number))

if __name__ == '__main__':
    pool = Pool()
    pool.map(count_words, ['01', '02', '03'])



I'm not super familiar with using multiprocessing, so any idea of what I have done wrong would be much appreciated. Thanks!

In my experience it's better to have a dedicated function for parallelization, such as:

import multiprocessing as mp


def parallelize(fun, vec, cores):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

Now you just have to check that your function count_words works for a single file_number, and then you can use parallelize.
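To make the "check one file first, then parallelize" advice concrete, here is a minimal sketch. It is not the asker's exact code: the in_dir and out_dir parameters, the 'count' column name, the returned output path, and the use of collections.Counter are assumptions added for illustration, and it splits on any whitespace rather than only on single spaces.

```python
import multiprocessing as mp
import os
from collections import Counter

import pandas as pd

KEYWORDS = ['dog', 'cat', 'fish']


def count_words(file_number, in_dir='.', out_dir='.'):
    """Count KEYWORDS in mytext<file_number>.txt and write word_counts<file_number>.csv."""
    # in_dir / out_dir are hypothetical parameters, not part of the original function
    in_path = os.path.join(in_dir, 'mytext{}.txt'.format(file_number))
    with open(in_path, 'r', encoding='utf-8') as f:
        tokens = f.read().split()  # splits on any whitespace, including newlines
    counts = Counter(tokens)  # missing keywords are reported as 0
    df = pd.DataFrame({'count': [counts[k] for k in KEYWORDS]}, index=KEYWORDS)
    out_path = os.path.join(out_dir, 'word_counts{}.csv'.format(file_number))
    df.to_csv(out_path)
    return out_path  # returning the path makes it easy to see what was written


def parallelize(fun, vec, cores):
    # Same helper as above: one pool, one call of `fun` per item in `vec`
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res


if __name__ == '__main__':
    # Only process input files that actually exist in the current directory
    numbers = [n for n in ['01', '02', '03']
               if os.path.exists('mytext{}.txt'.format(n))]
    if numbers:
        # Step 1: run serially on one file so any exception shows up directly
        print(count_words(numbers[0]))
        # Step 2: once that works, fan out over all files
        print(parallelize(count_words, numbers, cores=2))
```

Running the function once outside the pool is the quickest way to surface problems such as an undefined name or a wrong file path before adding multiprocessing on top.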
