How to use multiprocessing to pre-process a pandas dataframe in for loop in Python?

I have a dataset of 8500 rows of text. I want to apply a function pre_process to each of these rows. When I do this serially, it takes about 42 minutes on my computer:

import pandas as pd
import time
import re

### constructing a sample dataframe of 10 rows to demonstrate
df = pd.DataFrame(columns=['text'])
df.text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]

def pre_process(text):
    '''
    function to pre-process and clean text
    '''
    stop_words = ['in', 'of', 'at', 'a', 'the']

    # lowercase
    text=str(text).lower()

    # remove special characters except spaces, apostrophes and dots
    text=re.sub(r"[^a-zA-Z0-9.']+", ' ', text)

    # remove stopwords
    text=[word for word in text.split(' ') if word not in stop_words]

    return ' '.join(text)

t = time.time()
for i in range(len(df)):
    df.text[i] = pre_process(df.text[i])

print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

>>> Time taken for pre-processing the data = 41.95724259614944 mins

So I wanted to use multiprocessing for this task. I took help from here and wrote the following code:

import pandas as pd
import multiprocessing as mp

pool = mp.Pool(mp.cpu_count())

def func(text):
    return pre_process(text)

t = time.time()
results = pool.map(func, [df.text[i] for i in range(len(df))])
print('Time taken for pre-processing the data = {} mins'.format((time.time()-t)/60))

pool.close()

But the code just keeps running and never stops.

How can I get it to work?

You can use pandas.Series.apply (df.text is a Series):

df.text= df.text.apply(pre_process)
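As a minimal sketch of this suggestion, assuming the pre_process function from the question and a made-up two-row frame used only for illustration: apply calls the function once per row and the whole column is assigned back in one step, so no explicit index loop is needed. It still runs serially.

```python
import re

import pandas as pd


def pre_process(text):
    """Lowercase, strip special characters and drop stop words (as in the question)."""
    stop_words = ['in', 'of', 'at', 'a', 'the']
    text = str(text).lower()
    text = re.sub(r"[^a-zA-Z0-9.']+", ' ', text)
    return ' '.join(word for word in text.split(' ') if word not in stop_words)


# Hypothetical two-row sample, only for illustration.
df = pd.DataFrame({'text': ['Yet the act is still charming here .',
                            'Part of the charm of Satin Rouge']})

# One call per row, with the result assigned back to the column in one step.
df['text'] = df['text'].apply(pre_process)
print(df['text'].tolist())
```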

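The following code works for me. I don't use func and call pre_process directly. I also use a context manager (a with statement) for the pool.

Below is the code as run in IPython.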

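In [1]: from multiprocessing import Pool, TimeoutError 
    ...: import multiprocessing as mp  # needed for mp.cpu_count() below
    ...: import re                     # needed by pre_process
    ...: import time 
    ...: import os           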

In [2]: text = ["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
    ...:  "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
    ...:  'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
    ...:  "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
    ...:  'Yet the act is still charming here .',
    ...:  "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
    ...:  'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
    ...:  'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
    ...:  "a screenplay more ingeniously constructed than `` Memento ''",
    ...:  "`` Extreme Ops '' exceeds expectations ."]

In [3]: def pre_process(text): 
    ...:     ''' 
    ...:     function to pre-process and clean text 
    ...:     ''' 
    ...:     stop_words = ['in', 'of', 'at', 'a', 'the'] 
    ...:  
    ...:     # lowercase 
    ...:     text=str(text).lower() 
    ...:  
    ...:     # remove special characters except spaces, apostrophes and dots 
    ...:     text=re.sub(r"[^a-zA-Z0-9.']+", ' ', text) 
    ...:  
    ...:     # remove stopwords 
    ...:     text=[word for word in text.split(' ') if word not in stop_words] 
    ...:  
    ...:     return ' '.join(text) 


In [4]: %%time 
    ...: result = [] 
    ...: for x in text: 
    ...:     result.append(pre_process(x)) 
    ...:  
    ...:                                                                                                 
CPU times: user 500 µs, sys: 59 µs, total: 559 µs
Wall time: 569 µs

In [5]: %%time 
    ...: with Pool(mp.cpu_count()) as pool: 
    ...:     results = pool.map(pre_process, text) 
    ...:  
    ...:                                                                                          
CPU times: user 4.58 ms, sys: 29 ms, total: 33.6 ms
Wall time: 137 ms

In [6]: results                                                                                        
Out[6]: 
["rock is destined to be 21st century 's new conan '' and that he 's going to make splash even greater than arnold schwarzenegger jean claud van damme or steven segal .",
 "gorgeously elaborate continuation lord rings '' trilogy is so huge that column words can not adequately describe co writer director peter jackson 's expanded vision j.r.r. tolkien 's middle earth .",
 'singer composer bryan adams contributes slew songs few potential hits few more simply intrusive to story but whole package certainly captures intended er spirit piece .',
 "you 'd think by now america would have had enough plucky british eccentrics with hearts gold .",
 'yet act is still charming here .',
 "whether or not you 're enlightened by any derrida 's lectures on other '' and self '' derrida is an undeniably fascinating and playful fellow .",
 'just labour involved creating layered richness imagery this chiaroscuro madness and light is astonishing .',
 'part charm satin rouge is that it avoids obvious with humour and lightness .',
 "screenplay more ingeniously constructed than memento ''",
 " extreme ops '' exceeds expectations ."]

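%%time is an IPython magic that measures the execution time of a cell. Of course, with sample data this small, multiprocessing actually runs slower because of the overhead of creating new processes.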
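If the code is run as a plain script rather than in IPython, a layout along the following lines is a reasonable starting point. It is a sketch under assumptions, not a diagnosis of the hang in the question: the worker function is defined at module level before the pool is created, the pool is only started under `if __name__ == '__main__':` (needed on platforms that spawn rather than fork worker processes, such as Windows), and a chunksize is passed so that rows are sent to the workers in batches. The names parallel_pre_process and pick_chunksize, and the batch-size heuristic itself, are hypothetical.

```python
import multiprocessing as mp
import re

# Constants hoisted out of the function so they are built once per process.
STOP_WORDS = {'in', 'of', 'at', 'a', 'the'}
SPECIAL_CHARS = re.compile(r"[^a-zA-Z0-9.']+")


def pre_process(text):
    """Same cleaning steps as in the question."""
    text = SPECIAL_CHARS.sub(' ', str(text).lower())
    return ' '.join(w for w in text.split(' ') if w not in STOP_WORDS)


def pick_chunksize(n_items, n_workers, batches_per_worker=4):
    """Hypothetical heuristic: a few batches per worker, at least one item each."""
    return max(1, n_items // (n_workers * batches_per_worker))


def parallel_pre_process(texts, n_workers=None):
    """Clean all texts in a worker pool and return them in the original order."""
    n_workers = n_workers or mp.cpu_count()
    with mp.Pool(n_workers) as pool:
        return pool.map(pre_process, texts,
                        chunksize=pick_chunksize(len(texts), n_workers))


if __name__ == '__main__':
    # Made-up two-item sample; a real run would pass list(df.text).
    sample = ['Yet the act is still charming here .',
              "`` Extreme Ops '' exceeds expectations ."]
    print(parallel_pre_process(sample, n_workers=2))
```

With 8500 rows and 8 workers this heuristic gives batches of 265 rows. For a function as cheap as this one the serial version may still be faster, so it is worth timing both.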

In any case, with a pandas.DataFrame it is more efficient to convert the column/Series to a list with list(), as shown below, than to iterate over it by index.

list(df.text)
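As a quick check, assuming a Series s built from made-up data, the conversion returns the same Python list as indexing element by element. Series.tolist() is an equivalent spelling:

```python
import pandas as pd

# Hypothetical three-element Series, only for illustration.
s = pd.Series(['Yet the act is still charming here .',
               'Part of the charm of Satin Rouge',
               "`` Extreme Ops '' exceeds expectations ."])

by_index = [s[i] for i in range(len(s))]  # one label lookup per element
by_list = list(s)                         # iterates over the Series once
by_tolist = s.tolist()                    # built-in conversion method

print(by_index == by_list == by_tolist)
```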

Below is a performance comparison between using list() and iterating by index as you did. The totals are 197 µs versus 564 µs.

In [52]: %%time 
    ...: [s[i] for i in range(len(s))] 
    ...:  
    ...:                                                                                                
CPU times: user 499 µs, sys: 65 µs, total: 564 µs
Wall time: 506 µs
Out[52]: 
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]

In [53]: %%time 
    ...: list(s) 
    ...:  
    ...:                                                                                                
CPU times: user 174 µs, sys: 23 µs, total: 197 µs
Wall time: 215 µs
Out[53]: 
["The Rock is destined to be the 21st Century 's new `` Conan '' and that he 's going to make a splash even greater than Arnold Schwarzenegger , Jean-Claud Van Damme or Steven Segal .",
 "The gorgeously elaborate continuation of `` The Lord of the Rings '' trilogy is so huge that a column of words can not adequately describe co-writer/director Peter Jackson 's expanded vision of J.R.R. Tolkien 's Middle-earth .",
 'Singer/composer Bryan Adams contributes a slew of songs -- a few potential hits , a few more simply intrusive to the story -- but the whole package certainly captures the intended , er , spirit of the piece .',
 "You 'd think by now America would have had enough of plucky British eccentrics with hearts of gold .",
 'Yet the act is still charming here .',
 "Whether or not you 're enlightened by any of Derrida 's lectures on `` the other '' and `` the self , '' Derrida is an undeniably fascinating and playful fellow .",
 'Just the labour involved in creating the layered richness of the imagery in this chiaroscuro of madness and light is astonishing .',
 'Part of the charm of Satin Rouge is that it avoids the obvious with humour and lightness .',
 "a screenplay more ingeniously constructed than `` Memento ''",
 "`` Extreme Ops '' exceeds expectations ."]
