Applying Spacy Parser to Pandas DataFrame w/ Multiprocessing

Say I have a dataset, like

import pandas as pd
import seaborn as sns

iris = pd.DataFrame(sns.load_dataset('iris'))

I can use Spacy and .apply to parse a string column into tokens (my real dataset has >1 word/token per entry, of course):

import spacy # (I have version 1.8.2)
nlp = spacy.load('en')
iris['species_parsed'] = iris['species'].apply(nlp)

result:

   sepal_length   ... species    species_parsed
0           1.4   ... setosa          (setosa)
1           1.4   ... setosa          (setosa)
2           1.3   ... setosa          (setosa)

I can also use this convenient multiprocessing function (thanks to this blogpost) to run most arbitrary apply functions on a DataFrame in parallel:

from multiprocessing import Pool, cpu_count
import numpy as np
import pandas as pd

def parallelize_dataframe(df, func, num_partitions):
    # Split the DataFrame, map func over the pieces in parallel, reassemble
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_partitions)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

for example:

def my_func(df):
    df['length_of_word'] = df['species'].apply(lambda x: len(x))
    return df

num_cores = cpu_count()
iris = parallelize_dataframe(iris, my_func, num_cores)

result:

   sepal_length species  length_of_word
0           5.1  setosa               6
1           4.9  setosa               6
2           4.7  setosa               6

...But for some reason, I can't apply the Spacy parser to a dataframe using multiprocessing this way.

def add_parsed(df):
    df['species_parsed'] = df['species'].apply(nlp)
    return df

iris = parallelize_dataframe(iris, add_parsed, num_cores)

result:

   sepal_length species  length_of_word species_parsed
0           5.1  setosa               6             ()
1           4.9  setosa               6             ()
2           4.7  setosa               6             ()

Is there some other way to do this? I'm loving Spacy for NLP, but I have a lot of text data, so I'd like to parallelize some processing functions, and I ran into this issue.
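A likely cause of the empty parses: spaCy Doc objects hold a reference to a shared Vocab and don't pickle cleanly in spacy 1.x, so the Docs come back empty when pool.map ships them between processes. A minimal workaround sketch, assuming that is the issue (add_parsed_picklable is a hypothetical name, and this relies on fork-style multiprocessing so the workers inherit the loaded nlp model), is to return plain picklable Python data instead of Doc objects:

def add_parsed_picklable(df):
    # Return lists of plain strings (picklable) rather than Doc objects
    df['species_tokens'] = df['species'].apply(
        lambda s: [token.text for token in nlp(s)])
    return df

iris = parallelize_dataframe(iris, add_parsed_picklable, num_cores)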

Spacy is highly optimised and does the multiprocessing for you. As a result, I think your best bet is to take the data out of the DataFrame and pass it to the Spacy pipeline as a list, rather than trying to use .apply directly.

You then need to collate the results of the parse and put them back into the DataFrame.

So, in your example, you could use something like:

tokens = []
lemma = []
pos = []

for doc in nlp.pipe(df['species'].astype('unicode').values, batch_size=50,
                    n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        pos.append([n.pos_ for n in doc])
    else:
        # We want the lists of parsed results to have the same number of
        # entries as the original DataFrame, so add placeholders in case
        # the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

df['species_tokens'] = tokens
df['species_lemma'] = lemma
df['species_pos'] = pos

This approach will work fine on small datasets, but it eats up your memory, so it's not great if you want to process huge amounts of text.
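If memory does become the bottleneck, one way to keep it bounded is to parse the column a slice at a time and append each slice's token lists to disk, so only one chunk's worth of Docs is ever alive. A rough sketch (parse_chunks_to_csv, chunk_size, and the output filename are illustrative, not part of the original answer):

import pandas as pd

def parse_chunks_to_csv(df, text_col, out_path, chunk_size=1000):
    # Parse chunk_size rows at a time and append the token lists to a CSV,
    # so memory use stays roughly constant regardless of dataset size
    for start in range(0, len(df), chunk_size):
        texts = df[text_col].iloc[start:start + chunk_size].astype('unicode')
        rows = [[t.text for t in doc]
                for doc in nlp.pipe(texts.values, batch_size=50, n_threads=3)]
        pd.DataFrame({'tokens': rows}, index=texts.index).to_csv(
            out_path, mode='a', header=(start == 0))

parse_chunks_to_csv(iris, 'species', 'species_tokens.csv')

Writing with mode='a' and a header only on the first chunk keeps the output a single well-formed CSV.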
