
Faster way to apply a function over a column in pandas data frame

I have to apply several functions to a column to get a list of bigrams, but the way I'm currently using apply is painfully slow. Is there a way to speed it up?

import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams

# Setup not shown in the original question; assumed to be the usual NLTK objects.
stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

def remove_stop_words(text):
    # Lowercase, replace non-word characters with spaces, then drop stopwords.
    cleantext = text.lower()
    cleantext = ' '.join(re.sub(r'[^\w]', ' ', cleantext).strip().split())
    filtered = [w for w in word_tokenize(cleantext) if w not in stop_words]
    return ' '.join(filtered)

def lemmatize(text):
    # Lemmatize each token first as a noun, then as a verb, then as an adjective.
    lemma_word = []
    for w in word_tokenize(text.lower()):
        word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
        word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
        word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
        lemma_word.append(word3)
    return ' '.join(lemma_word)

def get_ngrams(text, n):
    # Use the n argument instead of hard-coding n=2.
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]


df['bigrams'] = df.headline.apply(lambda x: get_ngrams(lemmatize(remove_stop_words(x)),n=2))

Edit (based on comment): The data frame df contains two columns:

1. headline - the news headline, i.e. the text the function is applied to in order to get its bigrams
2. Sentiment score - the score has to be kept in df as well, hence the need for a "bigrams" column in the same data frame

Dataframe df

I found the best way to do this was to parallelize the work using the multiprocessing library.

import numpy as np
import pandas as pd
import re
from multiprocessing import Pool
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams


def get_ngrams(text, n=2):
    # str.split is much cheaper than word_tokenize and is sufficient
    # once the text has already been cleaned.
    n_grams = ngrams(text.split(), n=n)
    return [' '.join(grams) for grams in n_grams]
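As an aside, the bigram step itself does not strictly need nltk; a plain-Python version (a minimal sketch) avoids importing nltk into every worker process:

```python
def get_bigrams_pure(text):
    # Pair each token with its successor by zipping the token list
    # against a slice of itself shifted by one.
    tokens = text.split()
    return [' '.join(pair) for pair in zip(tokens, tokens[1:])]

print(get_bigrams_pure("stock market rallies today"))
# ['stock market', 'market rallies', 'rallies today']
```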


def bigrams(df):
    df['bigrams'] = df.headline.apply(lambda x: get_ngrams(lemmatize(remove_stop_words(x)),n=2))
    return df

def parallelize_dataframe(df, func, n_cores=20):
    # Split the frame into one chunk per worker, process the chunks
    # in parallel, then stitch the results back together.
    df_split = np.array_split(df, n_cores)
    with Pool(n_cores) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

df2 = parallelize_dataframe(df, bigrams)
bigramScore = df2.explode('bigrams')
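For clarity, explode flattens the list column into one row per bigram while duplicating the other columns, so each bigram keeps its sentiment score. A toy sketch with made-up data:

```python
import pandas as pd

# Hypothetical two-row frame mirroring the structure described above.
df = pd.DataFrame({
    'headline': ['stock market rallies', 'rates rise'],
    'score': [0.8, -0.3],
    'bigrams': [['stock market', 'market rallies'], ['rates rise']],
})

# Each list element becomes its own row; 'headline' and 'score'
# are repeated alongside every bigram.
out = df.explode('bigrams')
print(out[['bigrams', 'score']].to_string(index=False))
```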

Note: This is useful only when you have a large number of cores available. If you only have 2-3 cores, this may not be the best approach, since the overhead of splitting the frame and spawning worker processes must also be paid for.
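Rather than hard-coding n_cores=20, one option is to derive the worker count from the machine itself (a sketch; os.cpu_count is standard library, and the usage line below is hypothetical):

```python
import os

# os.cpu_count() can return None on unusual platforms, so guard with `or 1`;
# leaving one core free keeps the main process responsive.
n_cores = max(1, (os.cpu_count() or 1) - 1)

# df2 = parallelize_dataframe(df, bigrams, n_cores=n_cores)  # hypothetical usage
print(n_cores)
```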
