How to speed up computation time for stopword removal and lemmatization in NLP

As part of pre-processing for a text classification model, I have added stopword removal and lemmatization steps, using the NLTK library. The code is below:

import pandas as pd
import re
import nltk; nltk.download("all")
from nltk.corpus import stopwords; stop = set(stopwords.words('english'))
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# Stopwords removal


def remove_stopwords(entry):
  sentence_list = [word for word in entry.split() if word not in stopwords.words("english")]
  return " ".join(sentence_list)

df["Description_no_stopwords"] = df.loc[:, "Description"].apply(lambda x: remove_stopwords(x))

# Lemmatization

lemmatizer = WordNetLemmatizer()

def punct_strip(string):
  s = re.sub(r'[^\w\s]',' ',string)
  return s

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def lemmatize_rows(entry):
  sentence_list = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in punct_strip(entry).split()]
  return " ".join(sentence_list)

df["Description - lemmatized"] = df.loc[:, "Description_no_stopwords"].apply(lambda x: lemmatize_rows(x))

The problem is that, when I pre-process a dataset with 27k entries (my test set), it takes 40-45 seconds for stopword removal and just as long for lemmatization. By contrast, model evaluation only takes 2-3 seconds.

How can I re-write the functions to optimise computation speed? I have read something about vectorization, but the example functions were much simpler than the ones shown above, and I am not sure how to apply it in this case.

A similar question was asked here, and it suggests caching the stopwords.words("english") object. In your remove_stopwords method you rebuild that list every time you evaluate an entry, so that is the first thing to improve. Regarding your lemmatizer, as mentioned here, you can also cache its results to improve performance. I suspect your pandas operations are also quite expensive; you may consider converting your dataframe into an array or dictionary, iterating over that, and converting back to a dataframe at the end if you need one.
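Putting those suggestions together, a rough sketch might look like the code below. It is untested against your data, and the names STOP_WORDS and lemmatize_word, as well as the final list comprehensions, are just my own choices for illustration; the idea is simply to build the stopword set once, memoize the per-word lemmatization with functools.lru_cache, and work on plain Python lists instead of applying the functions row by row:

import re
from functools import lru_cache

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

# Build the stopword set once instead of calling stopwords.words() per entry.
STOP_WORDS = set(stopwords.words("english"))

lemmatizer = WordNetLemmatizer()

def remove_stopwords(entry):
    return " ".join(word for word in entry.split() if word not in STOP_WORDS)

@lru_cache(maxsize=None)
def lemmatize_word(word):
    # Each distinct word is POS-tagged and lemmatized only once;
    # later occurrences are served from the cache.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                "V": wordnet.VERB, "R": wordnet.ADV}
    return lemmatizer.lemmatize(word, tag_dict.get(tag, wordnet.NOUN))

def lemmatize_rows(entry):
    # Replace punctuation with spaces, then lemmatize word by word.
    return " ".join(lemmatize_word(word)
                    for word in re.sub(r'[^\w\s]', ' ', entry).split())

# Work on plain Python lists and assign the results back as columns,
# rather than applying the functions row by row through pandas.
descriptions = df["Description"].tolist()
no_stop = [remove_stopwords(entry) for entry in descriptions]
df["Description_no_stopwords"] = no_stop
df["Description - lemmatized"] = [lemmatize_rows(entry) for entry in no_stop]

Since pos_tag and lemmatize are now called at most once per distinct word, the speed-up grows with how repetitive your vocabulary is; if most words are unique, the caching will help less and the pandas change will matter more.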
