More efficient way to preprocess a large amount of text in a pandas df?
I have a series of text preprocessing steps organized in the text_preprocessing() function below. (There are a few more steps, such as converting emojis and removing punctuation, which I dropped for clarity.)
```python
import re

import spacy

nlp_model = spacy.load('en_core_web_lg')
nlp_model.add_pipe("merge_entities")

def text_preprocessing(text, lemmatizer):
    text = text.lower()
    text = " ".join([lemmatizer.lemmatize(w) for w in text.split()])
    text = [w if not re.search(r'[^\x00-\x7F]', w) else "<FOREIGN>" for w in text.split()]
    text = [w.text if (not w.ent_type_ or w.ent_type_ == 'PERSON' or w.ent_type_ == 'ORG')
            else f"<{w.ent_type_}>" for w in nlp_model(" ".join(text))]
    text = " ".join(text)
    text = re.sub(r"<\s([A-Z]+?)", r"<\1", text)
    text = re.sub(r"([A-Z]+?)\s>", r"\1>", text)
    text = re.sub(r"\s(<[A-Z]+?)\s", r" \1> ", text)
    text = re.sub(r"\s([A-Z]+?>)\s", r" <\1 ", text)
    text = " ".join([w.upper() if ("<" in w and ">" in w) else w for w in text.split()])
    return text
```
At the moment, I have a working solution, which is as follows:
```python
from nltk.stem import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
df['Preprocessed'] = df['Text'].apply(lambda x: text_preprocessing(x, lmtzr))
```
I already moved the instantiation of WordNetLemmatizer outside of text_preprocessing() and passed the instance in as an argument. Now I am thinking of optimizing this code further, as the database of messages to run on has grown considerably and is nearing 30,000 rows (30,000 texts to preprocess, with the amount growing day by day). Preprocessing the texts one by one already takes many hours. I tried multiprocessing.Process earlier, but it didn't make much of an impact. I have read about vectorization, but I'm unsure how it could be applied to my situation. I'm also aware of external packages that apparently make it easier to set up multiprocessing for df.apply(), such as the swifter module, but I am hoping for more than the 2-4x speedup they give, since I already have quite a lot of data and there will be even more in the future.
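For reference, the chunk-per-worker pattern that such packages set up can also be sketched by hand. This is only a simplified illustration: `parallel_apply` and `preprocess_chunk` are hypothetical helpers, and `str.lower()` stands in for the full text_preprocessing() pipeline.

```python
import math
import multiprocessing as mp

import pandas as pd

def preprocess_chunk(series):
    # stand-in for text_preprocessing() applied to a whole chunk at once
    return series.str.lower()

def parallel_apply(series, chunk_func, n_workers=4):
    # one contiguous chunk per worker, so serialization overhead is paid
    # once per chunk instead of once per row
    size = math.ceil(len(series) / n_workers)
    chunks = [series.iloc[i * size:(i + 1) * size] for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        results = pool.map(chunk_func, chunks)
    return pd.concat(results)

if __name__ == "__main__":
    texts = pd.Series(["Hello World", "FOO bar", "BaZ qux", "Lorem IPSUM"])
    print(parallel_apply(texts, preprocess_chunk, n_workers=2).tolist())
    # → ['hello world', 'foo bar', 'baz qux', 'lorem ipsum']
```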
Example data can be created with the following function:
```python
import pandas as pd
import numpy as np
import random

def generate_example_data(rows=100):
    df = pd.DataFrame(np.random.randint(0, 100, size=(rows, 4)), columns=list('ABCD'))
    df['Text'] = pd.Series(
        ["".join([random.choice("aábcčdeëfghijklmnoópqrsştuvwxyz ")
                  for i in range(random.randint(25, 400))])
         for j in range(rows)]
    )
    return df
```
I think your reading on vectorized computation is the key here, and I would go that way before considering multithreading or multiprocessing. Some of the operations can already be vectorized on your DataFrame. You shouldn't worry too much about adding columns to your frame.
For example, your first operation

```python
text = text.lower()
```

can be replaced with

```python
df['some_col'].str.lower()
```
Furthermore, you can substitute some of your regex operations using this thread. I also find this a good source. In addition, try as much as you can to make use of the numpy library (though I'm not too sure your case is a good fit for it).
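As a concrete sketch, here are two of your regex cleanup substitutions done column-wise with vectorized string methods. The column name is taken from your question; the patterns are lowered to match the already-lowercased text, so this is an illustration rather than the full pipeline.

```python
import pandas as pd

df = pd.DataFrame({"Text": ["Hello < PERSON > World", "FOO BAR"]})

# vectorized lowercasing instead of a per-row text.lower()
out = df["Text"].str.lower()

# vectorized versions of two of the placeholder cleanup substitutions
out = out.str.replace(r"<\s([a-z]+?)", r"<\1", regex=True)
out = out.str.replace(r"([a-z]+?)\s>", r"\1>", regex=True)

print(out.tolist())  # → ['hello <person> world', 'foo bar']
```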
Good luck!