
More efficient way to preprocess large amount of text in a pandas df?

I have a series of text preprocessing steps organized in the text_preprocessing() function below. (There is more to it, such as converting emojis and removing punctuation; I dropped those steps for clarity.)

import re
import spacy

nlp_model = spacy.load('en_core_web_lg')
nlp_model.add_pipe("merge_entities")

def text_preprocessing(text, lemmatizer):
    # Lowercase and lemmatize each whitespace-separated token.
    text = text.lower()
    text = " ".join([lemmatizer.lemmatize(w) for w in text.split()])
    # Mask any token containing a non-ASCII character.
    text = [w if not re.search(r'[^\x00-\x7F]', w) else "<FOREIGN>" for w in text.split()]
    # Mask named entities, except PERSON and ORG, with their entity type.
    text = [w.text if (not w.ent_type_ or w.ent_type_ == 'PERSON' or w.ent_type_ == 'ORG')
            else f"<{w.ent_type_}>" for w in nlp_model(" ".join(text))]
    text = " ".join(text)
    # Repair tag brackets that spaCy's tokenizer split off or dropped.
    text = re.sub(r"<\s([A-Z]+?)", r"<\1", text)
    text = re.sub(r"([A-Z]+?)\s>", r"\1>", text)
    text = re.sub(r"\s(<[A-Z]+?)\s", r" \1> ", text)
    text = re.sub(r"\s([A-Z]+?>)\s", r" <\1 ", text)
    # Uppercase any token that ended up fully wrapped in tag brackets.
    text = " ".join([w.upper() if ("<" in w and ">" in w) else w for w in text.split()])
    return text

At the moment, I have a working solution, which is as follows:

from nltk.stem import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
df['Preprocessed'] = df['Text'].apply(lambda x: text_preprocessing(x, lmtzr))

I already moved the instantiation of WordNetLemmatizer outside of text_preprocessing() and pass the instance as an argument. Now I am thinking of further optimizing this code, as the database of messages to run on has grown considerably and is now nearing 30,000 rows (30,000 texts to preprocess, with the amount growing day by day). Preprocessing the texts one by one already takes many hours. I tried multiprocessing.Process earlier, but it didn't make much of an impact. I have read about vectorization, but I'm unsure how it could be applied to my situation. I'm also aware of external packages that make it easier to set up multiprocessing for df.apply(), such as the swifter module, but I am hoping to speed things up by more than the 2-4x those offer, since I already have quite a lot of data and will have even more in the future.
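For reference, this is the kind of drop-in change the swifter route would be (a sketch, assuming swifter is installed; it parallelizes the same per-row function):

import swifter  # registers the .swifter accessor on pandas objects

df['Preprocessed'] = df['Text'].swifter.apply(lambda x: text_preprocessing(x, lmtzr))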

Example data can be created with the following function:

import pandas as pd
import numpy as np
import random
def generate_example_data(rows=100):
    df = pd.DataFrame(np.random.randint(0,100,size=(rows, 4)), columns=list('ABCD'))
    df['Text'] = pd.Series(["".join([random.choice("aábcčdeëfghijklmnoópqrsştuvwxyz    ") for i in range(random.randint(25,400))]) for j in range(rows)])
    return df
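A minimal sketch for timing the current approach on this sample data (the 200-row sample size is arbitrary):

import time

sample = generate_example_data(rows=200)
start = time.perf_counter()
sample['Preprocessed'] = sample['Text'].apply(lambda x: text_preprocessing(x, lmtzr))
print(f"{time.perf_counter() - start:.1f}s for {len(sample)} rows")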

I think your reading on vectorized computation is the key here, and I would go that way before considering multithreading or multiprocessing. Some operations can already be vectorized on your DataFrame. You shouldn't worry too much about adding columns to your frame.

For example, your first operation

text = text.lower()

can be replaced with

df['some_col'].str.lower()
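The same idea extends to the other per-row string steps; for example, the lowercasing and the <FOREIGN> masking can both run column-wise in one chain (a sketch; the Text_clean column name is just for illustration):

df['Text_clean'] = (
    df['Text']
    .str.lower()
    # Mask whole tokens containing any non-ASCII character, across all rows at once.
    .str.replace(r'\S*[^\x00-\x7F]\S*', '<FOREIGN>', regex=True)
)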

Furthermore, you can substitute some of your regex operations the same way, since .str.replace() accepts regular expressions. In addition, try as much as you can to make use of the numpy library, though I am not too sure your case is a good fit.
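Beyond the string work, the heaviest step is almost certainly the spaCy call, which currently runs once per row. spaCy can stream all texts through the model in batches (and across worker processes) with nlp.pipe(). Below is a rough sketch that splits your function around the spaCy stage; the helper names and the batch_size/n_process values are illustrative and unbenchmarked:

import re

def pre_spacy(text, lemmatizer):
    # Everything that happens before the spaCy call.
    text = text.lower()
    text = " ".join(lemmatizer.lemmatize(w) for w in text.split())
    return " ".join(w if not re.search(r'[^\x00-\x7F]', w) else "<FOREIGN>"
                    for w in text.split())

def post_spacy(doc):
    # Everything that happens after the spaCy call.
    text = " ".join(w.text if (not w.ent_type_ or w.ent_type_ in ('PERSON', 'ORG'))
                    else f"<{w.ent_type_}>" for w in doc)
    text = re.sub(r"<\s([A-Z]+?)", r"<\1", text)
    text = re.sub(r"([A-Z]+?)\s>", r"\1>", text)
    text = re.sub(r"\s(<[A-Z]+?)\s", r" \1> ", text)
    text = re.sub(r"\s([A-Z]+?>)\s", r" <\1 ", text)
    return " ".join(w.upper() if ("<" in w and ">" in w) else w for w in text.split())

cleaned = [pre_spacy(t, lmtzr) for t in df['Text']]
# nlp.pipe() yields Docs in input order and avoids re-entering the pipeline per row.
docs = nlp_model.pipe(cleaned, batch_size=64, n_process=2)
df['Preprocessed'] = [post_spacy(doc) for doc in docs]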

Good luck!

