
Improve Data Preprocessing Speed - Regex in Python

I use the following class in Python to preprocess a string before passing it to a machine learning classification model that predicts its sentiment.

I use regex for most of the transformations, along with some libraries such as emoji and tweet-preprocessor. The code works fine, but I believe it is slow.

Do you have any suggestions on how to improve its speed?

Example of usage:

string = "I am very happy with @easyjet #happy customer 🙂. Second sentence"
preprocessor = TextPreprocessing()
result = preprocessor.text_preprocessor(string)

The result will be: ["i am very happy with happy smiling face", "second sentence", "i am very happy with happy smiling face second sentence"]

import re
import preprocessor as p   # this is the tweet-preprocessor library
import emoji
import numpy as np
import pandas as pd
from unidecode import unidecode   # used in _remove_unicode below

class TextPreprocessing:
    def __init__(self):
        p.set_options(p.OPT.MENTION, p.OPT.URL)

    # remove punctuation
    def _punctuation(self, val):
        val = re.sub(r'[^\w\s]', ' ', val)
        val = re.sub('_', ' ', val)
        return val

    # remove extra whitespace
    def _whitespace(self, val):
        return " ".join(val.split())

    # remove numbers
    def _removenumbers(self, val):
        val = re.sub('[0-9]+', '', val)
        return val

    # remove non-ASCII characters (transliterate via unidecode)
    def _remove_unicode(self, val):
        val = unidecode(val).encode("ascii")
        val = str(val, "ascii")
        return val

    # split string into sentences
    def _split_to_sentences(self, body_text):
        sentences = re.split(
            r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", body_text)
        return sentences

    # cleaning function that combines all of the above
    def _clean_text(self, val):
        val = val.lower()
        val = self._removenumbers(val)
        val = p.clean(val)
        val = ' '.join(self._punctuation(emoji.demojize(val)).split())
        val = self._remove_unicode(val)
        val = self._whitespace(val)
        return val

    def text_preprocessor(self, body_text):
        body_text_df = pd.DataFrame({"body_text": body_text}, index=[1])
        sentence_split_df = body_text_df.copy()
        sentence_split_df["body_text"] = sentence_split_df["body_text"].apply(
            self._split_to_sentences)

        lst_col = "body_text"
        sentence_split_df = pd.DataFrame(
            {
                col: np.repeat(
                    sentence_split_df[col].values, sentence_split_df[lst_col].str.len(
                    )
                )
                for col in sentence_split_df.columns.drop(lst_col)
            }
        ).assign(**{lst_col: np.concatenate(sentence_split_df[lst_col].values)})[
            sentence_split_df.columns
        ]

        final_df = (
            pd.concat([sentence_split_df, body_text_df])
            .reset_index()
            .drop(columns=["index"])
        )

        final_df["body_text"] = final_df["body_text"].apply(self._clean_text)

        return final_df["body_text"]

This question might be relevant to all those data scientists who want to move their NLP models into production.

Since I cannot comment, I will try to answer your question (to some extent):

  1. You should clarify how you measure the execution-time improvement. Use timeit and its repeat functionality for that:
import timeit
from functools import partial
...
if __name__ == "__main__":
    # http://25.io/toau/audio/sample.txt
    with open("sample.txt") as f:
        text = f.read()
        tp = TextPreprocessing()
        print(min(timeit.Timer(partial(tp.text_preprocessor, text)).repeat(repeat=10, number=1)))

You can also use timeit on specific methods to check for bottlenecks.
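As a self-contained sketch of per-method timing, the two functions below are stand-ins that reuse the pattern strings from the question's `_punctuation` and `_removenumbers`; they are not the original class:

```python
import re
import timeit
from functools import partial

# Stand-in versions of two of the class's methods, using the same
# pattern strings as the question's _punctuation and _removenumbers.
def punctuation(val):
    return re.sub(r'[^\w\s]', ' ', val)

def removenumbers(val):
    return re.sub(r'[0-9]+', '', val)

sample = "I am very happy with @easyjet #happy customer. Second sentence."

# Run each candidate bottleneck many times; the min of several repeats
# is the most stable estimate of its cost.
for func in (punctuation, removenumbers):
    best = min(timeit.Timer(partial(func, sample)).repeat(repeat=5, number=1000))
    print(f"{func.__name__}: {best:.6f}s per 1000 calls")
```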

  2. Sadly, I could not run your code sample due to the undefined np. in L58 and L64, so I cannot test my assumptions. Also, you did not provide sample data.

  3. Some general thoughts:
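The answer breaks off here. As an illustration of one common speed-up it might have covered (my sketch and my assumption, not the answerer's text): compile each regex once instead of passing the pattern string on every call. Python's re module does cache compiled patterns internally, so the gain is usually modest, but it avoids repeated cache lookups in tight loops. The class name FastTextPreprocessing and the constant names are mine; the pattern strings are copied from the question.

```python
import re

class FastTextPreprocessing:
    # Patterns compiled once at class-definition time, taken verbatim
    # from the question's _punctuation, _removenumbers and
    # _split_to_sentences methods.
    _PUNCT = re.compile(r'[^\w\s]')
    _UNDERSCORE = re.compile(r'_')
    _DIGITS = re.compile(r'[0-9]+')
    _SENT_SPLIT = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

    def _punctuation(self, val):
        # replace punctuation (and underscores) with spaces
        return self._UNDERSCORE.sub(' ', self._PUNCT.sub(' ', val))

    def _removenumbers(self, val):
        return self._DIGITS.sub('', val)

    def _split_to_sentences(self, body_text):
        return self._SENT_SPLIT.split(body_text)
```

The remaining steps (p.clean, emoji.demojize, unidecode) could be plugged in unchanged.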
