Combining many regex operations together

Question

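I'm working on an NLP project in Python that involves text processing, and I need to clean the data before extracting features. I use regex operations to strip special characters and to separate numbers from adjacent letters, but I perform each of these as a separate operation, which makes the cleaning slow. I'd like to do it in as few operations as possible, or in some faster way.

My code is as follows: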

import re

def remove_special_char(x):
    if type(x) is str:
        x = x.replace('-', ' ').replace('(', ',').replace(')', ',')
        x = re.compile(r"\s+").sub(" ", x).strip()
        x = re.sub(r'[^A-Z a-z 0-9-,.x]+', '', x).lower()
        x = re.sub(r"([0-9]+(\.[0-9]+)?)",r" \1 ", x).strip()
        x = x.replace(",,",",")
        return x
    else:
        return x

Can anyone help me?

Answer 1

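In addition to preparing compiled patterns outside the function, you can gain some performance by using translate for all of the one-to-one and one-to-nothing character conversions: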

import string
mappings     = {'-':' ', '(':',', ')':','}            # add more mappings as needed
mappings.update({ c:' ' for c in string.whitespace }) # white spaces become spaces
mappings.update({c:c.lower() for c in string.ascii_uppercase}) # set to lowercase
specialChars = str.maketrans(mappings)

def remove_special_char(x):
    x = x.translate(specialChars)
    ...
    return x
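To show how the elided part of the function might be filled in, here is a minimal sketch that combines the translate table with precompiled patterns for the remaining steps. The constant names (SPECIAL_CHARS, MULTI_SPACE_RE, DISALLOWED_RE, NUMBER_RE) are illustrative and not from the answer. It assumes the steps from the question are wanted in the same order. One known difference: string.whitespace covers only ASCII whitespace, so non-ASCII whitespace is dropped by the character filter instead of becoming a space.

```python
import re
import string

# One-to-one character conversions, applied in a single pass by str.translate.
mappings = {'-': ' ', '(': ',', ')': ','}
mappings.update({c: ' ' for c in string.whitespace})             # ASCII whitespace -> space
mappings.update({c: c.lower() for c in string.ascii_uppercase})  # lowercase
SPECIAL_CHARS = str.maketrans(mappings)

# Patterns compiled once at import time (illustrative names).
MULTI_SPACE_RE = re.compile(r" {2,}")             # runs of spaces left after translate
DISALLOWED_RE = re.compile(r"[^a-z0-9 ,.]+")      # input is already lowercase, no hyphens
NUMBER_RE = re.compile(r"([0-9]+(?:\.[0-9]+)?)")  # integers and decimals


def remove_special_char(x):
    if not isinstance(x, str):
        return x
    x = x.translate(SPECIAL_CHARS)
    x = MULTI_SPACE_RE.sub(" ", x).strip()
    x = DISALLOWED_RE.sub("", x)
    x = NUMBER_RE.sub(r" \1 ", x).strip()
    return x.replace(",,", ",")
```

For example, remove_special_char("abc123def") returns 'abc 123 def', and non-string inputs such as None are returned unchanged.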

Answer 2

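The various operations use different replacement strings, so you can't really merge them into one.

You could precompile all of the regular expressions ahead of time, but I suspect it won't make much of a difference (the re module already caches recently compiled patterns):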

import re

paren_re = re.compile(r"[()]")
whitespace_re = re.compile(r"\s+")
# The space is kept in the allowed set, as in the question's pattern;
# without it, all spaces between words would be removed.
ident_re = re.compile(r"[^A-Za-z0-9 ,.-]+")
number_re = re.compile(r"([0-9]+(\.[0-9]+)?)")


def remove_special_char(x):
    if isinstance(x, str):
        x = x.replace("-", " ")
        x = paren_re.sub(",", x)
        x = whitespace_re.sub(" ", x).strip()
        x = ident_re.sub("", x).lower()
        x = number_re.sub(r" \1 ", x).strip()
        x = x.replace(",,", ",")
    return x

Have you profiled your program to confirm that this is actually the bottleneck?
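One way to check is to time the candidates directly. The sketch below uses the standard timeit module to compare calling re.sub with a pattern string against calling a precompiled pattern. The sample text, the function names, and the compare helper are made up for illustration; real measurements should use your own data.

```python
import re
import timeit

WHITESPACE_RE = re.compile(r"\s+")

# Hypothetical sample input; substitute a representative slice of real data.
SAMPLE = "Some   sample (text) with 12.5kg of  noise!! " * 50


def with_module_function():
    # re.sub looks the pattern up in the module's internal cache on each call.
    return re.sub(r"\s+", " ", SAMPLE)


def with_precompiled():
    # Uses the pattern object compiled once at import time.
    return WHITESPACE_RE.sub(" ", SAMPLE)


def compare(number=2000):
    """Return total seconds taken by `number` calls of each variant."""
    return {
        fn.__name__: timeit.timeit(fn, number=number)
        for fn in (with_module_function, with_precompiled)
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f} s")
```

To see where a whole script spends its time, running it under the standard profiler with python -m cProfile -s cumulative your_script.py (the script name is a placeholder) shows whether the cleaning function dominates at all.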


2 solutions

Solution 1
3 2020-04-02 20:50:34

Solution 2
2 2020-04-02 18:48:34
