
Regex in Python: Separate words from numbers JUST when not in list

I have a list of substitutions that must be kept intact. For instance, the substitution list: ['1st', '2nd', '10th', '100th', '1st nation', 'xlr8', '5pin', 'h20'].

In general, strings containing alphanumeric tokens need their numbers and letters split apart, as follows:

text = re.sub(r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)', ' ', text, 0, re.IGNORECASE)

This pattern successfully separates all numbers from letters by inserting a space between them, as in the following:

Original       Regex
ABC10 DEF  --> ABC 10 DEF
ABC DEF10  --> ABC DEF 10
ABC 10DEF  --> ABC 10 DEF
10ABC DEF  --> 10 ABC DEF
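
As a minimal, runnable check of the table above, the pattern can be wrapped in a small function. The names SPLIT_RX and split_alnum are only for illustration:

```python
import re

# Zero-width positions between a digit and a non-digit, non-space character (either order)
SPLIT_RX = re.compile(r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)')

def split_alnum(text):
    """Insert a space at every digit/letter boundary (hypothetical helper name)."""
    return SPLIT_RX.sub(' ', text)

for s in ['ABC10 DEF', 'ABC DEF10', 'ABC 10DEF', '10ABC DEF']:
    print(s, '-->', split_alnum(s))
```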

However, some alphanumeric words are part of the substitution list and must not be split. For instance, in the following strings, words such as 1ST are in the substitution list, so they should be left untouched rather than having a space inserted:

Original             Regex                  Expected
1ST DEF 100CD    --> 1 ST DEF 100 CD    --> 1ST DEF 100 CD
ABC 1ST 100CD    --> ABC 1 ST 100 CD    --> ABC 1ST 100 CD
100TH DEF 100CD  --> 100 TH DEF 100 CD  --> 100TH DEF 100 CD
10TH DEF 100CD   --> 10 TH DEF 100 CD   --> 10TH DEF 100 CD

To get the Expected column in the above example, I tried to use an if-then-else (conditional) construct in the regex, but I get a syntax error in Python:

(?(?=condition)(then1|then2|then3)|(else1|else2|else3))

Based on that syntax, I should have something like the following:

?(?!1ST)((?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)))

where (?!...) would list the substitutions to skip when matching the pattern, in this case the words 1ST, 10TH and 100TH.

How can I avoid matching the substitution words in the string?

You can do this with a replacement function that checks whether the matched string is in your exclusion list:

import re

subs = ['1st','2nd','1st nation','xlr8','5pin','h20']
text = """
ABC10 DEF
1ST DEF 100CD
ABC 1ST 100CD
AN XLR8 45X
NO H20 DEF
A4B PLUS
"""

def add_spaces(m):
    if m.group().lower() in subs:
        return m.group()
    res = m.group(1)
    if len(res):
        res += ' '
    res += m.group(2)
    if len(m.group(3)):
        res += ' '
    res += m.group(3)
    return res

text = re.sub(r'\b([^\d\s]*)(\d+)([^\d\s]*)\b', add_spaces, text)
print(text)

Output:

ABC 10 DEF
1ST DEF 100 CD
ABC 1ST 100 CD
AN XLR8 45 X
NO H20 DEF
A 4 B PLUS

You can simplify the replacement function to

def add_spaces(m):
    if m.group().lower() in subs:
        return m.group()
    return m.group(1) + ' ' + m.group(2) + ' ' + m.group(3)

but this might leave extra whitespace in the output string. That could then be removed with

text = re.sub(r' +', ' ', text)
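
Note that collapsing runs of spaces does not remove a single space left at the very start or end of a line, so a strip() may be wanted as well. A small sketch, assuming each line is processed on its own (split_line is a made-up helper name):

```python
import re

subs = ['1st', '2nd', '1st nation', 'xlr8', '5pin', 'h20']

def add_spaces(m):
    # Keep protected words untouched
    if m.group().lower() in subs:
        return m.group()
    # Simplified version: may produce leading, trailing or doubled spaces
    return m.group(1) + ' ' + m.group(2) + ' ' + m.group(3)

def split_line(line):
    """Hypothetical helper: split one line, then tidy up the whitespace."""
    spaced = re.sub(r'\b([^\d\s]*)(\d+)([^\d\s]*)\b', add_spaces, line)
    # Collapse runs of spaces, then drop any space left at either end
    return re.sub(r' +', ' ', spaced).strip()

for line in ['10ABC DEF', 'ABC DEF10', '1ST DEF 100CD']:
    print(repr(split_line(line)))
```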

Another way, using the third-party regex module, (*SKIP)(*FAIL) and f-strings:

import regex as re

lst = ['1st','2nd','1st nation','xlr8','5pin','h20']

data = """
ABC10 DEF
ABC DEF10
ABC 10DEF
10ABC DEF
1ST DEF 100CD
ABC 1ST 100CD"""

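# Escape each entry: in verbose mode (re.X) an unescaped space,
# as in '1ST NATION', would otherwise be ignored by the pattern.
rx = re.compile(
    rf"""
    (?:{"|".join(re.escape(item.upper()) for item in lst)})(*SKIP)(*FAIL)
    |
    (?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)
    """, re.X)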

data = rx.sub(' ', data)
print(data)

This yields

ABC 10 DEF
ABC DEF 10
ABC 10 DEF
10 ABC DEF
1ST DEF 100 CD
ABC 1ST 100 CD
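
One caveat when building a verbose (re.X) pattern from a list: unescaped whitespace in the pattern is ignored, so a multi-word entry such as '1ST NATION' only matches its space if it is escaped. The sketch below uses the standard re module, which is what I have checked this against; the variable names are illustrative only.

```python
import re

entry = '1ST NATION'

# Interpolated as-is: under re.X the space is dropped, so this really means '1STNATION'
naive = re.compile(rf"(?:{entry})", re.X)
# re.escape backslash-escapes the space, which keeps it significant in verbose mode
safe = re.compile(rf"(?:{re.escape(entry)})", re.X)

print(bool(naive.search('1ST NATION')))  # False
print(bool(naive.search('1STNATION')))   # True
print(bool(safe.search('1ST NATION')))   # True
```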

When you deal with exceptions, the easiest and safest way is to use the "best trick ever" approach. When replacing, this trick means: keep what is captured and remove what is merely matched, or vice versa. In regex terms, you use an alternation and put a capturing group around one of the alternatives (or several, in complex scenarios) so that you can inspect the match structure after a match is found.

So, first, use the exception list to build the first part of the alternation:

exception_rx = "|".join(map(re.escape, exceptions))

Note that re.escape adds backslashes where needed so that any special characters in the exceptions are matched literally. If your exceptions are all alphanumeric, you do not need it and can just use exception_rx = "|".join(exceptions). Or even exception_rx = rf'\b(?:{"|".join(exceptions)})\b' to match them only as whole words.
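
To make the difference between these variants concrete, here is a small sketch. The entry 'a+1' and the test strings are made up for illustration; they are not from the question.

```python
import re

exceptions = ['1st', 'h20', 'a+1']   # 'a+1' is a made-up entry with a special character

escaped_rx = "|".join(map(re.escape, exceptions))   # '+' becomes a literal plus sign
unescaped_rx = "|".join(exceptions)                 # here 'a+1' means "one or more a, then 1"
whole_word_rx = rf'\b(?:{escaped_rx})\b'            # only match entries as whole words

print(escaped_rx)                                       # 1st|h20|a\+1
print(re.findall(escaped_rx, 'a+1 aa1', re.I))          # ['a+1']
print(re.findall(unescaped_rx, 'a+1 aa1', re.I))        # ['aa1']
print(re.findall(escaped_rx, 'H200 h20', re.I))         # ['H20', 'h20']
print(re.findall(whole_word_rx, 'H200 h20', re.I))      # ['h20']
```

Without the word boundaries, h20 also matches inside H200, which may or may not be what you want.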

Next, you need the pattern that finds all matches regardless of context, the one I already posted:

generic_rx = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'

Finally, join them using the (exception_rx)|generic_rx scheme:

rx = re.compile(rf'({exception_rx})|{generic_rx}', re.I)   

and replace using .sub():

s = rx.sub(lambda x: x.group(1) or " ", s)

Here, lambda x: x.group(1) or " " means: return the Group 1 value if Group 1 matched, otherwise replace with a space.

See the Python demo:

import re

exceptions = ['1st','2nd','10th','100th','1st nation','xlr8','5pin','h20', '12th'] # '12th' added
exception_rx = '|'.join(map(re.escape, exceptions))
generic_rx = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'
rx = re.compile(rf'({exception_rx})|{generic_rx}', re.I)

string_lst = ['1ST DEF 100CD','ABC 1ST 100CD','WEST 12TH APARTMENT']
for s in string_lst:
    print(rx.sub(lambda x: x.group(1) or " ", s))

Output:

1ST DEF 100 CD
ABC 1ST 100 CD
WEST 12TH APARTMENT
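
One thing that may matter with other lists: alternation tries the entries from left to right, so if one exception is a prefix of another, the shorter one can win and the rest of the longer word is then split. With the list in the demo the order makes no difference, but sorting longest-first is a cheap safeguard. A sketch, where build_rx, split and the entries 'h2' and 'h2o4' are made up for illustration:

```python
import re

GENERIC_RX = r'(?<=\d)(?=[^\d\s])|(?<=[^\d\s])(?=\d)'

def build_rx(exceptions, longest_first=True):
    """Hypothetical helper: compile the (exceptions)|generic pattern."""
    if longest_first:
        # Longer entries first, so a short entry cannot pre-empt a longer one
        exceptions = sorted(exceptions, key=len, reverse=True)
    exception_rx = '|'.join(map(re.escape, exceptions))
    return re.compile(rf'({exception_rx})|{GENERIC_RX}', re.I)

def split(rx, s):
    # Keep Group 1 (a protected word) if it matched, else insert a space
    return rx.sub(lambda m: m.group(1) or ' ', s)

entries = ['h2', 'h2o4']   # made-up entries; the first is a prefix of the second
print(split(build_rx(entries, longest_first=False), 'H2O4'))  # H2 O 4
print(split(build_rx(entries), 'H2O4'))                       # H2O4
```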

