
Numbering a certain word in a text

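I want to add a reference number to certain words in a text.

With the code below I do get some correct output. However, it breaks when two keywords share a word (for example, they differ only by an adjective) or when a word in the text carries a suffix.

Those are the two edge cases I can think of: keywords that contain the same word, adjective included, and a word that carries a suffix in the text but should still match its dictionary entry.

I tried this: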

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

for keyword, number in words_to_number.items():
    pattern = r"\b"+keyword+r"\b"
    text = re.sub(pattern, keyword+" ("+str(number)+")", text)

print(text)

I got: This is a first sample (3) (1) and this is a second sample (3) (2).

Instead of: This is a first sample (1) and this is a second sample (2).

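The problem here is that you put each keyword back into the text after matching it, so any later keyword that is a substring of an earlier one (here, "sample") can still match inside the text you just put back.

Consider what happens when you don't put the matched keyword back: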

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

for keyword, number in words_to_number.items():
    pattern = rf"\b{keyword}\b"
    text = re.sub(pattern, f"({number})", text)

print(text)  # This is a (1) and this is a (2).

To fix this, you can use the numbers as placeholders and put each keyword back in a second for loop:

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

for keyword, number in words_to_number.items():
    pattern = rf"\b{keyword}\b"
    text = re.sub(pattern, f"({number})", text)

print(text)  # This is a (1) and this is a (2).

for keyword, number in words_to_number.items():
    pattern = rf"\({number}\)"
    text = re.sub(pattern, f"{keyword} ({number})", text)

print(text)  # This is a first sample (1) and this is a second sample (2).

As a single statement: build one regular expression by joining the individual patterns with |, and use the callback option of re.sub.

import re

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}


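# Alternatives are tried left to right, so the longer keywords must come before "sample".
regex = "|".join("({})".format(k) for k in words_to_number)

text_new = re.sub(
    regex,
    lambda m: "{} ({})".format(m.group(), words_to_number[m.group()]),
    text,
)

print(text_new)  # This is a first sample (1) and this is a second sample (2).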
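The single-regex answer relies on the dictionary listing longer keywords first and on keywords containing no regex metacharacters. A minimal sketch that removes both assumptions is shown here; the function name `number_keywords` is hypothetical, and the extra "One more sample." sentence is added only to show the bare keyword being numbered.

```python
import re


def number_keywords(text, words_to_number):
    """Append " (n)" after each keyword occurrence in a single regex pass."""
    if not words_to_number:
        return text  # nothing to match; avoids building an empty alternation
    # Longest keywords first, so "first sample" wins over the shorter "sample".
    keywords = sorted(words_to_number, key=len, reverse=True)
    # re.escape guards against keywords that contain regex metacharacters.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")
    # The callback only ever sees the original text, so inserted numbers are never rescanned.
    return pattern.sub(lambda m: f"{m.group()} ({words_to_number[m.group()]})", text)


text = "This is a first sample and this is a second sample. One more sample."
# Deliberately listed shortest-first to show that dictionary order no longer matters.
words_to_number = {"sample": 3, "first sample": 1, "second sample": 2}
print(number_keywords(text, words_to_number))
# This is a first sample (1) and this is a second sample (2). One more sample (3).
```

Note that the `\b` anchors assume each keyword starts and ends with a word character; matching is also case-sensitive, as in the original code.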

Personally, I would drop the regex entirely and use Fractalism's approach:

text = "This is a first sample and this is a second sample."
words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}

for word, number in words_to_number.items():
    text = text.replace(word, str(number))

for word, number in words_to_number.items():
    text = text.replace(str(number), f"{word} ({number})")

print(text)  # This is a first sample (1) and this is a second sample (2).

Regex seems like overkill in this case, since you are only matching predefined strings with no other pattern.
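One caveat with using bare numbers as placeholders is that any digit already present in the text would also be expanded in the second loop. A hedged variation, still regex-free, wraps each number in a delimiter that is unlikely to appear in normal text; the NUL-character delimiter and the name `number_keywords_plain` are assumptions made for this sketch.

```python
def number_keywords_plain(text, words_to_number):
    """Two-pass str.replace numbering with collision-resistant placeholders."""
    # Longest keywords first, so "first sample" is consumed before "sample".
    ordered = sorted(words_to_number.items(), key=lambda kv: len(kv[0]), reverse=True)
    # Pass 1: swap each keyword for a delimited placeholder (assumes no NUL chars in text).
    for word, number in ordered:
        text = text.replace(word, f"\x00{number}\x00")
    # Pass 2: expand each placeholder back into "keyword (number)".
    for word, number in ordered:
        text = text.replace(f"\x00{number}\x00", f"{word} ({number})")
    return text


words_to_number = {"first sample": 1, "second sample": 2, "sample": 3}
text = "Sample 1: this is a first sample and a second sample."
print(number_keywords_plain(text, words_to_number))
# Sample 1: this is a first sample (1) and a second sample (2).
```

Because str.replace has no notion of word boundaries, this version would also match "sample" inside "samples"; whether that is desirable depends on how suffixed words should be treated.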

