How to replace tokens in text using regex in Python

Question

Given a list of tokens, I want to replace all occurrences of those tokens in a tokenized text with whitespace.

For example, given ['a', 'is'] and 'this is a test', the result should be 'this test'.

I tried the code from "How can I do multiple substitutions using regex in python?", but the output is 'th test'.

Besides, the list is long (about 1k tokens) and the text file is large, so speed is also important.

Answer 1

This should solve your problem and be reasonably fast. The token list is converted to a set so that each lookup can be done in O(1) time:

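tokens = ['a', 'is']
tokenized_text = 'this is a test'

# Build the set once; rebuilding it for every word would cost O(len(tokens)) each time
token_set = set(tokens)
val = ' '.join(word for word in tokenized_text.split(' ')
               if word not in token_set)
print(val)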

Prints:

this test
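
Since the question asks about regex specifically, here is a minimal sketch of a regex variant. The 'th test' output in the question is most likely because the pattern matched 'is' inside 'this'; requiring whitespace (or the string edge) on both sides of each token avoids that. The function name remove_tokens is made up for illustration, and the sketch assumes tokens are separated by whitespace.

```python
import re

def remove_tokens(text, tokens):
    """Remove whole-token occurrences of `tokens` from whitespace-separated text."""
    # re.escape keeps tokens such as 'a.b' or 'c++' from being read as regex syntax.
    alternatives = '|'.join(re.escape(token) for token in tokens)
    # (?<!\S) and (?!\S) require whitespace or a string edge around the match,
    # so 'is' does not match inside 'this'.
    pattern = re.compile(r'(?<!\S)(?:' + alternatives + r')(?!\S)')
    stripped = pattern.sub('', text)
    # Collapse the gaps left behind by the removed tokens.
    return ' '.join(stripped.split())

print(remove_tokens('this is a test', ['a', 'is']))
```

With about 1k tokens the alternation may well be slower than the set lookup above, so it is worth timing both on your real data before picking one.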

Solution 1
0 Accepted 2020-06-10 15:32:41
