
How to use regex to replace tokens in text in Python

Given a list of tokens, I want to replace all the tokens in tokenized text with whitespace.

For example, given ['a', 'is'] and 'this is a test', the result should be 'this test'.

I tried the code from "How can I do multiple substitutions using regex in python?", but the output is 'th test'.

Besides, the list is long (about 1k tokens) and the text file is large, so speed is also important.

This should solve your problem and be reasonably fast. The token list is converted to a set so that each lookup takes O(1) time on average:

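tokens = ['a', 'is']
tokenized_text = 'this is a test'

# Build the set once, outside the loop; calling set(tokens) inside the
# condition would rebuild it for every word and lose the O(1) lookup.
token_set = set(tokens)

val = ' '.join(word for word in tokenized_text.split(' ')
               if word not in token_set)
print(val)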

Prints

this test
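
Since the question asks about regex specifically, here is a minimal sketch of a regex-based variant. The 'th test' output in the question is consistent with a pattern that matches 'is' inside 'this'; wrapping the alternation in \b word boundaries is one way to avoid that. The names pattern and val are illustrative only, and the sketch assumes every token starts and ends with a word character (\b does not behave as a word boundary next to punctuation).

```python
import re

tokens = ['a', 'is']
tokenized_text = 'this is a test'

# Escape each token and wrap the alternation in \b so that only
# whole words match ('is' inside 'this' is left alone).
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, tokens)) + r')\b')

# Replace each matched token with a space, then collapse the
# leftover runs of whitespace into single spaces.
val = ' '.join(pattern.sub(' ', tokenized_text).split())
print(val)
```

For a list of about 1k tokens, the set-based version above is probably the simpler and faster choice when the text is already split on spaces. The compiled pattern may be more convenient when the text should be processed without splitting it first.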
