How to tokenize compound words?
I have a source list such as ["southnorth"], and I want to insert a space based on a token list ["south", "north", "island"]. As long as the token list contains both 'south' and 'north', the list should change from ['southnorth'] to ['south', 'north']. However, if the token list is only ["south", "island"], then the list ["southnorth"] should stay as it is.
My idea is as follows:
list1 = ['southnorth']
# list2 = ['south', 'north', 'island']
list2 = ['south', 'island']
str1 = " ".join(list1)
str2 = " ".join(list2)
Build the alternation and apply the regex:
import re
list_compound = sorted(list1 + list2, key=len)
alternators = '|'.join(map(re.escape, list_compound))
regex = re.compile(r'({})'.format(alternators), re.IGNORECASE)
str1_split = regex.sub(r'\1 ', str1)
str2_split = regex.sub(r'\1 ', str2)
However, the above fails because I need to ensure the order of the sequence. For example, to split ["southnorth"] I need to make sure the other list contains both ["south", "north"]; otherwise, the original format should be kept.
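One way to enforce the "only split when the whole word is covered by known tokens" requirement is to validate the word with a full-match alternation before splitting. This is a sketch, not the asker's code; the helper name `split_if_covered` is made up for illustration:

```python
import re

tokens = ["south", "north", "island"]

# Longest tokens first, so a longer token is preferred over a shorter prefix.
alternation = "|".join(map(re.escape, sorted(tokens, key=len, reverse=True)))
# The word must be a concatenation of known tokens from start to end.
covered = re.compile("(?:{})+$".format(alternation))

def split_if_covered(word):
    # Split only when the entire word is composed of known tokens;
    # otherwise return it unchanged, wrapped in a list.
    if covered.match(word):
        return re.findall(alternation, word)
    return [word]
```

Note that for pathological token sets (where one token is a prefix of another and only one segmentation covers the word) the greedy `findall` can disagree with the backtracking full match, but for token lists like the one above it behaves as expected.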
Not the prettiest solution, and probably not the best one either, but here is a trivial brute-force attempt:
def tokenize(word, tokens):
    # Insert a space after every occurrence of a known token.
    tokenized_word = word
    for t in tokens:
        tokenized_word = tokenized_word.replace(t, f"{t} ").strip()
    # If any resulting piece is not a known token, fall back to the original word.
    for w in tokenized_word.split(" "):
        if w.strip() not in tokens:
            return word
    return tokenized_word
tokens = ["south", "north", "island"]
assert tokenize("south", tokens) == "south"
assert tokenize("southnorth", tokens) == "south north"
assert tokenize("islandsouthnorth", tokens) == "island south north"
assert tokenize("southwestnorth", tokens) == "southwestnorth"
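The replace-based approach above can mis-split when one token is a substring of another (the order of `replace` calls then matters). A recursive segmentation avoids that by trying each token only as a prefix of the remaining word; the helper name `segment` is hypothetical, and it returns `None` instead of the original word when no full segmentation exists:

```python
def segment(word, tokens):
    # Return a list of tokens that exactly concatenate to `word`,
    # or None if no such segmentation exists.
    if not word:
        return []
    for t in tokens:
        if word.startswith(t):
            rest = segment(word[len(t):], tokens)
            if rest is not None:
                return [t] + rest
    return None
```

For example, `segment("southnorth", ["south", "north", "island"])` yields `["south", "north"]`, while `segment("southwestnorth", ...)` yields `None`, so the caller can keep the original word in that case.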