How to tokenize compound words?
Given an original list element such as ["southnorth"], I'd like to insert a space based on the reference list ["south", "north", "island"]. The list would then change from ['southnorth'] to ['south', 'north'], as long as the list we base the tokenization on contains both 'south' and 'north'.
However, if the reference list is ["south", "island"], then ["southnorth"] should be kept together as it is.
I thought of something like the following:
import re

list1 = ['southnorth']
# list2 = ['south', 'north', 'island']
list2 = ['south', 'island']
str1 = " ".join(list1)
str2 = " ".join(list2)
Get the alternatives to apply the regex:
list_compound = sorted(list1 + list2, key=len)
alternators = '|'.join(map(re.escape, list_compound))
regex = re.compile(r'({})'.format(alternators))
str1_split = re.sub(r'({})'.format(alternators), r'\1 ', str1, 0, re.IGNORECASE)
str2_split = re.sub(r'({})'.format(alternators), r'\1 ', str2, 0, re.IGNORECASE)
However, the above fails because I need to ensure the order of the sequences. For instance, to decompose ["southnorth"] I need to ensure the other list contains ["south", "north"]. Otherwise, keep it in its original form.
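To make the failure concrete, here is a minimal reproduction (my own illustration, not from the question) using list2 = ['south', 'island']: because the alternation splits on any matching fragment, the word is split even though the remainder 'north' is not a known token.

```python
import re

# Reproduce the failing attempt with list2 = ['south', 'island'],
# so 'north' is NOT a known token and 'southnorth' should stay whole.
list_compound = sorted(['southnorth'] + ['south', 'island'], key=len)
alternators = '|'.join(map(re.escape, list_compound))

# The alternation tries alternatives left to right, so the shorter
# 'south' matches first and a space is inserted regardless of what
# the remaining text is.
result = re.sub(r'({})'.format(alternators), r'\1 ', 'southnorth',
                flags=re.IGNORECASE).strip()
print(result)  # 'south north' -- split, although it should stay intact
```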
Not the prettiest solution and probably not the best performing, but here is a trivial brute-force attempt:
def tokenize(word, tokens):
    tokenized_word = word
    # Insert a space after every known token that occurs in the word.
    for t in tokens:
        tokenized_word = tokenized_word.replace(t, f"{t} ").strip()
    # If any resulting piece is not itself a known token, the word
    # does not fully decompose: return it unchanged.
    for w in tokenized_word.split(" "):
        if w.strip() not in tokens:
            return word
    return tokenized_word
tokens = ["south", "north", "island"]
assert tokenize("south", tokens) == "south"
assert tokenize("southnorth", tokens) == "south north"
assert tokenize("islandsouthnorth", tokens) == "island south north"
assert tokenize("southwestnorth", tokens) == "southwestnorth"
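An alternative sketch that keeps the questioner's regex idea but avoids the order problem: build the alternation only from the known tokens, require it to cover the entire word with re.fullmatch, and only then split. (tokenize_regex is a name introduced here for illustration; with ambiguously overlapping tokens such as 'ab' and 'abc', fullmatch and findall can disagree, so this assumes the token set has no such overlaps.)

```python
import re

def tokenize_regex(word, tokens):
    """Split word into known tokens, or return it unchanged.

    Sorting longest-first makes the alternation prefer longer
    tokens when one token is a prefix of another.
    """
    alternation = "|".join(
        map(re.escape, sorted(tokens, key=len, reverse=True)))
    # Only decompose when the WHOLE word is a concatenation of tokens.
    if re.fullmatch("(?:{})+".format(alternation), word):
        return " ".join(re.findall(alternation, word))
    return word

tokens = ["south", "north", "island"]
print(tokenize_regex("southnorth", tokens))      # south north
print(tokenize_regex("southwestnorth", tokens))  # southwestnorth
```

Unlike the re.sub approach, a leftover fragment such as 'westnorth' makes fullmatch fail, so partial matches never trigger a split.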