Python: find all possible word combinations with a sequence of characters (word segmentation)

I'm running some word segmentation experiments like the following.

lst is a sequence of characters, and output is the list of all possible segmentations into words.

lst = ['a', 'b', 'c', 'd']

def foo(lst):
    ...
    return output

output = [['a', 'b', 'c', 'd'],
          ['ab', 'c', 'd'],
          ['a', 'bc', 'd'],
          ['a', 'b', 'cd'],
          ['ab', 'cd'],
          ['abc', 'd'],
          ['a', 'bcd'],
          ['abcd']]

I've checked combinations and permutations in the itertools library, and also tried combinatorics. However, it seems I'm looking in the wrong place, because this is not a pure permutation or combination problem.

It seems I could achieve this with lots of loops, but that might be inefficient.

EDIT

The word order is important, so combinations like ['ba', 'dc'] or ['cd', 'ab'] are not valid.

The order should always be from left to right.

EDIT

@Stuart's solution doesn't work in Python 2.7.6

EDIT

@Stuart's solution does work in Python 2.7.6, see the comments below.

itertools.product should indeed be able to help you.

The idea is this: consider A1, A2, ..., AN separated by slabs. There will be N-1 slabs. Where there is a slab, there is a segmentation; where there is no slab, there is a join. Thus, for a given sequence of length N, there are 2^(N-1) such combinations.

Like this:

import itertools
lst = ['a', 'b', 'c', 'd']
combinatorics = itertools.product([True, False], repeat=len(lst) - 1)

solution = []
for combination in combinatorics:
    i = 0
    one_such_combination = [lst[i]]
    for slab in combination:
        i += 1
        if not slab: # there is a join
            one_such_combination[-1] += lst[i]
        else:
            one_such_combination += [lst[i]]
    solution.append(one_such_combination)

print(solution)  # parenthesized form runs on both Python 2.7 and Python 3
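
The same slab idea can be wrapped in a function shaped like the foo(lst) the question asks for. This is only a sketch: the name segmentations is hypothetical, and it assumes lst is non-empty (itertools.product rejects a negative repeat).

```python
from itertools import product

def segmentations(lst):
    """Return every left-to-right segmentation of lst (assumed non-empty)."""
    results = []
    # One boolean per gap between neighbours: True = slab (cut), False = join.
    for slabs in product([True, False], repeat=len(lst) - 1):
        words = [lst[0]]
        for item, slab in zip(lst[1:], slabs):
            if slab:
                words.append(item)   # start a new word
            else:
                words[-1] += item    # glue onto the current word
        results.append(words)
    return results

output = segmentations(['a', 'b', 'c', 'd'])
print(len(output))  # 8, i.e. 2 ** (4 - 1)
```

Pairing lst[1:] with the slabs through zip avoids the manual index counter, and the input list is left unmodified.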

There are 8 options, each mirroring the binary numbers 0 through 7:

000
001
010
011
100
101
110
111

Each 0 or 1 indicates whether the two letters at that index are "glued" together: 0 for no, 1 for yes.

>>> lst = ['a', 'b', 'c', 'd']
... output = []
... formatstr = "{{:0{}.0f}}".format(len(lst)-1)
... for i in range(2**(len(lst)-1)):
...     output.append([])
...     s = "{:b}".format(i)
...     s = str(formatstr.format(float(s)))
...     lstcopy = lst[:]
...     for j, c in enumerate(s):
...         if c == "1":
...             lstcopy[j+1] = lstcopy[j] + lstcopy[j+1]
...         else:
...             output[-1].append(lstcopy[j])
...     output[-1].append(lstcopy[-1])
... output
[['a', 'b', 'c', 'd'],
 ['a', 'b', 'cd'],
 ['a', 'bc', 'd'],
 ['a', 'bcd'],
 ['ab', 'c', 'd'],
 ['ab', 'cd'],
 ['abc', 'd'],
 ['abcd']]
>>> 
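
The session above pads the binary string by converting it to a float and back, which would probably lose digits once the input is longer than a float can represent exactly. A possible variant, assuming a non-empty list and using the hypothetical name glue_by_bits, pads the string directly with zfill:

```python
def glue_by_bits(lst):
    """Segment lst by reading each number's binary digits as glue flags."""
    n = len(lst) - 1                      # number of gaps between neighbours
    output = []
    for i in range(2 ** n):
        bits = format(i, "b").zfill(n)    # e.g. 3 -> "011" when n == 3
        words = [lst[0]]
        for item, bit in zip(lst[1:], bits):
            if bit == "1":
                words[-1] += item         # 1: glue to the previous word
            else:
                words.append(item)        # 0: start a new word
        output.append(words)
    return output

result = glue_by_bits(['a', 'b', 'c', 'd'])
print(result)
```

The results come out in the same order as in the session above, from 000 through 111.
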
#!/usr/bin/env python
from itertools import combinations
a = ['a', 'b', 'c', 'd']
a = "".join(a)
cuts = []
for i in range(0,len(a)):
    cuts.extend(combinations(range(1,len(a)),i))
for i in cuts:
    last = 0
    output = []
    for j in i:
        output.append(a[last:j])
        last = j
    output.append(a[last:])
    print(output)

Output:

zsh 2419 % ./words.py  
['abcd']
['a', 'bcd']
['ab', 'cd']
['abc', 'd']
['a', 'b', 'cd']
['a', 'bc', 'd']
['ab', 'c', 'd']
['a', 'b', 'c', 'd']

You can use a recursive generator:

def split_combinations(L):
    for split in range(1, len(L)):
        for combination in split_combinations(L[split:]):
            yield [L[:split]] + combination
    yield [L]

print (list(split_combinations('abcd')))

Edit. I'm not sure how well this would scale for long strings, or at what point it would hit Python's recursion limit. As in some of the other answers, you could also use combinations from itertools to work through every possible combination of split points.

from itertools import combinations

def split_string(s, t):
    return [s[start:finish] for start, finish in zip((None, ) + t, t + (None, ))]

def split_combinations(s):
    for i in range(len(s)):
        for split_points in combinations(range(1, len(s)), i):
            yield split_string(s, split_points)

Both seem to work as intended in Python 2.7 (see here) and Python 3.2 (here). As @twasbrillig says, make sure you indent the code as shown.
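
On the scaling question, a rough sketch may help. Each recursive call in the generator works on a strictly shorter suffix, so the nesting depth should not exceed the length of the input, while the number of results doubles with every extra character. That suggests the output size becomes the practical limit well before the recursion limit does. The snippet below repeats the recursive generator so that it runs on its own, and uses itertools.islice to take only the first few results lazily; the test strings are arbitrary choices.

```python
from itertools import islice

def split_combinations(s):
    # Recursive generator from this answer, repeated so the sketch is self-contained.
    for split in range(1, len(s)):
        for combination in split_combinations(s[split:]):
            yield [s[:split]] + combination
    yield [s]

# A 10-character string already has 2 ** 9 = 512 segmentations.
count = sum(1 for _ in split_combinations("abcdefghij"))
print(count)

# For a longer input, take only the first few results instead of building a list.
word = "a" * 30
first_three = list(islice(split_combinations(word), 3))
print(first_three[0] == ["a"] * 30)
```

Because both versions are generators, nothing is computed until a result is requested, so sampling a handful of segmentations stays cheap even when the full list would be far too large to hold.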
