Longest common sequence of words from more than two strings

Question

I am trying to find the longest common sequence of words in a list of sentences (more than two sentences).

Example:

import pandas as pd

sentences = ['commercial van for movers', 'partial van for movers', 'commercial van for moving']
sents = pd.Series(sentences)

The solution in this answer works, but it picks up partial words and returns the following:

'ial van for mov'

The output should be

'van for'

I can't find a way to modify it so that it returns the desired output.

Answer 1

The key is to modify the search so that it works on sequences of whole words rather than on characters.

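from itertools import islice

def is_sublist(source, target):
    # True if `source` occurs as a contiguous run of items inside `target`.
    slen = len(source)
    return any(all(item1 == item2 for (item1, item2) in zip(source, islice(target, i, i+slen))) for i in range(len(target) - slen + 1))

def long_substr_by_word(data):
    subseq = []
    # Compare lists of words, not characters, so partial words never match.
    data_seqs = [s.split(' ') for s in data]
    if len(data_seqs) > 1 and len(data_seqs[0]) > 0:
        # Try every contiguous word slice of the first sentence.
        for i in range(len(data_seqs[0])):
            for j in range(len(data_seqs[0])-i+1):
                # Keep the slice only if it beats the best so far and occurs in every sentence.
                if j > len(subseq) and all(is_sublist(data_seqs[0][i:i+j], x) for x in data_seqs):
                    subseq = data_seqs[0][i:i+j]
    return ' '.join(subseq)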

Demo:

>>> data = ['commercial van for movers',
...         'partial van for movers',
...         'commercial van for moving']
>>> long_substr_by_word(data)
'van for'
>>>
>>> data = ['a bx bx z', 'c bx bx zz']
>>> long_substr_by_word(data)
'bx bx'
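
As a rough alternative sketch of the same whole-word idea (not part of the answer above): collect every contiguous run of words in each sentence as a tuple, intersect the sets, and keep the longest run that survives. The names `word_runs` and `longest_common_words` are made up for this illustration, and ties are assumed to go to the run that appears first in the first sentence.

```python
def word_runs(sentence):
    """Return all contiguous runs of words in `sentence` as tuples."""
    words = sentence.split()
    return [tuple(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]

def longest_common_words(sentences):
    """Longest run of whole words shared by every sentence ('' if none)."""
    if not sentences:
        return ''
    first_runs = word_runs(sentences[0])
    # Keep only the runs that also occur in every other sentence.
    common = set(first_runs)
    for sentence in sentences[1:]:
        common &= set(word_runs(sentence))
    # Scan in first-sentence order so ties go to the earliest run
    # (max returns the first maximal item it sees).
    candidates = [run for run in first_runs if run in common]
    return ' '.join(max(candidates, key=len, default=()))

data = ['commercial van for movers',
        'partial van for movers',
        'commercial van for moving']
print(longest_common_words(data))
```

Because whole words are compared as tuple elements, a word such as 'abc' can never match inside 'xabc'. The sketch enumerates a quadratic number of runs per sentence, which is fine for short sentences like these.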

Answer 2

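You can build an ordered list of every contiguous run of words in the first sentence, then search for each candidate string in the other sentences and remove the ones that are not found.

Finally, pick the candidate with the most spaces (that is, the most words); if there is a tie, pick the one with the most characters.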

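from itertools import combinations

mylist = ['commercial van for movers',
          'partial van for movers',
          'commercial van for moving']

s0 = mylist[0].split()

# Every contiguous run of words in the first sentence, as a candidate string.
candidates = [' '.join(s0[slice(*c)]) for c in combinations(range(len(s0) + 1), 2)]
for s in mylist:
    # Pad with spaces so a candidate only matches at word boundaries
    # (a plain `c in s` would also accept 'van' inside 'caravan').
    padded = ' ' + ' '.join(s.split()) + ' '
    candidates = [c for c in candidates if ' ' + c + ' ' in padded]

# Most words first, ties broken by character length; '' if nothing is shared.
max(candidates, key=lambda x: (x.count(' '), len(x)), default='')
# returns:
'van for'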
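
The selection rule described above (most spaces first, then greatest length) works because Python compares tuples element by element. A small sketch with made-up candidates shows the ordering; `rank` is a name chosen for this illustration only:

```python
def rank(phrase):
    # Sort key: number of spaces (one fewer than the word count),
    # then the character length of the phrase.
    return (phrase.count(' '), len(phrase))

# Made-up candidates for illustration only.
candidates = ['van', 'movers', 'a b', 'van for']

# Tuples compare element by element, so 'movers' (0 spaces) loses to any
# two-word phrase, and 'van for' beats 'a b' on character length.
best = max(candidates, key=rank)
print(best)
```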


Answer 1
Score 3, accepted, 2017-11-03 16:30:44

Answer 2
Score 0, 2017-11-03 16:45:25
