Merge strings if they are part of another, larger string in the same list

Question

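Given a list of sentences and of words that those sentences may contain, I want to remove the contained strings from the list and merge them into the largest string that contains them (if one exists). Every occurrence of such a "part" should be added to the occurrence count of the largest string.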

from collections import defaultdict

sentence_parts = ['quick brown', 'brown fox', 'fox', 'lazy dog',
                  'quick brown fox jumps over the lazy dog',]

sentences_with_count = defaultdict(int)

for s in sentence_parts:
    matching_sentences = sorted([si for si in sentence_parts if s in si and len(si) > len(s)],
                                key=len, reverse=True)
    if matching_sentences:
        current_sent_count = sentences_with_count.get(s, 1)
        sentences_with_count[matching_sentences[0]] += current_sent_count
    else:
        sentences_with_count[s] += 1

print(sentences_with_count)

So the resulting sentences_with_count will be:

{
    'quick brown fox jumps over the lazy dog': 5
}

The code is also on repl.it.

I know this is not efficient at all. How can I improve it?

More examples:

sentence_parts = ['The', 'Ohio State', 'Ohio', 
                  'Paris, France', 'Paris',
                  'The Ohio State University']

>>> {'The Ohio State University': 4, 'Paris, France': 2}

sentence_parts = ['Obama', 'Barack', 'Barack Hussein Obama']

>>> {'Barack Hussein Obama': 3}

sentence_parts = ['Obama', 'Barack', 'Barack Hussein Obama',
                  'Steve', 'Jobs', 'Steve Jobs', 'Mark', 'Bob']

>>> {'Barack Hussein Obama': 3, 'Steve Jobs': 3, 'Mark': 1, 'Bob': 1}

Another problem with this approach: if a substring matches several larger strings, only the count of the longest one is incremented:

sentence_parts = ['The', 'The New York City', 'The Voice']
>>> {'The New York City': 2, 'The Voice': 1}

Ideally, the output should be {'The New York City': 2, 'The Voice': 2}

Answer 1

This is a bit shorter, and it also solves the last problem you describe, where only the longest match gets incremented.

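sentence_parts = ['The', 'Ohio State', 'Ohio',
                  'Paris, France', 'Paris',
                  'The Ohio State University']
# 'count' starts at 1 for the string itself; 'in' marks strings contained in a larger one
matching = {key: {'count': 1, 'in': False} for key in sentence_parts}

for i in sentence_parts:
    for i2 in sentence_parts:
        if i in i2 and i != i2:
            matching[i2]['count'] += 1  # credit every larger string that contains i
            matching[i]['in'] = True    # i is part of a larger string

# keep only the strings that are not contained in any other string
print({x: matching[x]['count'] for x in matching if not matching[x]['in']})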

Edit: removed

sentence_parts = sorted(sentence_parts, key=len)

because it was unnecessary.

Edit 2: shortened the dictionary creation by using a dict comprehension.
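To check the claim about the last case in the question, the loop above can be wrapped in a function and run on that input. This is only a sketch: the function name count_merged and the renamed loop variables are made up for illustration and are not part of the answer.

```python
def count_merged(sentence_parts):
    """Hypothetical wrapper around the loop from this answer."""
    matching = {key: {'count': 1, 'in': False} for key in sentence_parts}
    for part in sentence_parts:
        for other in sentence_parts:
            if part in other and part != other:
                matching[other]['count'] += 1  # every containing string is credited
                matching[part]['in'] = True    # part is absorbed, so it is dropped below
    return {key: value['count']
            for key, value in matching.items() if not value['in']}


print(count_merged(['The', 'The New York City', 'The Voice']))
# {'The New York City': 2, 'The Voice': 2}
```

Because the inner loop credits every larger string that contains the part, 'The' is counted for both 'The New York City' and 'The Voice'.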

Answer 2

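The following solution conceptually splits the problem into two operations:

1. Find the actual number of occurrences of each sentence.
2. Remove every sentence that has already been counted as part of a larger sentence.

This makes the solution easier to debug and to extend later.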

from collections import defaultdict

sentence_parts =  ['The', 'Ohio State', 'Ohio',
                   'Paris, France', 'Paris',
                   'The Ohio State University']

sentences_with_count = defaultdict(int)
for part in sentence_parts:
    for sentence in sentence_parts:
        if part in sentence:
            sentences_with_count[sentence] += 1

# sentences_with_count contains values for all parts.
# Next step is to filter the ones counted in bigger terms

sentence_keys = list(sentences_with_count.keys())
for k in sentence_keys:
    for other in sentence_keys:
        if k in other and k != other:
            sentences_with_count.pop(k,None) # Remove consumed terms
            break

print(sentences_with_count)
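On the efficiency part of the question, one possible reduction is to collapse duplicates first and compare each string only against longer ones, since a shorter string can never contain a longer one. The sketch below assumes that approach; merge_counts is a hypothetical name, and the number of substring checks is still quadratic in the worst case, roughly half of what the nested loops above perform.

```python
from collections import Counter


def merge_counts(parts):
    """Hypothetical helper: credit each string to every longer string containing it."""
    occurrences = Counter(parts)           # collapse duplicates, keep their multiplicity
    unique = sorted(occurrences, key=len)  # shortest first
    totals = dict(occurrences)
    contained = set()
    for idx, small in enumerate(unique):
        # Only strings later in the length-sorted list can contain `small`.
        for big in unique[idx + 1:]:
            if len(big) > len(small) and small in big:
                totals[big] += occurrences[small]
                contained.add(small)
    # Drop everything that was absorbed by a larger string.
    return {s: n for s, n in totals.items() if s not in contained}


print(merge_counts(['The', 'Ohio State', 'Ohio',
                    'Paris, France', 'Paris',
                    'The Ohio State University']))
```

For much larger inputs, a different data structure (for example a substring index) would be needed to avoid the pairwise comparison altogether; that is outside the scope of this sketch.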
