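Combine strings if they are part of another larger string in the same list
Given a list of sentences and of words that may be contained in them, I want to remove the contained items from the list and merge them into the largest string that contains them (if one exists). Each occurrence of a "part" of that largest string should count toward the largest string's occurrence count.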
from collections import defaultdict

sentence_parts = ['quick brown', 'brown fox', 'fox', 'lazy dog',
                  'quick brown fox jumps over the lazy dog',]
sentences_with_count = defaultdict(int)
for s in sentence_parts:
    # Longer strings that contain s, longest first
    matching_sentences = sorted([si for si in sentence_parts if s in si and len(si) > len(s)],
                                key=len, reverse=True)
    if matching_sentences:
        current_sent_count = sentences_with_count.get(s, 1)
        sentences_with_count[matching_sentences[0]] += current_sent_count
    else:
        sentences_with_count[s] += 1
print(sentences_with_count)
So the resulting sentences_with_count will be:
{
'quick brown fox jumps over the lazy dog': 5
}
Here it is on repl.it.
I know this is not efficient at all. How can I improve it?
Other examples:
sentence_parts = ['The', 'Ohio State', 'Ohio',
                  'Paris, France', 'Paris',
                  'The Ohio State University']
>>> {'The Ohio State University': 4, 'Paris, France': 2}
sentence_parts = ['Obama', 'Barack', 'Barack Hussein Obama']
>>> {'Barack Hussein Obama': 3}
sentence_parts = ['Obama', 'Barack', 'Barack Hussein Obama',
                  'Steve', 'Jobs', 'Steve Jobs', 'Mark', 'Bob']
>>> {'Barack Hussein Obama': 3, 'Steve Jobs': 3, 'Mark': 1, 'Bob': 1}
Another problem with this approach: if a substring is contained in several strings, only the count of the longest one is incremented:
sentence_parts = ['The', 'The New York City', 'The Voice']
>>> {'The New York City': 2, 'The Voice': 1}
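Ideally, the output should be {'The New York City': 2, 'The Voice': 2}
This is slightly shorter, and it solves the problem you described at the end, where only the longest match is incremented.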
sentence_parts = ['The', 'Ohio State', 'Ohio',
                  'Paris, France', 'Paris',
                  'The Ohio State University']
# 'count' starts at 1 for the string itself; 'in' marks strings contained in another
matching = {key: {'count': 1, 'in': False} for key in sentence_parts}
for i in sentence_parts:
    for i2 in sentence_parts:
        if i in i2 and i != i2:
            matching[i2]['count'] += 1
            matching[i]['in'] = True
# Keep only the strings that are not part of a larger one
print({x: matching[x]['count'] for x in matching if not matching[x]['in']})
Edit: removed
sentence_parts = sorted(sentence_parts, key=len)
because it was unnecessary.
Edit 2: shortened the dictionary creation by using a dict comprehension.
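To check the approach above against the examples from the question, it can be wrapped in a function. This is only a minimal sketch: the name `merge_counts` is made up here for illustration, and the logic is the same nested loop as in the answer.

```python
def merge_counts(sentence_parts):
    """Hypothetical wrapper: count each top-level string plus the strings it contains."""
    # 'count' starts at 1 for the string itself; 'in' marks contained strings
    matching = {key: {'count': 1, 'in': False} for key in sentence_parts}
    for part in sentence_parts:
        for other in sentence_parts:
            if part in other and part != other:
                matching[other]['count'] += 1
                matching[part]['in'] = True
    # Drop every string that is part of a larger one
    return {s: m['count'] for s, m in matching.items() if not m['in']}


print(merge_counts(['The', 'The New York City', 'The Voice']))
# {'The New York City': 2, 'The Voice': 2}
print(merge_counts(['Obama', 'Barack', 'Barack Hussein Obama',
                    'Steve', 'Jobs', 'Steve Jobs', 'Mark', 'Bob']))
# {'Barack Hussein Obama': 3, 'Steve Jobs': 3, 'Mark': 1, 'Bob': 1}
```

Both results match the outputs the question asks for, including the case where 'The' has to be counted for two different longer strings.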
The following solution conceptually splits the problem into two operations,
which makes it easier to debug and to extend in the future.
from collections import defaultdict

sentence_parts = ['The', 'Ohio State', 'Ohio',
                  'Paris, France', 'Paris',
                  'The Ohio State University']
sentences_with_count = defaultdict(int)
# Step 1: count every part (including the string itself) inside every sentence
for part in sentence_parts:
    for sentence in sentence_parts:
        if part in sentence:
            sentences_with_count[sentence] += 1
# sentences_with_count contains values for all parts.
# Next step is to filter the ones counted in bigger terms
sentence_keys = list(sentences_with_count.keys())
for k in sentence_keys:
    for other in sentence_keys:
        if k in other and k != other:
            sentences_with_count.pop(k, None)  # Remove consumed terms
            break
print(sentences_with_count)
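On the efficiency part of the question, one possible refinement is to sort the strings by length first, so that each string is only tested against strictly longer ones. The sketch below is an illustration, not part of either answer: the function name `merge_counts_by_length` is hypothetical, and it assumes the input list contains no duplicate strings. It skips roughly half of the substring tests, but the worst case is still quadratic in the number of strings.

```python
from collections import defaultdict


def merge_counts_by_length(sentence_parts):
    """Hypothetical sketch; assumes sentence_parts has no duplicate strings."""
    # Longest first, so every possible container of ordered[i] sits before index i
    ordered = sorted(sentence_parts, key=len, reverse=True)
    counts = defaultdict(int)
    contained = set()
    for i, part in enumerate(ordered):
        counts[part] += 1  # every string counts itself once
        for longer in ordered[:i]:
            # Equal-length strings cannot contain each other, so skip them
            if len(longer) > len(part) and part in longer:
                counts[longer] += 1
                contained.add(part)
    # Keep only the strings that are not part of a larger one
    return {s: c for s, c in counts.items() if s not in contained}


print(merge_counts_by_length(['The', 'Ohio State', 'Ohio',
                              'Paris, France', 'Paris',
                              'The Ohio State University']))
# {'The Ohio State University': 4, 'Paris, France': 2}
```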