Printing all possible phrases (consecutive combinations of words) in a given string
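I am trying to print the phrases in a given text. I want to be able to print every phrase in the text, from 2 words up to the largest number of words the length of the text allows. I have written a program below that prints all phrases up to 5 words long, but I can't work out a more elegant way to print all possible phrases.
My definition of a phrase = consecutive words in a string, regardless of meaning.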
def phrase_builder(i):
    phrase_length = 4
    phrase_list = []
    for x in range(0, len(i)-phrase_length):
        phrase_list.append(str(i[x]) + " " + str(i[x+1]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]) + " " + str(i[x+4]))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"
print(phrase_builder(text.split()))
The output is:
['the big', 'the big fat', 'the big fat cat', 'the big fat cat sits',
'big fat', 'big fat cat', 'big fat cat sits', 'big fat cat sits on',
'fat cat', 'fat cat sits', 'fat cat sits on', 'fat cat sits on the',
'cat sits', 'cat sits on', 'cat sits on the', 'cat sits on the mat',
'sits on', 'sits on the', 'sits on the mat', 'sits on the mat eating',
'on the', 'on the mat', 'on the mat eating', 'on the mat eating a',
'the mat', 'the mat eating', 'the mat eating a', 'the mat eating a rat']
I want to be able to print phrases such as "the big fat cat sits on the mat eating" and "fat cat sits on the mat eating a rat".
Can anyone offer some advice?
Just use itertools.combinations:
from itertools import combinations

text = "the big fat cat sits on the mat eating a rat"
lst = text.split()
# Each (start, end) pair with start < end marks one run of consecutive words.
for start, end in combinations(range(len(lst)), 2):
    print(lst[start:end+1])
Output:
['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['the', 'big', 'fat', 'cat', 'sits', 'on']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']
['big', 'fat', 'cat', 'sits', 'on']
['big', 'fat', 'cat', 'sits', 'on', 'the']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['fat', 'cat']
['fat', 'cat', 'sits']
['fat', 'cat', 'sits', 'on']
['fat', 'cat', 'sits', 'on', 'the']
['fat', 'cat', 'sits', 'on', 'the', 'mat']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['cat', 'sits']
['cat', 'sits', 'on']
['cat', 'sits', 'on', 'the']
['cat', 'sits', 'on', 'the', 'mat']
['cat', 'sits', 'on', 'the', 'mat', 'eating']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['sits', 'on']
['sits', 'on', 'the']
['sits', 'on', 'the', 'mat']
['sits', 'on', 'the', 'mat', 'eating']
['sits', 'on', 'the', 'mat', 'eating', 'a']
['sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['on', 'the']
['on', 'the', 'mat']
['on', 'the', 'mat', 'eating']
['on', 'the', 'mat', 'eating', 'a']
['on', 'the', 'mat', 'eating', 'a', 'rat']
['the', 'mat']
['the', 'mat', 'eating']
['the', 'mat', 'eating', 'a']
['the', 'mat', 'eating', 'a', 'rat']
['mat', 'eating']
['mat', 'eating', 'a']
['mat', 'eating', 'a', 'rat']
['eating', 'a']
['eating', 'a', 'rat']
['a', 'rat']
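If strings rather than lists of words are wanted, each slice can be passed to " ".join. Here is a minimal sketch along those lines. The helper name phrases_from_combinations is made up for illustration and is not part of itertools.

```python
from itertools import combinations

def phrases_from_combinations(text):
    """Yield every run of two or more consecutive words in text."""
    words = text.split()
    # combinations() yields index pairs (start, end) with start < end,
    # so each slice words[start:end + 1] holds at least two words.
    for start, end in combinations(range(len(words)), 2):
        yield " ".join(words[start:end + 1])

text = "the big fat cat sits on the mat eating a rat"
phrases = list(phrases_from_combinations(text))
print(len(phrases))   # 55 phrases for these 11 words
print(phrases[0])     # the big
print(phrases[-1])    # a rat
```

Because it is a generator, the phrases are produced one at a time, so a long text does not have to be held in memory as a full list unless list() is called on it.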
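First, you need to figure out how to write all four of those lines the same way. Instead of concatenating the words and spaces by hand, use the join method: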
phrase_list.append(" ".join(str(i[x+y]) for y in range(2)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(3)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(4)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(5)))
If the comprehension inside the join call isn't clear, write it out by hand like this:
phrase = []
for y in range(2):
    phrase.append(str(i[x+y]))
phrase_list.append(" ".join(phrase))
Once you've done that, replacing those four lines with a single loop is simple:
# With phrase_length = 4 this covers lengths 2 through 5, like the four lines above.
for length in range(2, phrase_length + 2):
    phrase_list.append(" ".join(str(i[x+y]) for y in range(length)))
You can simplify this in a couple of other, independent ways.
First, use slicing: i[x+y] for y in range(length) can be done more easily as i[x:x+length].
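As a quick check of that equivalence, here is a small sketch with made-up values for x and length. It also shows one difference worth knowing about: a slice that runs past the end of the list is truncated rather than raising IndexError.

```python
i = "the big fat cat sits".split()
x, length = 1, 3

by_index = [i[x + y] for y in range(length)]
by_slice = i[x:x + length]
print(by_index == by_slice)   # True

# Only two words are left from index 3 on, so the slice is simply shorter.
print(i[3:3 + length])        # ['cat', 'sits']
```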
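I'm guessing i is already a list of strings, so you can get rid of the str calls.
Also, range starts at 0 by default, so you can leave that argument out.
While we're at it, the code will be much easier to think about if you use meaningful variable names, such as words instead of i.
So: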
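def phrase_builder(words):
    phrase_length = 4
    phrase_list = []
    for i in range(len(words) - phrase_length):
        # Lengths 2 through phrase_length + 1, like the four hand-written lines.
        for length in range(2, phrase_length + 2):
            phrase_list.append(" ".join(words[i:i+length]))
    return phrase_list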
Now your loop is simple enough to turn into a comprehension, and the whole thing becomes a single expression:
def phrase_builder(words):
    phrase_length = 4
    return [" ".join(words[i:i+length])
            for i in range(len(words) - phrase_length)
            for length in range(2, phrase_length + 2)]
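One last thing: as @SoundDefense asked, are you sure you don't want "eating a rat"? It starts fewer than 5 words from the end, but it is a 3-word phrase in the text.
If you do want it, just remove the - phrase_length part.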
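One caveat when dropping - phrase_length: slices that start near the end of the list are silently truncated, so without a bounds check you can get shortened or repeated phrases. Here is one possible sketch that keeps every start position but skips slices that would run off the end. The name phrase_builder_all and the min_length and max_length parameters are my own additions for illustration, not something from the question.

```python
def phrase_builder_all(words, min_length=2, max_length=None):
    """Return all phrases of min_length..max_length consecutive words."""
    if max_length is None:
        max_length = len(words)
    return [" ".join(words[i:i + length])
            for i in range(len(words))
            for length in range(min_length, max_length + 1)
            # Skip slices that would run past the end of the list.
            if i + length <= len(words)]

text = "the big fat cat sits on the mat eating a rat"
words = text.split()
print(len(phrase_builder_all(words)))                          # 55
print("eating a rat" in phrase_builder_all(words, max_length=5))  # True
```

With the default max_length this produces every phrase up to the full text. Passing max_length=5 gives the 2 to 5 word phrases from the question, including the ones near the end of the text.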
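You need a systematic way to enumerate every possible phrase.
One way is to start at each word and then generate all the possible phrases that begin with that word.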
def phrase_builder(my_words):
    phrases = []
    for i, word in enumerate(my_words):
        phrases.append(word)
        # Extend the most recent phrase by one word at a time.
        for nextword in my_words[i+1:]:
            phrases.append(phrases[-1] + " " + nextword)
        # Remove the one-word phrase.
        phrases.remove(word)
    return phrases

text = "the big fat cat sits on the mat eating a rat"
print(phrase_builder(text.split()))
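I think the simplest approach is to loop over all possible start and end positions in the words list and generate a phrase for each corresponding sublist of words: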
def phrase_builder(words):
    for start in range(0, len(words)-1):
        # end is exclusive, so start+2 gives a phrase of at least two words.
        for end in range(start+2, len(words)+1):
            yield ' '.join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
for phrase in phrase_builder(text.split()):
    print(phrase)
Output:
the big
the big fat
...
the big fat cat sits on the mat eating a rat
...
sits on the mat eating a
...
eating a rat
a rat
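As a sanity check on an approach like this, a list of n words has n(n-1)/2 phrases of two or more consecutive words, since each phrase is fixed by choosing its first and last word. The sketch below repeats the generator so that it runs on its own and compares the count against that formula.

```python
def phrase_builder(words):
    for start in range(0, len(words) - 1):
        for end in range(start + 2, len(words) + 1):
            yield " ".join(words[start:end])

words = "the big fat cat sits on the mat eating a rat".split()
phrases = list(phrase_builder(words))
n = len(words)

# 11 words give 11 * 10 // 2 = 55 phrases.
print(len(phrases), n * (n - 1) // 2)
```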