Print all possible phrases (consecutive combinations of words) in a given string

Question

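I am trying to print the phrases in a given text. I want to be able to print every phrase in the text, from 2 words up to the maximum number of words the length of the text allows. I have written a program below that prints all phrases up to 5 words long, but I can't work out a more elegant way to print all possible phrases.

My definition of a phrase = consecutive words in a string, regardless of meaning.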

def phrase_builder(i):
    phrase_length = 4
    phrase_list = []
    for x in range(0, len(i)-phrase_length):
        phrase_list.append(str(i[x]) + " " + str(i[x+1]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]) + " " + str(i[x+4]))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))

The output is:

['the big', 'the big fat', 'the big fat cat', 'the big fat cat sits',
'big fat', 'big fat cat', 'big fat cat sits', 'big fat cat sits on',
'fat cat', 'fat cat sits', 'fat cat sits on', 'fat cat sits on the',
'cat sits', 'cat sits on', 'cat sits on the', 'cat sits on the mat',
'sits on', 'sits on the', 'sits on the mat', 'sits on the mat eating',
'on the', 'on the mat', 'on the mat eating', 'on the mat eating a',
'the mat', 'the mat eating', 'the mat eating a', 'the mat eating a rat']

I would like to be able to print phrases such as "the big fat cat sits on the mat eating" and "fat cat sits on the mat eating a rat".

Can anyone offer some suggestions?

Answer 1

Just use itertools.combinations:

from itertools import combinations
text = "the big fat cat sits on the mat eating a rat"
lst = text.split()
for start, end in combinations(range(len(lst)), 2):
    print(lst[start:end+1])

Output:

['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['the', 'big', 'fat', 'cat', 'sits', 'on']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']
['big', 'fat', 'cat', 'sits', 'on']
['big', 'fat', 'cat', 'sits', 'on', 'the']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['fat', 'cat']
['fat', 'cat', 'sits']
['fat', 'cat', 'sits', 'on']
['fat', 'cat', 'sits', 'on', 'the']
['fat', 'cat', 'sits', 'on', 'the', 'mat']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['cat', 'sits']
['cat', 'sits', 'on']
['cat', 'sits', 'on', 'the']
['cat', 'sits', 'on', 'the', 'mat']
['cat', 'sits', 'on', 'the', 'mat', 'eating']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['sits', 'on']
['sits', 'on', 'the']
['sits', 'on', 'the', 'mat']
['sits', 'on', 'the', 'mat', 'eating']
['sits', 'on', 'the', 'mat', 'eating', 'a']
['sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['on', 'the']
['on', 'the', 'mat']
['on', 'the', 'mat', 'eating']
['on', 'the', 'mat', 'eating', 'a']
['on', 'the', 'mat', 'eating', 'a', 'rat']
['the', 'mat']
['the', 'mat', 'eating']
['the', 'mat', 'eating', 'a']
['the', 'mat', 'eating', 'a', 'rat']
['mat', 'eating']
['mat', 'eating', 'a']
['mat', 'eating', 'a', 'rat']
['eating', 'a']
['eating', 'a', 'rat']
['a', 'rat']
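This works because combinations(range(len(lst)), 2) yields every index pair with start < end, and each such pair marks the first and last word of exactly one contiguous run. The question asked for strings rather than lists, so here is a minimal sketch that joins each slice. The function name phrases_via_combinations is made up for illustration and is not part of the answer.

```python
from itertools import combinations

def phrases_via_combinations(text):
    """Return every run of two or more consecutive words in text."""
    words = text.split()
    # Each (start, end) pair has start < end, so words[start:end + 1]
    # always holds at least two words.
    return [" ".join(words[start:end + 1])
            for start, end in combinations(range(len(words)), 2)]

phrases = phrases_via_combinations("the big fat cat")
print(phrases)
# ['the big', 'the big fat', 'the big fat cat', 'big fat', 'big fat cat', 'fat cat']
```

For the 11-word sentence in the question this gives 55 phrases, one per pair of positions.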

Answer 2

First, you need to figure out how to write all four of those lines in the same way. Instead of concatenating words and spaces by hand, use the join method:

phrase_list.append(" ".join(str(i[x+y]) for y in range(2)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(3)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(4)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(5)))

If the comprehension inside the join call is unclear, write it out by hand like this:

phrase = []
for y in range(2):
    phrase.append(str(i[x+y]))
phrase_list.append(" ".join(phrase))

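Once you have done that, replacing those four lines with a single loop is easy:

# Lengths 2 through 5 when phrase_length is 4, matching the four lines above.
for length in range(2, phrase_length + 2):
    phrase_list.append(" ".join(str(i[x+y]) for y in range(length)))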

You can simplify this in a few other, independent ways.

First, use slicing: i[x+y] for y in range(length) can be done more easily as i[x:x+length].
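A small check of that equivalence, using the answer's own names i, x and length with sample values chosen only for illustration:

```python
i = "the big fat cat sits".split()
x, length = 1, 3

# Picking the words out one index at a time...
by_index = [i[x + y] for y in range(length)]
# ...gives the same list as taking one slice.
by_slice = i[x:x + length]

print(by_index)              # ['big', 'fat', 'cat']
print(by_index == by_slice)  # True
```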

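I am guessing that i is already a list of strings, so you can get rid of the str calls.

Also, range starts from 0 by default, so you can leave that argument out.

While we are at it, the code is much easier to think about if you use meaningful variable names, such as words instead of i.

So: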

def phrase_builder(words):
    phrase_length = 4
    phrase_list = []
    for i in range(len(words) - phrase_length):
        # Lengths 2 through 5, as in the original four append lines.
        for length in range(2, phrase_length + 2):
            phrase_list.append(" ".join(words[i:i+length]))
    return phrase_list

Now your loop is simple enough that you can turn it into a comprehension, so the whole body becomes a single expression:

def phrase_builder(words):
    phrase_length = 4
    return [" ".join(words[i:i+length])
            for i in range(len(words) - phrase_length)
            for length in range(2, phrase_length + 2)]

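One last thing: as @SoundDefense asked, are you sure you don't want "eating a rat"? It starts fewer than 5 words from the end, but it is a 3-word phrase in the text.

If you do want it, just remove the - phrase_length part.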
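The bound is there in the question's code because i[x+4] raises IndexError near the end of the list. A slice never raises there; it simply comes back shorter than requested, so once the bound is gone those short slices need to be skipped or the same tail phrase is added more than once. A minimal sketch of one way to do that, assuming a hypothetical max_words parameter that is not in the original answer:

```python
def phrase_builder(words, max_words=5):
    """Phrases of 2..max_words consecutive words, including ones near the end."""
    phrase_list = []
    for i in range(len(words)):
        for length in range(2, max_words + 1):
            chunk = words[i:i + length]
            # A slice that runs past the end is shorter than requested;
            # skip it so each tail phrase is added only once.
            if len(chunk) == length:
                phrase_list.append(" ".join(chunk))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"
result = phrase_builder(text.split())
print(result[-3:])  # ['eating a', 'eating a rat', 'a rat']
```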

Answer 3

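You need a systematic way to enumerate every possible phrase.

One way is to start at each word and then generate all the possible phrases that begin with that word.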

def phrase_builder(my_words):
    phrases = []
    for i, word in enumerate(my_words):
        phrases.append(word)
        for nextword in my_words[i+1:]:
            phrases.append(phrases[-1] + " " + nextword)
        # Remove the one-word phrase.
        phrases.remove(word)
    return phrases



text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))
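Note that phrases.remove(word) searches the list from the front on every pass. A possible variant, sketched here under the hypothetical name phrase_builder_no_remove, keeps the growing phrase in a local variable so the one-word phrase is never stored in the first place:

```python
def phrase_builder_no_remove(my_words):
    phrases = []
    for i, word in enumerate(my_words):
        current = word  # running phrase; not stored while it is one word
        for nextword in my_words[i + 1:]:
            current = current + " " + nextword
            phrases.append(current)
    return phrases

short = phrase_builder_no_remove("the big fat cat".split())
print(short)
# ['the big', 'the big fat', 'the big fat cat', 'big fat', 'big fat cat', 'fat cat']
```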

Answer 4

I think the simplest approach is to iterate over all possible start and end positions in the words list and generate a phrase for each sublist of words:

def phrase_builder(words):
    for start in range(0, len(words)-1):
        for end in range(start+2, len(words)+1):
            yield ' '.join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
for phrase in phrase_builder(text.split()):
    print(phrase)

Output:

the big
the big fat
...
the big fat cat sits on the mat eating a rat
...
sits on the mat eating a
...
eating a rat
a rat
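Because phrase_builder here is a generator, phrases are produced one at a time and the caller decides how many to consume. A short usage sketch, with the generator repeated so the snippet runs on its own; the use of islice and the count check are illustrative additions, not part of the answer:

```python
from itertools import islice

def phrase_builder(words):
    # Same generator as in the answer above.
    for start in range(0, len(words) - 1):
        for end in range(start + 2, len(words) + 1):
            yield ' '.join(words[start:end])

words = "the big fat cat sits on the mat eating a rat".split()

# Take only the first three phrases; the rest are never built.
first_three = list(islice(phrase_builder(words), 3))
print(first_three)  # ['the big', 'the big fat', 'the big fat cat']

# Every pair of positions with start < end gives one phrase: n * (n - 1) / 2.
n = len(words)
total = sum(1 for _ in phrase_builder(words))
print(total == n * (n - 1) // 2)  # True
```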

Answer 1: score 15, accepted, 2014-07-25 22:00:13
Answer 2: score 2, 2014-07-25 21:44:23
Answer 3: score 1, 2014-07-25 21:44:43
Answer 4: score 1, 2014-07-25 21:53:44
