Python frequency distribution (FreqDist / NLTK) problem

Question

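I am trying to break a list of words (a tokenized string) into every possible substring, and then run FreqDist over those substrings to find the most common one. The first part works fine. However, when I run FreqDist, I get this error: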

TypeError: unhashable type: 'list'

Here is my code:

import nltk

string = ['This','is','a','sample']
substrings = []

count1 = 0
count2 = 0

for word in string:
    while count2 <= len(string):
        if count1 != count2:
            temp = string[count1:count2]
            substrings.append(temp)
        count2 += 1
    count1 +=1
    count2 = count1

print substrings

fd = nltk.FreqDist(substrings)

print fd

The output of substrings is fine. Here it is:

[['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'], ['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'], ['sample']]

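However, I can't get FreqDist to run on it. Any insight would be greatly appreciated. In this case every substring has a count of just 1, but the program is meant to run on much larger samples of text.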

Answer 1

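I'm not entirely sure what you want, but the error message says that something is trying to hash a list, which is usually a sign that the list is being put into a set or used as a dictionary key. We can get around this by giving it tuples instead.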
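As a minimal illustration of the hashability point, independent of NLTK: a list cannot be hashed, while a tuple holding the same strings can, so the tuple works as a dictionary key. The names `words`, `counts`, and `key` are just for this sketch.

```python
# Lists are mutable, so Python refuses to hash them.
words = ['This', 'is']

try:
    hash(words)
    list_is_hashable = True
except TypeError:
    # This is the same "unhashable type: 'list'" error as in the question.
    list_is_hashable = False

# A tuple of strings is immutable and hashable, so it can be a dict key.
counts = {}
key = tuple(words)
counts[key] = counts.get(key, 0) + 1

print(list_is_hashable)
print(counts)
```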

>>> import nltk
>>> import itertools
>>> 
>>> sentence = ['This','is','a','sample']
>>> contiguous_subs = [sentence[i:j] for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
>>> contiguous_subs
[['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'],
 ['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'],
 ['sample']]
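In case the comprehension looks opaque, here is a small sketch of what `itertools.combinations` contributes: it yields every index pair `(i, j)` with `i < j`, and each pair is used as a slice. I use `range` rather than `xrange` so the snippet runs on both Python 2 and Python 3; `index_pairs` is a name made up for this illustration.

```python
import itertools

sentence = ['This', 'is', 'a', 'sample']

# combinations(range(n + 1), 2) yields every (i, j) with 0 <= i < j <= n,
# so sentence[i:j] is always a non-empty contiguous slice.
index_pairs = list(itertools.combinations(range(len(sentence) + 1), 2))

print(index_pairs[:4])   # the pairs that start at index 0
print(len(index_pairs))  # one pair per contiguous slice

slices = [sentence[i:j] for i, j in index_pairs]
print(slices[-1])
```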

But we still get

>>> fd = nltk.FreqDist(contiguous_subs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 107, in __init__
    self.update(samples)
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 437, in update
    self.inc(sample, count=count)
  File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 122, in inc
    self[sample] = self.get(sample,0) + count
TypeError: unhashable type: 'list'

However, if we turn the subsequences into tuples:

>>> contiguous_subs = [tuple(sentence[i:j]) for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
>>> contiguous_subs
[('This',), ('This', 'is'), ('This', 'is', 'a'), ('This', 'is', 'a', 'sample'), ('is',), ('is', 'a'), ('is', 'a', 'sample'), ('a',), ('a', 'sample'), ('sample',)]
>>> fd = nltk.FreqDist(contiguous_subs)
>>> print fd
<FreqDist: ('This',): 1, ('This', 'is'): 1, ('This', 'is', 'a'): 1, ('This', 'is', 'a', 'sample'): 1, ('a',): 1, ('a', 'sample'): 1, ('is',): 1, ('is', 'a'): 1, ('is', 'a', 'sample'): 1, ('sample',): 1>
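Since the question mentions larger texts, here is a rough sketch of the same idea on a sample where something actually repeats. To keep it runnable without NLTK installed, it uses `collections.Counter` as a stand-in for `nltk.FreqDist`; passing the same list of tuples to `nltk.FreqDist` should count them the same way. The helper `contiguous_subsequences` and the token list are hypothetical, made up for illustration.

```python
from collections import Counter


def contiguous_subsequences(tokens):
    """Return every non-empty contiguous run of tokens as a tuple."""
    n = len(tokens)
    return [tuple(tokens[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


# Hypothetical sample in which the pair ('a', 'b') occurs twice.
tokens = ['a', 'b', 'a', 'b', 'c']

# Counter stands in for nltk.FreqDist here (an assumption of this sketch).
counts = Counter(contiguous_subsequences(tokens))

# Single tokens tend to dominate, so look only at runs of two or more.
multiword = Counter({sub: c for sub, c in counts.items() if len(sub) >= 2})

print(multiword.most_common(1))
```

Note that the number of contiguous subsequences grows quadratically with the number of tokens, so for long texts you may want to cap the maximum run length.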

Is that what you were looking for?

Answer 1
6, accepted, 2012-04-05 15:41:23
