Co-occurrence matrix for the top 2000 words of a tfidf vectorizer

Question

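I ran a tfidf vectorizer on my text data and got a matrix of shape (100000, 2000), with max_features = 2000.

I am now trying to compute the co-occurrence matrix with the code below.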

import numpy as np
length = 2000
m = np.zeros([length,length]) # n is the count of all words
def cal_occ(sentence,m):
    for i,word in enumerate(sentence):
        print(i)
        print(word)
        for j in range(max(i-window,0),min(i+window,length)):
            print(j)
            print(sentence[j])
            m[word,sentence[j]]+=1
for sentence in tf_vec:
    cal_occ(sentence, m)

I get the following error.

0
(0, 1210)   0.20426932204609685
(0, 191)    0.23516811545499153
(0, 592)    0.2537746177804585
(0, 1927)   0.2896119458034052
(0, 1200)   0.1624114163299802
(0, 1856)   0.24376566018277918
(0, 1325)   0.2789314085220367
(0, 756)    0.15365704375851477
(0, 1130)   0.293489555928974
(0, 346)    0.21231046306681553
(0, 557)    0.2036759579760878
(0, 1036)   0.29666992324872365
(0, 264)    0.36435609585838674
(0, 1701)   0.242619998334931
(0, 1939)   0.33934107208095693
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-96-ad505b6df734> in <module>()
 11             m[word,sentence[j]]+=1
 12 for sentence in tf_vec:
 ---> 13     cal_occ(sentence, m)

 <ipython-input-96-ad505b6df734> in cal_occ(sentence, m)
  9             print(j)
 10             print(sentence[j])
 ---> 11             m[word,sentence[j]]+=1
 12 for sentence in tf_vec:
 13     cal_occ(sentence, m)

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Answer 1

Your problem is most likely here:

for j in range(max(i-window,0),min(i+window,length)):

When i + window goes out of bounds, the min function returns length. You could try this in place of the line above:

for j in range(max(i-window,0),min(i+window,length-1)):

Hope this helps,

Cheers
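
One further possibility is worth checking against the traceback in the question. The printed `(0, 1210)   0.204...` lines look like a SciPy sparse row, so `word` may be a sparse row rather than an integer, and NumPy rejects that as an index with exactly this kind of IndexError. Note also that `range()` excludes its stop value, and that the window should probably be bounded by the length of the current sequence rather than by the vocabulary size. Below is a minimal sketch of a windowed co-occurrence count, assuming each document has first been converted to a list of integer token ids. The function name `cooccurrence_matrix` and the toy `docs` input are hypothetical and are not taken from the original post.

```python
import numpy as np


def cooccurrence_matrix(sequences, vocab_size, window=2):
    """Count how often two token ids occur within `window` positions of each other."""
    m = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    for seq in sequences:
        n = len(seq)
        for i, word in enumerate(seq):
            # Bound the window by the sequence length, not the vocabulary size.
            # range() excludes its stop value, so i + window + 1 is needed to
            # include the right-hand neighbour at distance `window`.
            for j in range(max(i - window, 0), min(i + window + 1, n)):
                if j != i:  # do not pair a position with itself
                    m[word, seq[j]] += 1
    return m


# Hypothetical toy input: each document is a list of integer token ids,
# for example indices into the vectorizer's vocabulary.
docs = [[0, 1, 2, 1], [2, 3]]
m = cooccurrence_matrix(docs, vocab_size=4, window=1)
print(m)
```

Because every neighbouring pair is counted once from each side, the resulting matrix is symmetric. For 100000 documents the pure-Python loops will be slow, so this is meant to illustrate the indexing rather than to serve as a production implementation.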

Solution 1
0 2018-11-02 11:52:24