numpy - how to combine multiple indices (replace multiple one-by-one matrix access with one access)
The implementation does not account for multiple occurrences of the same word within a window, nor for self co-occurrences.
For instance, when stride=2 and the word at the current position is W, the co-occurrence count of X needs +2 and the self co-occurrence count of W needs +1.
X|Y|W|X|W
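The undercount comes from how NumPy handles repeated indices in an in-place fancy-indexed update: each target element is written only once. A minimal sketch, using made-up ids for the window above (W=0, X=1, Y=2), that contrasts `+=` with `np.add.at`, which accumulates every occurrence:

```python
import numpy as np

# Hypothetical ids for the example window X|Y|W|X|W: W=0, X=1, Y=2.
W, X, Y = 0, 1, 2
window = np.array([X, Y, W, X, W])   # the word at the center position is W

buffered = np.zeros((3, 3), dtype=np.int32)
buffered[W, window] += 1             # repeated column indices are applied once

unbuffered = np.zeros((3, 3), dtype=np.int32)
np.add.at(unbuffered, (W, window), 1)  # every occurrence is accumulated

print(buffered[W])    # [1 1 1]
print(unbuffered[W])  # [2 2 1]
```

In the unbuffered row, the count of 2 for W includes the center word itself; leaving the center out of the window gives the +1 self co-occurrence described above.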
To update the m * m matrix (co_occurrence_matrix), the code currently accesses it row by row in a loop. The entire code is at the bottom.
How can I remove the loop and update multiple rows all at once? I believe there should be a way to combine each index into one matrix, replacing the loop with one vectorized update.
Please advise possible approaches.
for position in range(0, n):
    co_ccurrence_matrix[
        sequence[position],  # position to the word
        sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
    ] += 1
Loop through the sequence array of word indices (a word index is an integer code for each word).
For each word at position in the loop, check the co-occurring words on both sides within a stride distance. This is an N-gram context window, shown as the purple box in the diagram. N = context_size = stride*2 + 1.
Increment the count of each co-occurring word in co_occurrence_matrix, as shown by the blue lines in the diagram.
It seems that integer array indexing may be a way to access multiple rows at the same time.
x = np.array([[ 0,  1,  2],
              [ 3,  4,  5],
              [ 6,  7,  8],
              [ 9, 10, 11]])
rows = np.array([[0, 0],
                 [3, 3]], dtype=np.intp)
columns = np.array([[0, 2],
                    [0, 2]], dtype=np.intp)
x[rows, columns]
---
array([[ 0,  2],
       [ 9, 11]])
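Index arrays are broadcast against each other, so a column vector of row indices can be paired with a 2-D array of column indices. A small sketch on the same `x` (the names `rows_col` and `cols` are only for illustration):

```python
import numpy as np

x = np.arange(12).reshape(4, 3)

rows_col = np.array([[0], [3]], dtype=np.intp)      # shape (2, 1)
cols = np.array([[0, 2], [0, 2]], dtype=np.intp)    # shape (2, 2)

# rows_col broadcasts to [[0, 0], [3, 3]], so this selects the four corners.
selected = x[rows_col, cols]
print(selected)
```

This only works when every row has the same number of column indices; windows of different lengths cannot be stacked into one rectangular index array.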
I tried to create multi-dimensional indices by combining each index from the loop, but the update fails with the error below. Please advise on the cause and the mistake, or whether the attempt makes sense at all.
indices = np.array([
    [
        sequence[0],  # position to the word
        sequence[max(0, 0-stride) : min((0+stride),n-1) +1]  # positions to co-occurrence words
    ]]
)
assert n > 1
for position in range(1, n):
    co_occurrence_indices = np.array([
        [
            sequence[position],  # position to the word
            sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
        ]]
    )
    indices = np.append(
        indices,
        co_occurrence_indices,
        axis=0
    )
print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
    indices,
    indices.dtype
))
co_ccurrence_matrix[
    indices  # <---- Error
] += 1
Updating the co_occurrence_matrix: indices
[[0 array([0, 1])]
[1 array([0, 1, 2])]
[2 array([1, 2, 3])]
[3 array([2, 3, 0])]
[0 array([3, 0, 1])]
[1 array([0, 1, 4])]
[4 array([1, 4, 5])]
[5 array([4, 5, 6])]
[6 array([5, 6, 7])]
[7 array([6, 7])]]
indices.dtype object
<ipython-input-88-d9b081bf2f1a>:48: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
indices = np.array([
<ipython-input-88-d9b081bf2f1a>:56: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
co_occurrence_indices = np.array([
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-88-d9b081bf2f1a> in <module>
84 sequence, word_to_id, id_to_word = preprocess(corpus)
85 vocabrary_size = max(word_to_id.values()) + 1
---> 86 create_cooccurrence_matrix(sequence, vocabrary_size , 3)
<ipython-input-88-d9b081bf2f1a> in create_cooccurrence_matrix(sequence, vocabrary_size, context_size)
70 indices.dtype
71 ))
---> 72 co_ccurrence_matrix[
73 indices
74 ] += 1
IndexError: arrays used as indices must be of integer (or boolean) type
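The error follows from the dtype shown in the output: the windows at the two ends are shorter than the rest, so the nested lists are ragged and NumPy falls back to `dtype=object`, which cannot be used as an index. One possible workaround, sketched here on the assumption that counting a repeated word once per occurrence is acceptable, is to build two flat integer arrays (one row index and one column index per pair) and pass them to `np.add.at`. The helper name `flat_pair_indices` is made up for this example.

```python
import numpy as np

def flat_pair_indices(sequence, stride):
    """Return flat integer (rows, cols) arrays, one entry per (word, window word) pair."""
    n = len(sequence)
    windows = [sequence[max(0, p - stride): min(p + stride, n - 1) + 1]
               for p in range(n)]
    lengths = [len(w) for w in windows]
    rows = np.repeat(sequence, lengths)   # each word repeated once per window entry
    cols = np.concatenate(windows)        # all window entries laid end to end
    return rows, cols

sequence = np.array([0, 1, 2, 3, 0, 1, 4, 5, 6, 7])
rows, cols = flat_pair_indices(sequence, stride=1)

matrix = np.zeros((8, 8), dtype=np.int32)
np.add.at(matrix, (rows, cols), 1)        # one update, integer-typed indices
np.fill_diagonal(matrix, 0)
print(rows.dtype, cols.dtype)
print(matrix)
```

The windows are still collected with a list comprehension, so this removes the per-row matrix update rather than every Python-level loop.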
import numpy as np

def preprocess(text):
    """
    Args:
        text: A string including sentences to process. corpus
    Returns:
        sequence:
            A numpy array of word indices to every word in the original text as they appear in the text.
            The objective of corpus is to preserve the original text but as numerical indices.
        word_to_id: A dictionary to map a word to a word index
        id_to_word: A dictionary to map a word index to a word
    """
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')
    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    sequence= np.array([word_to_id[w] for w in words])
    return sequence, word_to_id, id_to_word

def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
    """
    Args:
        sequence: word index sequence of the original corpus text
        vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
        context_size: context (N-gram size N) within which to check co-occurrences.
    """
    n = sequence_size = len(sequence)
    co_ccurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
    stride = int((context_size - 1)/2 )
    assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
        n, stride
    )
    for position in range(0, n):
        co_ccurrence_matrix[
            sequence[position],  # position to the word
            sequence[max(0, position-stride) : min((position+stride),n-1) +1]  # positions to co-occurrence words
        ] += 1
    np.fill_diagonal(co_ccurrence_matrix, 0)
    return co_ccurrence_matrix

corpus= "To be, or not to be, that is the question"
sequence, word_to_id, id_to_word = preprocess(corpus)
vocabrary_size = max(word_to_id.values()) + 1
create_cooccurrence_matrix(sequence, vocabrary_size , 3)
---
[[0 2 0 1 0 0 0 0]
[2 0 1 0 1 0 0 0]
[0 1 0 1 0 0 0 0]
[1 0 1 0 0 0 0 0]
[0 1 0 0 0 1 0 0]
[0 0 0 0 1 0 1 0]
[0 0 0 0 0 1 0 1]
[0 0 0 0 0 0 1 0]]
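For reference, the loop can also be written without any Python-level iteration by pairing every position with each offset in -stride..stride and masking out pairs that fall outside the sequence. This is a sketch under stated assumptions rather than a drop-in replacement: `np.add.at` counts a word once per occurrence in the window, whereas the `+= 1` loop counts it at most once per position, so the two agree only when no window contains a repeated word (as in the example sentence). The function name `cooccurrence_by_offsets` is hypothetical.

```python
import numpy as np

def cooccurrence_by_offsets(sequence, vocabulary_size, context_size=3):
    """Count (center, neighbour) pairs within `stride` positions of each other."""
    sequence = np.asarray(sequence)
    n = len(sequence)
    stride = (context_size - 1) // 2

    positions = np.arange(n)[:, None]                                 # (n, 1)
    offsets = np.concatenate([np.arange(-stride, 0),
                              np.arange(1, stride + 1)])[None, :]      # (1, 2*stride)
    neighbours = positions + offsets                                   # (n, 2*stride)
    valid = (neighbours >= 0) & (neighbours < n)                       # drop out-of-range pairs

    rows = np.broadcast_to(sequence[:, None], neighbours.shape)[valid]
    cols = sequence[neighbours[valid]]

    matrix = np.zeros((vocabulary_size, vocabulary_size), dtype=np.int32)
    np.add.at(matrix, (rows, cols), 1)   # unbuffered: repeated pairs accumulate
    np.fill_diagonal(matrix, 0)
    return matrix

sequence = np.array([0, 1, 2, 3, 0, 1, 4, 5, 6, 7])
print(cooccurrence_by_offsets(sequence, 8, 3))
```

On large inputs, `np.bincount(rows * vocabulary_size + cols, minlength=vocabulary_size ** 2).reshape(vocabulary_size, vocabulary_size)` may be faster than `np.add.at`; it would be worth profiling both on the real corpus.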
Line-by-line profile of the loop version, using ptb.train.txt as the corpus:
Timer unit: 1e-06 s
Total time: 23.0015 s
File: <ipython-input-8-27f5e530d4ff>
Function: create_cooccurrence_matrix at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
2 """
3 Args:
4 sequence: word index sequence of the original corpus text
5 vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
6 context_size: context (N-gram size N) within to check co-occurrences.
7 Returns:
8 co_occurrence matrix
9 """
10 1 4.0 4.0 0.0 n = sequence_size = len(sequence)
11 1 98.0 98.0 0.0 co_occurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
12
13 1 5.0 5.0 0.0 stride = int((context_size - 1)/2 )
14 1 1.0 1.0 0.0 assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
15 n, stride
16 )
17
18 """
19 # Handle position=slice(0 : (stride-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
20 # Handle position=slice((n-1-stride) : (n-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
21 indices = [*range(0, (stride-1) +1), *range((n-1)-stride +1, (n-1) +1)]
22 #print(indices)
23
24 for position in indices:
25 debug(sequence, position, stride, False)
26 co_occurrence_matrix[
27 sequence[position], # position to the word
28 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # indices to co-occurance words
29 ] += 1
30
31
32 # Handle position=slice(stride, ((sequence_size-1) - stride) +1)
33 for position in range(stride, (sequence_size-1) - stride + 1):
34 co_occurrence_matrix[
35 sequence[position], # position to the word
36 sequence[(position-stride) : (position + stride + 1)] # indices to co-occurance words
37 ] += 1
38 """
39
40 929590 1175326.0 1.3 5.1 for position in range(0, n):
41 2788767 15304643.0 5.5 66.5 co_occurrence_matrix[
42 1859178 2176964.0 1.2 9.5 sequence[position], # position to the word
43 929589 3280181.0 3.5 14.3 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurance words
44 929589 1062613.0 1.1 4.6 ] += 1
45
46 1 1698.0 1698.0 0.0 np.fill_diagonal(co_occurrence_matrix, 0)
47
48 1 2.0 2.0 0.0 return co_occurrence_matrix
EDIT: You could do this extremely easily using built-in sklearn functions, but looking at the history of your questions, I believe you are looking for a pure NumPy vectorized implementation.
IIUC, you want to create a co-occurrence matrix based on the context window around a word. So, if there are 12 words in the vocabulary, 100 sentences, and, say, a context size of 2, then you want to look at rolling windows of size 5 (2 left, 1 center, 2 right) in each of the sentences and iteratively (or in a vectorized way) add up the context words to get a (12, 12) matrix that tells you how many times a word occurs in the context window of another word.
You can do this in a completely vectorized manner as follows (explanation in the last section) -
#Definitions
sentences, vocab, length, context_size = 100, 12, 15, 2
#Create dummy corpus (label encoded)
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length)) #(100, 15)
#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window #(100, 11, 5)
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1] #(120, 8, 8)
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride) #(100, 11, 5)
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size] #(100, 11)
context = np.delete(rolling_window, center_idx, -1) #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2) #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1])) #(12, 12)
np.fill_diagonal(cooccurence,0) #(12, 12)
print(cooccurence)
[[ 0. 94. 100. 114. 91. 92. 90. 128. 100. 114. 91. 84.]
[ 94. 0. 78. 96. 90. 65. 76. 68. 76. 108. 58. 68.]
[100. 78. 0. 125. 107. 93. 83. 84. 73. 84. 97. 110.]
[114. 96. 125. 0. 84. 97. 76. 110. 80. 94. 117. 97.]
[ 91. 90. 107. 84. 0. 84. 87. 103. 60. 127. 123. 97.]
[ 92. 65. 93. 97. 84. 0. 67. 87. 72. 87. 74. 92.]
[ 90. 76. 83. 76. 87. 67. 0. 83. 73. 118. 81. 108.]
[128. 68. 84. 110. 103. 87. 83. 0. 72. 100. 115. 69.]
[100. 76. 73. 80. 60. 72. 73. 72. 0. 83. 81. 100.]
[114. 108. 84. 94. 127. 87. 118. 100. 83. 0. 109. 110.]
[ 91. 58. 97. 117. 123. 74. 81. 115. 81. 109. 0. 104.]
[ 84. 68. 110. 97. 97. 92. 108. 69. 100. 110. 104. 0.]]
Let's test this on a single-sentence corpus: to be or not to be that is the question
sentence = 'to be or not to be that is the question'
corpus = np.array([[0, 1, 2, 3, 0, 1, 4, 5, 6, 7]])
#Definitions
vocab, context_size = 8, 2
window = context_size*2+1
#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size]
context = np.delete(rolling_window, center_idx, -1)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 5. 1. 3. 1. 2. 1. 0.]
[5. 0. 3. 2. 2. 1. 2. 1.]
[1. 3. 0. 1. 1. 0. 0. 0.]
[3. 2. 1. 0. 2. 1. 0. 0.]
[1. 2. 1. 2. 0. 1. 1. 1.]
[2. 1. 0. 1. 1. 0. 1. 0.]
[1. 2. 0. 0. 1. 1. 0. 1.]
[0. 1. 0. 0. 1. 0. 1. 0.]]
Let's start by creating some label-encoded dummy data. Here there are 100 sentences with a vocabulary of size 12. The length of each sentence is 15, and the window I am using is 5 (2+1+2) -
sentences, vocab, length, context_size = 100, 12, 15, 2
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length))
corpus[0:2]
#top 2 sentences
array([[ 9, 8, 9, 4, 2, 10, 9, 0, 7, 1, 11, 0, 7, 3, 1],
[ 7, 9, 4, 0, 1, 9, 10, 7, 4, 2, 2, 3, 5, 8, 8]])
Next, we want to create rolling window views of the window size so that we can move on to the next stages. The shape of this new view is (sentences, number of windows, window size), and using stride_tricks we can create a rolling window view of this matrix quite easily.
#Create shape and stride definitions
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
print(shape, stride)
#create view
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride) #(100, 11, 5)
print('\nView for first sequence ->')
print(rolling_window[0])
(100, 11, 5) (120, 8, 8)
View for first sequence ->
[[ 9 8 9 4 2]
[ 8 9 4 2 10]
[ 9 4 2 10 9]
[ 4 2 10 9 0]
[ 2 10 9 0 7]
[10 9 0 7 1]
[ 9 0 7 1 11]
[ 0 7 1 11 0]
[ 7 1 11 0 7]
[ 1 11 0 7 3]
[11 0 7 3 1]]
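`as_strided` works here, but it does no bounds checking, so a wrong shape or stride can read memory outside the array. On NumPy 1.20 or newer, `np.lib.stride_tricks.sliding_window_view` builds the same kind of view more safely. A minimal sketch, assuming a dummy corpus of the same shape as above:

```python
import numpy as np

sentences, vocab, length, context_size = 100, 12, 15, 2
window = context_size * 2 + 1
corpus = np.random.randint(0, vocab, (sentences, length))

# Windows slide along axis 1; the window dimension is appended as the last axis.
rolling_window = np.lib.stride_tricks.sliding_window_view(corpus, window, axis=1)
print(rolling_window.shape)   # (100, 11, 5)

# Same content as the view built by hand from shape and strides.
shape = corpus.shape[0], corpus.shape[1] - window + 1, window
strides = corpus.strides[0], corpus.strides[1], corpus.strides[1]
manual = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=strides)
print(np.array_equal(rolling_window, manual))   # True
```

The view returned by `sliding_window_view` is read-only by default, which is fine for the steps that follow because `np.delete` returns a new array.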
Next, let's look at only a single sentence first and turn it into a co-occurrence matrix. After that we can scale it up to a higher-dimensional matrix.
For a SINGLE SENTENCE, we can do the following steps -
- Create a one-hot matrix using np.eye(vocab) and filter it for the context labels.
- Take a dot product to get a (word, word) co-occurrence matrix from the multi-hot context vectors of each window.
position = rolling_window[0][:,2]
context = np.delete(rolling_window[0], 2, 1)
context_multihot = np.sum(np.eye(vocab)[context], axis=1)
cooccurence = context_multihot.T@context_multihot
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 3. 2. 1. 1. 0. 0. 5. 0. 2. 1. 4.]
[3. 0. 0. 2. 0. 0. 0. 4. 0. 2. 1. 3.]
[2. 0. 0. 0. 2. 0. 0. 1. 2. 3. 2. 0.]
[1. 2. 0. 0. 0. 0. 0. 1. 0. 0. 0. 2.]
[1. 0. 2. 0. 0. 0. 0. 0. 1. 4. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[5. 4. 1. 1. 0. 0. 0. 0. 0. 1. 2. 2.]
[0. 0. 2. 0. 1. 0. 0. 0. 0. 2. 1. 0.]
[2. 2. 3. 0. 4. 0. 0. 1. 2. 0. 4. 1.]
[1. 1. 2. 0. 1. 0. 0. 2. 1. 4. 0. 0.]
[4. 3. 0. 2. 0. 0. 0. 2. 0. 1. 0. 0.]]
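Note that `context_multihot.T @ context_multihot` pairs the context words of a window with each other; the center word (`position`) is not used in the product. If the goal is instead to count how often a word appears around a given center word, as in the question's loop, one option is to one-hot encode the centers and multiply that with the multi-hot contexts. The sketch below assumes a single label-encoded sentence, NumPy 1.20+ for `sliding_window_view`, and full windows only (the first and last `context_size` positions are not used as centers); all names are illustrative.

```python
import numpy as np

vocab, context_size = 8, 1
sentence = np.array([0, 1, 2, 3, 0, 1, 4, 5, 6, 7])
window = 2 * context_size + 1

windows = np.lib.stride_tricks.sliding_window_view(sentence, window)   # (8, 3)
position = windows[:, context_size]                   # center word of each window
context = np.delete(windows, context_size, axis=1)    # neighbours of the center

center_onehot = np.eye(vocab, dtype=np.int64)[position]                # (8, vocab)
context_multihot = np.eye(vocab, dtype=np.int64)[context].sum(axis=1)  # (8, vocab)

# Entry [i, j]: how many times word j appears in a window centered on word i.
center_context = center_onehot.T @ context_multihot
np.fill_diagonal(center_context, 0)
print(center_context)
```

Because the edge positions are skipped, the result differs from the question's loop in the rows of the first and last words; padding the sentence with a dummy id is one way to include them.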
We have now been able to do the whole thing for 1 sentence. Now we just have to scale to 100 sentences without for loops. For this, only a few things need to change.
- Transpose context_multihot over the last 2 axes before the dot product.
- Change np.dot to np.tensordot so that we can reduce over the specified axes. In this case, we have to perform (100, 12, 11) @ (100, 11, 12) -> (12, 12), so select the axes accordingly.
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size] #(100, 11)
context = np.delete(rolling_window, center_idx, -1) #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2) #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1])) #(12, 12)
np.fill_diagonal(cooccurence,0) #(12, 12)
print(cooccurence)
[[ 0. 94. 100. 114. 91. 92. 90. 128. 100. 114. 91. 84.]
[ 94. 0. 78. 96. 90. 65. 76. 68. 76. 108. 58. 68.]
[100. 78. 0. 125. 107. 93. 83. 84. 73. 84. 97. 110.]
[114. 96. 125. 0. 84. 97. 76. 110. 80. 94. 117. 97.]
[ 91. 90. 107. 84. 0. 84. 87. 103. 60. 127. 123. 97.]
[ 92. 65. 93. 97. 84. 0. 67. 87. 72. 87. 74. 92.]
[ 90. 76. 83. 76. 87. 67. 0. 83. 73. 118. 81. 108.]
[128. 68. 84. 110. 103. 87. 83. 0. 72. 100. 115. 69.]
[100. 76. 73. 80. 60. 72. 73. 72. 0. 83. 81. 100.]
[114. 108. 84. 94. 127. 87. 118. 100. 83. 0. 109. 110.]
[ 91. 58. 97. 117. 123. 74. 81. 115. 81. 109. 0. 104.]
[ 84. 68. 110. 97. 97. 92. 108. 69. 100. 110. 104. 0.]]
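The `tensordot` call contracts the sentence axis and the window axis in one step. The same reduction can be written with `np.einsum`, which some readers find easier to check against the shapes. A sketch with random stand-in data (the array below is only a placeholder for `context_multihot`):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for context_multihot: (sentences, windows, vocab) = (100, 11, 12)
context_multihot = rng.integers(0, 3, size=(100, 11, 12)).astype(float)

via_tensordot = np.tensordot(context_multihot.transpose(0, 2, 1), context_multihot,
                             axes=([0, 2], [0, 1]))                # (12, 12)
# s = sentence, w = window, i/j = vocabulary ids; s and w are summed away.
via_einsum = np.einsum('swi,swj->ij', context_multihot, context_multihot)

print(via_tensordot.shape, np.allclose(via_tensordot, via_einsum))
```

With the subscripts spelled out, the explicit transpose is no longer needed, since `einsum` names the axes to sum over directly.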