Fast way to create a bag-of-words vector in python
I have a corpus which has been sentence tokenized and word tokenized. Working in Python, I took the 9,999 most common words and replaced out-of-vocabulary words with a special 'UNK', so that I have a vocabulary of 10,000 words and a Python dictionary 'word_to_index' which maps each word to an integer.
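(For reference, a minimal sketch of how such a vocabulary and word_to_index mapping could be built with collections.Counter; the function name build_vocab and its parameters are assumptions for illustration, not part of the original post.)

from collections import Counter

def build_vocab(tokenized_corpus, vocab_size=10000):
    # tokenized_corpus: an iterable of word-tokenized sentences (lists of strings);
    # the name and signature are assumed for this sketch.
    counts = Counter(w for sent in tokenized_corpus for w in sent)
    # Keep the vocab_size - 1 most common words and reserve one slot for 'UNK'.
    vocab = [w for w, _ in counts.most_common(vocab_size - 1)] + ['UNK']
    return {w: i for i, w in enumerate(vocab)}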
I would like a binary bag-of-words representation, where each of the original sentences is represented as a 10,000-dimensional numpy vector of 0s and 1s. If a word i from the vocabulary is in the sentence, index [i] in the numpy array will be 1; otherwise, 0. Until now, I've been using the following code:
import numpy as np

def bag_of_words(sent, vocab_length, word_to_index):
    words = []
    rep = np.zeros(vocab_length)
    for w in sent:
        if w not in words:
            # Add the row of a full identity matrix for each new word
            rep += np.eye(vocab_length)[word_to_index[w]]
            words.append(w)
    return rep
def get_bag_of_words_corpus(corpus, vocab_length, word_to_index):
    return np.array([bag_of_words(sent, vocab_length, word_to_index) for sent in corpus])
The problem is that for each sentence, it takes nearly 1 second to create the numpy vector. Seeing as my corpus is 12.2 M sentences, I'd rather not wait the ~4.7 months it would take to process it. Can anyone give me any advice on speeding up this code? I thought about trying a smarter hashing technique, but I'm not sure that will give me the improvement I'm looking for.
Why are you creating a complete eye array? Simply do:
for w in sent:
    if w not in words:
        ind = word_to_index[w]
        rep[ind] += 1
        # rep += np.eye(vocab_length)[word_to_index[w]]
        words.append(w)
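Folded back into the original function, the suggestion looks roughly like this (a sketch keeping the question's signature; the only change is indexing rep directly instead of building an identity matrix for every word):

import numpy as np

def bag_of_words(sent, vocab_length, word_to_index):
    words = []
    rep = np.zeros(vocab_length)
    for w in sent:
        if w not in words:
            # Touch only the single affected component; no vocab_length x vocab_length
            # identity matrix is allocated per word.
            rep[word_to_index[w]] += 1
            words.append(w)
    return rep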
You can also try casting a sentence to sets.Set to eliminate duplicates. You should also use sets.Set for words, since the in operator runs in O(1) if you are using a Set.
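For illustration, here is a version that applies both suggestions at once, using the built-in set (the modern replacement for sets.Set) so that duplicate elimination and membership testing are both O(1); the name bag_of_words_set is assumed for this sketch:

import numpy as np

def bag_of_words_set(sent, vocab_length, word_to_index):
    rep = np.zeros(vocab_length)
    # Casting the sentence to a set removes duplicate words up front,
    # so no per-word membership check is needed inside the loop.
    for w in set(sent):
        rep[word_to_index[w]] = 1.0  # binary bag-of-words
    return rep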