
Fast way to create a bag-of-words vector in python

I have a corpus which has been sentence-tokenized and word-tokenized. Working in Python, I took the 9,999 most common words and replaced out-of-vocabulary words with a special 'UNK' token, so that I have a vocabulary of 10,000 words and a Python dictionary 'word_to_index' which maps each word to an integer.
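For reference, a minimal sketch of how such a mapping might be built (the function name build_word_to_index and the corpus layout, a list of token lists, are assumptions for illustration, not from the post):

from collections import Counter

def build_word_to_index(corpus, vocab_size=10000):
    # Count word frequencies across all tokenized sentences.
    counts = Counter(w for sent in corpus for w in sent)
    # Reserve one slot for 'UNK', so take the vocab_size - 1 most common words.
    most_common = [w for w, _ in counts.most_common(vocab_size - 1)]
    word_to_index = {w: i for i, w in enumerate(most_common)}
    word_to_index['UNK'] = len(word_to_index)
    return word_to_index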

I would like a binary bag-of-words representation, where the representation of each of the original sentences is a 10,000-dimensional numpy vector of 0s and 1s. If word i from the vocabulary is in the sentence, index [i] in the numpy array will be 1; otherwise, 0. Until now, I've been using the following code:

import numpy as np

def bag_of_words(sent, vocab_length, word_to_index):
    words = []
    rep = np.zeros(vocab_length)
    for w in sent:
        if w not in words:
            # Add the one-hot row for this word, taken from a freshly built identity matrix
            rep += np.eye(vocab_length)[word_to_index[w]]
            words.append(w)
    return rep

def get_bag_of_words_corpus(corpus, vocab_length, word_to_index):
    return np.array([bag_of_words(sent, vocab_length, word_to_index) for sent in corpus])

The problem is that for each sentence, it takes nearly 1 second to create the numpy vector. Since my corpus is 12.2M sentences, I'd rather not wait the ~4.7 months it would take to process it. Can anyone give me any advice on speeding up this code? I thought about trying a smarter hashing technique, but I'm not sure that will give me the improvement I'm looking for.

Why are you creating a complete eye array? np.eye(vocab_length) allocates an entire 10,000 × 10,000 identity matrix on every iteration, only to select a single row from it.

Simply do

for w in sent:
    if w not in words:
        ind = word_to_index[w]
        rep[ind] += 1
        # rep += np.eye(vocab_length)[word_to_index[w]]
        words.append(w)

You can also try casting a sentence to a set to eliminate duplicates (sets.Set in Python 2; the built-in set in Python 3). You should also use a set for words, since the in test runs in O(1) for a set, versus O(n) for a list.
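Putting both suggestions together, a minimal sketch of a set-based version (the name bag_of_words_fast is mine, not from the answer, and it assumes every word in sent already appears in word_to_index):

import numpy as np

def bag_of_words_fast(sent, vocab_length, word_to_index):
    rep = np.zeros(vocab_length)
    # set(sent) removes duplicate words up front, so each vocabulary
    # index is written at most once and the result stays binary.
    for w in set(sent):
        rep[word_to_index[w]] = 1
    return rep

This replaces the per-word allocation of a 10,000 × 10,000 identity matrix with a single index assignment, and the O(n) list membership test with an O(1) set lookup.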

Source
