
How to convert multiple sentences into bigrams in Python

I'm fairly new to Python and I would like to convert an array of sentences to bigrams. Is there a way to do this? For example:

X = ['I like u', 'u like me', ...]

If ngram = 2, I expect the vocabulary to look something like this:

[0: 'I ',
 1: ' l',
 2: 'li',
 3: 'ik',
 4: 'ke',
 5: 'e ',
 6: ' u',
 7: 'u ',
 8: ' m',
 9: 'me'...]

so that X can be converted to

 X_conv = [ '0, 1, 2, 3, 4, 5, 6',
            '7, 1, 2, 3, 4, 5, 8, 9',....]

Is there a function I can use for this, for example with CountVectorizer?

Say you have the function ngrams:

def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text)-n+1)]
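
As a quick check of how this function behaves, here is a small sketch that calls it with other values of n and with an input shorter than n. The sample strings are illustrative only.

```python
def ngrams(text, n=2):
    # Slide a window of width n over the string, one character at a time.
    return [text[i:i+n] for i in range(len(text)-n+1)]

bigrams = ngrams('like')
trigrams = ngrams('like', n=3)
too_short = ngrams('u', n=2)

print(bigrams)    # ['li', 'ik', 'ke']
print(trigrams)   # ['lik', 'ike']
print(too_short)  # [] because the string is shorter than n
```

Note that the n-grams are character-level and include spaces, which matches the vocabulary asked for in the question.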

Applying this to all elements of a list is rather easy:

>>> sentences = ['I like u', 'u like me']
>>> processed = [ngrams(sentence, n=2) for sentence in sentences]
>>> processed
[['I ', ' l', 'li', 'ik', 'ke', 'e ', ' u'], 
 ['u ', ' l', 'li', 'ik', 'ke', 'e ', ' m', 'me']]

So that part is rather easy. To number the n-grams, you could write nested for loops, but that would not look nice.

Instead we can use a trick: collections.defaultdict, which creates a new item when a missing key is first accessed. We couple this with itertools.count(), which returns an iterator that counts up from 0. Its __next__ method is a callable that returns 0 the first time it is called, then 1, and so forth. defaultdict calls this method once for each new key, so every distinct n-gram receives the next unused number:

from collections import defaultdict
from itertools import count

reverse_vocabulary = defaultdict(count().__next__)
numbered = [[reverse_vocabulary[ngram] for ngram in sentence]
            for sentence in processed]
print(numbered)
# [[0, 1, 2, 3, 4, 5, 6], [7, 1, 2, 3, 4, 5, 8, 9]]
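
The question asked for one comma-separated string per sentence rather than a list of integers. A minimal sketch of that last step, assuming the same ngrams function and sample sentences as above (the name X_conv is taken from the question):

```python
from collections import defaultdict
from itertools import count

def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text)-n+1)]

sentences = ['I like u', 'u like me']
reverse_vocabulary = defaultdict(count().__next__)
numbered = [[reverse_vocabulary[ngram] for ngram in ngrams(sentence)]
            for sentence in sentences]

# Join each list of ids into one comma-separated string per sentence.
X_conv = [', '.join(str(i) for i in ids) for ids in numbered]
print(X_conv)
# ['0, 1, 2, 3, 4, 5, 6', '7, 1, 2, 3, 4, 5, 8, 9']
```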

Now the reverse vocabulary is the opposite of what you'd want, because it maps each n-gram to its number:

defaultdict(<...>, {' m': 8, ' u': 6, 'I ': 0, 'li': 2, 'u ': 7, 'e ': 5, 'ke': 4, 'ik': 3, 
                    ' l': 1, 'me': 9})

We make an ordinary dictionary of it by inverting the mapping:

vocabulary = {number: ngram for ngram, number in reverse_vocabulary.items()}

which results in vocabulary being an ordinary dictionary that maps each number to its n-gram:

{0: 'I ', 1: ' l', 2: 'li', 3: 'ik', 4: 'ke', 5: 'e ', 6: ' u', 7: 'u ', 8: ' m', 9: 'me'}
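
The steps above can be bundled into one helper, sketched below. The names build_vocabulary and decode are hypothetical and not part of the original answer; decode is included only to check that the numbering can be reversed, using the fact that consecutive character n-grams overlap in all but their last character.

```python
from collections import defaultdict
from itertools import count

def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text)-n+1)]

def build_vocabulary(sentences, n=2):
    """Return (vocabulary, numbered): id -> n-gram, and the ids per sentence."""
    reverse_vocabulary = defaultdict(count().__next__)
    numbered = [[reverse_vocabulary[g] for g in ngrams(s, n)]
                for s in sentences]
    # Invert into a plain dict so that later lookups cannot add new entries.
    vocabulary = {number: g for g, number in reverse_vocabulary.items()}
    return vocabulary, numbered

def decode(ids, vocabulary):
    """Rebuild a sentence from the ids of its consecutive n-grams."""
    grams = [vocabulary[i] for i in ids]
    if not grams:
        return ''
    # Keep the first n-gram whole, then append the last character of each next one.
    return grams[0] + ''.join(g[-1] for g in grams[1:])

vocabulary, numbered = build_vocabulary(['I like u', 'u like me'])
print(decode(numbered[1], vocabulary))  # u like me
```

Returning a plain dict rather than the defaultdict is a deliberate choice here: with a defaultdict, merely looking up an unseen n-gram would silently add it to the vocabulary.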
