How to convert multiple sentences into bigrams in Python

Question

I'm fairly new to Python and would like to convert an array of sentences to bigrams. Is there a way to do this? For example:

X = ['I like u', 'u like me', ...]

With ngram = 2, I expect the vocabulary to look something like

[0: 'I ',
 1: ' l',
 2: 'li',
 3: 'ik',
 4: 'ke',
 5: 'e ',
 6: ' u',
 7: 'u ',
 8: ' m',
 9: 'me'...]

so X can be converted to

 X_conv = [ '0, 1, 2, 3, 4, 5, 6',
            '7, 1, 2, 3, 4, 5, 8, 9',....]

Is there a function I can use for this, perhaps with CountVectorizer?

Answer 1

Say you have the function ngrams:

def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text)-n+1)]

Applying this to every element of a list is then rather easy:

>>> sentences = ['I like u', 'u like me']
>>> processed = [ngrams(sentence, n=2) for sentence in sentences]
>>> processed
[['I ', ' l', 'li', 'ik', 'ke', 'e ', ' u'], 
 ['u ', ' l', 'li', 'ik', 'ke', 'e ', ' m', 'me']]

So that is rather easy. To number the ngrams, you could build nested for loops, but it wouldn't look nice.
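For comparison, the nested-loop version might look roughly like this. It is only a sketch: the names ids, row and numbered_loop are made up for illustration, and each new ngram simply gets the current size of the dictionary as its number.

```python
def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

sentences = ['I like u', 'u like me']

ids = {}            # ngram -> number (illustrative name)
numbered_loop = []  # one list of numbers per sentence
for sentence in sentences:
    row = []
    for ngram in ngrams(sentence):
        if ngram not in ids:
            # first time we see this ngram: give it the next free number
            ids[ngram] = len(ids)
        row.append(ids[ngram])
    numbered_loop.append(row)

print(numbered_loop)
# [[0, 1, 2, 3, 4, 5, 6], [7, 1, 2, 3, 4, 5, 8, 9]]
```

It works, but the bookkeeping for unseen ngrams is mixed into the loop body.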

Instead we can use a trick: collections.defaultdict, which creates a new item the first time a missing key is accessed. We couple this with itertools.count(), which returns an iterator that counts up from 0. Its __next__ method is a callable that returns 0 on the first call, 1 on the second, and so forth. defaultdict calls this method once for each new key, so every distinct ngram gets the next free number:

from collections import defaultdict
from itertools import count

reverse_vocabulary = defaultdict(count().__next__)
numbered = [[reverse_vocabulary[ngram] for ngram in sentence]
            for sentence in processed]
print(numbered)
# [[0, 1, 2, 3, 4, 5, 6], [7, 1, 2, 3, 4, 5, 8, 9]]
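To see the mechanism in isolation, here is a small sketch with a throwaway dictionary (the variable names are only for illustration). Note that a plain lookup of an unknown key also inserts it, so once the vocabulary is built it is probably safer to test membership with in rather than indexing.

```python
from collections import defaultdict
from itertools import count

ids = defaultdict(count().__next__)

first = ids['li']    # missing key: the counter is advanced, 0 is stored
second = ids['ik']   # missing key: the counter is advanced, 1 is stored
again = ids['li']    # existing key: the counter is not touched

print(first, second, again)
# 0 1 0

# A membership test does not create an entry, unlike ids['ke'] would
print('ke' in ids, len(ids))
# False 2
```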

The reverse vocabulary maps each ngram to its number, which is the opposite of what you want:

defaultdict(<...>, {' m': 8, ' u': 6, 'I ': 0, 'li': 2, 'u ': 7, 'e ': 5, 'ke': 4, 'ik': 3, 
                    ' l': 1, 'me': 9})

Inverting the mapping turns it into an ordinary dictionary:

vocabulary = {number: ngram for ngram, number in reverse_vocabulary.items()}

so that vocabulary is a plain dict mapping numbers to ngrams:

{0: 'I ', 1: ' l', 2: 'li', 3: 'ik', 4: 'ke', 5: 'e ', 6: ' u', 7: 'u ', 8: ' m', 9: 'me'}
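Putting the pieces together, one possible way to get the comma-separated strings from the question is sketched below. The helper name encode is hypothetical, not part of any library; it just wraps the steps of this answer in a function.

```python
from collections import defaultdict
from itertools import count

def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def encode(sentences, n=2):
    """Hypothetical helper: return (lists of numbers, number -> ngram dict)."""
    ids = defaultdict(count().__next__)
    numbered = [[ids[ngram] for ngram in ngrams(sentence, n)]
                for sentence in sentences]
    vocabulary = {number: ngram for ngram, number in ids.items()}
    return numbered, vocabulary

numbered, vocabulary = encode(['I like u', 'u like me'])

# Join each list of numbers into the string format used in the question
X_conv = [', '.join(map(str, row)) for row in numbered]
print(X_conv)
# ['0, 1, 2, 3, 4, 5, 6', '7, 1, 2, 3, 4, 5, 8, 9']
```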
