
How to convert multiple sentences into bigrams in Python

I'm fairly new to Python and I would like to convert an array of sentences to bigrams. Is there a way to do this? For example:

X = ['I like u', 'u like me', ...]

If ngram = 2, I'm expecting the vocabulary to look something like

[0: 'I ',
 1: ' l',
 2: 'li',
 3: 'ik',
 4: 'ke',
 5: 'e ',
 6: ' u',
 7: 'u ',
 8: ' m',
 9: 'me'...]

so X can be converted to

 X_conv = [ '0, 1, 2, 3, 4, 5, 6',
            '7, 1, 2, 3, 4, 5, 8, 9',....]

Is there a function I can use for this with CountVectorizer?

Say you have the function ngrams:

def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text)-n+1)]

Applying this to every element of a list is straightforward:

>>> sentences = ['I like u', 'u like me']
>>> processed = [ngrams(sentence, n=2) for sentence in sentences]
>>> processed
[['I ', ' l', 'li', 'ik', 'ke', 'e ', ' u'], 
 ['u ', ' l', 'li', 'ik', 'ke', 'e ', ' m', 'me']]

So far, so good. To number the n-grams, you could write nested for loops, but that gets verbose.
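For comparison, the explicit-loop version might look like the sketch below. The names vocab_index and row are illustrative and not part of the original answer; each n-gram gets the next free number the first time it is seen.

```python
def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

sentences = ['I like u', 'u like me']

vocab_index = {}   # illustrative name: maps each n-gram to its number
numbered = []
for sentence in sentences:
    row = []
    for gram in ngrams(sentence):
        # A new n-gram gets the next unused number (the current dict size).
        if gram not in vocab_index:
            vocab_index[gram] = len(vocab_index)
        row.append(vocab_index[gram])
    numbered.append(row)

print(numbered)
# [[0, 1, 2, 3, 4, 5, 6], [7, 1, 2, 3, 4, 5, 8, 9]]
```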

Instead, we can use a trick: collections.defaultdict, which creates a new item when a missing key is first accessed. We couple this with itertools.count(), which returns an iterator that counts up from 0. Its __next__ method is a callable that returns 0 the first time it is called, 1 the second time, and so forth. defaultdict calls this method once for each new key, so every previously unseen n-gram gets the next unused number:

from collections import defaultdict
from itertools import count

reverse_vocabulary = defaultdict(count().__next__)
numbered = [[reverse_vocabulary[ngram] for ngram in sentence]
            for sentence in processed]
print(numbered)
# [[0, 1, 2, 3, 4, 5, 6], [7, 1, 2, 3, 4, 5, 8, 9]]
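To see the mechanism in isolation, here is a minimal sketch using single-letter keys (the name ids is only for illustration). It also shows a side effect worth knowing: merely looking up an unknown key with [] adds it to the dictionary, whereas .get() does not.

```python
from collections import defaultdict
from itertools import count

ids = defaultdict(count().__next__)

first = ids['a']    # missing key: the factory is called and returns 0
second = ids['b']   # another missing key: the factory returns 1
again = ids['a']    # existing key: the stored 0 is returned, no factory call
print(first, second, again)
# 0 1 0

# .get() never triggers the factory, so 'z' is not added.
print(ids.get('z'))
# None
print(dict(ids))
# {'a': 0, 'b': 1}
```

If the mapping should stop growing once the vocabulary is built, converting it with dict(ids) gives a plain dictionary that raises KeyError for unknown n-grams.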

The reverse vocabulary now maps each n-gram to its number, which is the opposite of what you want:

defaultdict(<...>, {' m': 8, ' u': 6, 'I ': 0, 'li': 2, 'u ': 7, 'e ': 5, 'ke': 4, 'ik': 3, 
                    ' l': 1, 'me': 9})

We turn it into an ordinary dictionary by inverting the mapping:

vocabulary = {number: ngram for ngram, number in reverse_vocabulary.items()}

which results in vocabulary being an ordinary dictionary:

{0: 'I ', 1: ' l', 2: 'li', 3: 'ik', 4: 'ke', 5: 'e ', 6: ' u', 7: 'u ', 8: ' m', 9: 'me'}
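The question asked for X_conv as comma-separated strings rather than lists of integers. The sketch below puts the steps together and adds that final join. The helpers encode and decode are hypothetical names introduced here, and decode assumes each encoded string contains at least one n-gram.

```python
from collections import defaultdict
from itertools import count

def ngrams(text, n=2):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def encode(sentences, n=2):
    """Hypothetical helper: return (encoded strings, vocabulary)."""
    reverse_vocabulary = defaultdict(count().__next__)
    numbered = [[reverse_vocabulary[gram] for gram in ngrams(sentence, n)]
                for sentence in sentences]
    vocabulary = {number: gram for gram, number in reverse_vocabulary.items()}
    # Join each list of numbers into the '0, 1, 2' string form from the question.
    encoded = [', '.join(map(str, row)) for row in numbered]
    return encoded, vocabulary

def decode(encoded_sentence, vocabulary):
    """Hypothetical helper: rebuild the text from one encoded string."""
    grams = [vocabulary[int(token)] for token in encoded_sentence.split(', ')]
    # Consecutive n-grams overlap, so each one after the first adds one character.
    return grams[0] + ''.join(gram[-1] for gram in grams[1:])

X = ['I like u', 'u like me']
X_conv, vocabulary = encode(X)
print(X_conv)
# ['0, 1, 2, 3, 4, 5, 6', '7, 1, 2, 3, 4, 5, 8, 9']
print(decode(X_conv[1], vocabulary))
# u like me
```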
