简体   繁体   中英

Can't Import Bigrams From NLTK Library

quick question that is confusing me. I have NLTK installed and it has been working fine. However I am trying to get bigrams of a corpus and want to use bigrams(corpus) basically.. but it says that bigrams is not defined when i "from nltk import bigrams"

Same with trigrams. Am I missing something? Also, how could I get bigrams from the corpus manually.

I am also looking to calculate the frequencies of bigrams trigrams and quads, but am unsure exactly how to go about this.

I have the corpus tokenized with "<s>" and "</s>" at the beginning and end appropriately. Program so far here:

 #!/usr/bin/env python
import re
import nltk
import nltk.corpus as corpus
import tokenize
from nltk.corpus import brown

def alter_list(row):
    if row[-1] == '.':
        row[-1] = '</s>'
    else:
        row.append('</s>')
    return ['<s>'] + row

news = corpus.brown.sents(categories = 'editorial')
print len(news),'\n'

x = len(news)
for row in news[:x]:
    print(alter_list(row))

I tested this in a virtualenv and it works:

In [20]: from nltk import bigrams

In [21]: bigrams('This is a test')
Out[21]: 
[('T', 'h'),
 ('h', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'a'),
 ('a', ' '),
 (' ', 't'),
 ('t', 'e'),
 ('e', 's'),
 ('s', 't')]

Is that the only error you're getting?

By the way, as for your second question:

from collections import Counter
In [44]: b = bigrams('This is a test')

In [45]: Counter(b)
Out[45]: Counter({('i', 's'): 2, ('s', ' '): 2, ('a', ' '): 1, (' ', 't'): 1, ('e', 's'): 1, ('h', 'i'): 1, ('t', 'e'): 1, ('T', 'h'): 1, (' ', 'i'): 1, (' ', 'a'): 1, ('s', 't'): 1})

For words:

In [49]: b = bigrams("This is a test".split(' '))

In [50]: b
Out[50]: [('This', 'is'), ('is', 'a'), ('a', 'test')]

In [51]: Counter(b)
Out[51]: Counter({('is', 'a'): 1, ('a', 'test'): 1, ('This', 'is'): 1})

This split by words obviously is very superficial but depending on your application it may suffice. Obviously you could use nltk's tokenize which is far more sophisticated.

In order to accomplish your final goal, you can do something like that:

In [56]: d = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

In [56]: from nltk import trigrams
In [57]: tri = trigrams(d.split(' '))

In [60]: counter = Counter(tri)

In [61]: import random

In [62]: random.sample(counter, 5)
Out[62]: 
[('Ipsum', 'has', 'been'),
 ('industry.', 'Lorem', 'Ipsum'),
 ('Ipsum', 'passages,', 'and'),
 ('was', 'popularised', 'in'),
 ('galley', 'of', 'type')]

I trimmed the output because it was unnecessarily large, but you get the idea.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM