
Bag of words model using NLTK

I am trying to implement a bag-of-words model, but I can't get the code below to work:

words_to_index={'hi': 0, 'you': 1, 'me': 2, 'are': 3}
ex=["hi how are you"]
Z=ex.split(" ")
ans=[[1,1,0,1]]
res=np.zeros(40)
for i in range(0,len(ex)+1):
    for key,val in words_to_index.items():
        if Z[i]==key:
            res[words_to_index[key]]=res[words_to_index[key]]+1
print(res)

I get this error: AttributeError: 'list' object has no attribute 'split'

Your code contains a number of bugs and inefficiencies.

Before you proceed, perhaps spend a moment to figure out how to get your program to tell you when an assumption of yours might not be correct. A good place to start is to add this after the assignment of ex:

print('ex is a {0}: {1!r}'.format(type(ex), ex))

which prints out the type of the variable and its value. With that in place, you will easily spot the problem:

ex is a <class 'list'>: ['hi how are you']

A slightly more advanced technique is to use logging, which allows you to easily disable the diagnostic messages when your code is working, and later enable them again if you want to make changes to your code and see that it still does what it's supposed to do.

import logging

logging.basicConfig(level=logging.DEBUG)

# ...
logging.debug('ex is a {0}: {1!r}'.format(type(ex), ex))

When you are done debugging, simply change the logging.basicConfig() call to say level=logging.WARNING, which will disable the display of all logging.debug() and logging.info() output. See the logging documentation for details.
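
As a small sketch of that switch, the same call site can stay in the code while the configured level decides whether it prints anything. The tokenize helper and the 'bow' logger name are made up for illustration, and the output is captured in a buffer only so that the effect is easy to inspect:

```python
import io
import logging

def tokenize(text, logger):
    """Hypothetical helper: split on whitespace and log what was seen."""
    logger.debug('text is a %s: %r', type(text).__name__, text)
    return text.split()

# Capture log output in a buffer so the effect of the level is visible.
buffer = io.StringIO()
logger = logging.getLogger('bow')       # illustrative logger name
logger.addHandler(logging.StreamHandler(buffer))
logger.propagate = False                # keep the root logger out of it

logger.setLevel(logging.DEBUG)          # debugging: debug() messages are emitted
tokens = tokenize('hi how are you', logger)
debug_output = buffer.getvalue()

buffer.seek(0)
buffer.truncate()                       # empty the buffer before the second run
logger.setLevel(logging.WARNING)        # done debugging: debug() is now silent
tokenize('hi how are you', logger)
quiet_output = buffer.getvalue()

print(repr(debug_output))
print(repr(quiet_output))
```

The second run produces no output at all, even though the logger.debug() call is still in the function.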

Another useful debugging aid is assert:

assert isinstance(ex, str), 'ex is not a str: {0} ({1!r})'.format(type(ex), ex)

See the Python Wiki for some guidance. Notice that assert statements can be disabled, e.g. when you enable optimization of your Python code (python -O), so you should perhaps add explicit checks instead, or as well, in your code.

if not isinstance(ex, str):
    raise TypeError('ex must be a str, not {0} ({1!r})'.format(type(ex), ex))
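
One way such an explicit check might be packaged is inside a small function, so that the bad input is rejected at the boundary with a readable message. This is only a sketch; split_words is a hypothetical helper name, not something from the question:

```python
def split_words(ex):
    """Hypothetical helper: validate the input, then split it on whitespace."""
    if not isinstance(ex, str):
        raise TypeError('ex must be a str, not {0} ({1!r})'.format(type(ex), ex))
    return ex.split()

try:
    split_words(["hi how are you"])     # the list from the question
except TypeError as err:
    message = str(err)
    print(message)

words = split_words("hi how are you")   # a plain string passes the check
print(words)
```

Unlike assert, this check still runs when Python is started with optimization enabled.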

Now, with that out of the way, here is a refactored version of your script with what I think you were trying to do.

#!/usr/bin/env python

import numpy as np
import logging

logging.basicConfig(level=logging.DEBUG, format='%(module)s:%(asctime)s:%(message)s')

words_to_index={'hi': 0, 'you': 1, 'me': 2, 'are': 3}
ex = "hi how are you"                   # single string, not list of strings
#print('ex is {0} (type {1})'.format(ex, type(ex)))
logging.debug('ex is {0} (type {1})'.format(ex, type(ex)))
assert isinstance(ex, str), 'ex should be a string (is {0} {1!r})'.format(type(ex), ex)
Z=ex.split(" ")                         # maybe choose a more descriptive variable name
#ans=[[1,1,0,1]]                        # never used, commented out
res=np.zeros(len(words_to_index))       # one slot per vocabulary word, not 40
#for i in range(0,len(ex)+1):           # Looping over the wrong thing
for word in Z:
    logging.debug('word is {0}'.format(word))
    if word in words_to_index:          # words_to_index is a hash; no need to loop
        logging.debug('{0} found in words_to_index'.format(word))
        res[words_to_index[word]] += 1  # notice increment syntax
        logging.debug('res[{0}] = {1}'.format(words_to_index[word], res[words_to_index[word]]))
print(res)
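
For comparison, here is a minimal sketch of the same counting using only the standard library, with collections.Counter doing the tallying. The bag_of_words function name is an assumption for illustration, and the vector length is taken from the vocabulary size so that the result matches the ans list in the question:

```python
from collections import Counter

words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}

def bag_of_words(text, vocab):
    """Hypothetical helper: count how often each vocabulary word occurs in text."""
    counts = Counter(text.split())      # word -> number of occurrences
    vector = [0] * len(vocab)           # one slot per vocabulary word
    for word, index in vocab.items():
        vector[index] = counts[word]    # Counter returns 0 for missing words
    return vector

print(bag_of_words("hi how are you", words_to_index))
```

Words outside the vocabulary, such as "how", are simply ignored.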

Of course, this is not using NLTK at all. The NLTK library contains a somewhat more advanced set of functions which already do some of this for you, starting with proper NLP tokenization, but it does not actually contain a TF component. Perhaps start with the question "Does NLTK have TF-IDF implemented?", which has pointers to some existing implementations.
