I'm trying to count the number of words in a string. However, I first have to strip some punctuation, e.g.
line = "i want you , to know , my name . "
running
import string
en = line.translate(string.maketrans('', ''), '!,.?')
produces
en = "i want you  to know  my name  "
After this, I want to count the number of words in the line, but when I do len(en) I get 30 instead of 7.
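For readers on Python 3, where the two-argument form of translate shown above is not available, the stripping step might be sketched as follows. The name table is only illustrative; the rest mirrors the question.

```python
# Python 3 sketch: str.maketrans takes the characters to delete
# as its third argument.
line = "i want you , to know , my name . "
table = str.maketrans('', '', '!,.?')
en = line.translate(table)

print(repr(en))  # the spaces on either side of each removed mark remain
print(len(en))   # len() of a string counts characters, not words
```

The character count includes every space, which is why it is far larger than the number of words.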
Using split on en to tokenize and then taking the length doesn't work in all cases either. For example, consider this string:
"i ccc bcc the a of the abc ccc dd on aaa , 28 abc 19 ."
en then becomes:
"i ccc bcc the a of the abc ccc dd on aaa  28 abc 19 "
but len(en.split(' ')) returns 17 and not 15.
Can you please help? Thanks.
The problem with en.split(' ') is that you have extra whitespace in your string, which gives empty matches. You could fix this quite easily by calling en.split() instead.
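A small sketch of the difference, using the second string from the question after the punctuation has been stripped (the names with_empties and words are made up for this example):

```python
en = "i ccc bcc the a of the abc ccc dd on aaa  28 abc 19 "

# Splitting on a single space yields an empty string for every extra space,
# including the trailing one.
with_empties = en.split(' ')

# Splitting with no argument treats any run of whitespace as one separator
# and ignores leading and trailing whitespace.
words = en.split()

print(len(with_empties))
print(len(words))
```

Only the second call gives one item per word.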
But perhaps you could use this different approach using a regular expression (and now there is no need to remove the punctuation first):
import re
print len(re.findall(r'\w+', line))
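The same idea as a Python 3 sketch, run on the first string from the question:

```python
import re

line = "i want you , to know , my name . "

# \w+ matches runs of letters, digits and underscores,
# so the punctuation and spaces are skipped.
words = re.findall(r'\w+', line)

print(words)
print(len(words))
```

One caveat of this pattern: an apostrophe is not a word character, so a contraction such as "don't" is counted as two words.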
Instead of using the regex \w+ it is much faster to use \b for counting words, like so:
import re
_re_word_boundaries = re.compile(r'\b')
def num_words(line):
    return len(_re_word_boundaries.findall(line)) >> 1
Note that we have to halve the number because \b matches at both the beginning and the end of a word. Unfortunately, unlike egrep, Python does not support matching at only the beginning or the end.
If you have very long lines and are concerned about memory, using an iterator may be a better solution:
def num_words(line):
    return sum(1 for word in _re_word_boundaries.finditer(line)) >> 1
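As a sanity check, the boundary-counting approach can be compared against \w+ in Python 3. This is only a sketch: the function names num_words_boundaries and num_words_findall are invented here, and it checks that the counts agree, not which one is faster.

```python
import re

_re_word_boundaries = re.compile(r'\b')
_re_words = re.compile(r'\w+')

def num_words_boundaries(line):
    # Every word contributes two boundaries: one at its start, one at its end.
    return sum(1 for _ in _re_word_boundaries.finditer(line)) // 2

def num_words_findall(line):
    # Count the runs of word characters directly.
    return len(_re_words.findall(line))

sample = "i ccc bcc the a of the abc ccc dd on aaa , 28 abc 19 ."
print(num_words_boundaries(sample), num_words_findall(sample))
```

To compare speed on your own data and Python version, the standard timeit module can be pointed at both functions.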
You can use NLTK:
import nltk
en = "i ccc bcc the a of the abc ccc dd on aaa 28 abc 19 "
print(len(nltk.word_tokenize(en)))
Output:
15
def main():
    # get the user's message
    print "this program tells you how many words are in your sentence."
    message = raw_input("Enter message: ")
    wrdcount = 0
    # split() with no argument breaks the message on runs of whitespace
    for i in message.split():
        # each item is one word, so add 1 per item
        wrdcount = wrdcount + 1
    print wrdcount
main()
The len function returns the length of its argument, which in this case is the length of the string: 30 characters. To count words, you need to split the string on whitespace and then count the number of items that are returned.
Take a look at the introductory example in the docs for collections.Counter. That shows how to find individual words in a sentence.
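A rough sketch of how Counter could be applied to the string from the question (Python 3; the name counts is arbitrary). It gives the total word count and, as a side effect, how often each word occurs:

```python
from collections import Counter

en = "i ccc bcc the a of the abc ccc dd on aaa  28 abc 19 "

# Tally each whitespace-separated word.
counts = Counter(en.split())

print(counts.most_common(3))  # the most frequent words with their counts
print(sum(counts.values()))   # total number of words
print(len(counts))            # number of distinct words
```

If all you need is the total, len(en.split()) is simpler; Counter is worth it when the per-word frequencies matter too.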