I need to find the number of words in a file. A word is any sequence of one or more alphanumeric characters, with any leading and trailing non-alphanumeric characters removed.
Here is the code I have so far:
num_words = 0
textfile = open('gettysburg.txt', 'r').read()
words = textfile.split()
for word in words:
    if len(word) >= 1:
        num_words += 1
print(num_words)
The counter gives me 268, but there are 271 words in the text. There are four words that are separated by dashes or "--" which are being counted as 2 words. How do I strip the non-letter characters to display these 4 words?
I don't think you want to strip the hyphens; you just want them treated as characters that can be part of a word. You could use a regular expression:
import re
re.findall(r'[\w\-]+', 'words in sentence. some hyphenated-together.')
gives
['words', 'in', 'sentence', 'some', 'hyphenated-together']
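To turn this into a word count, something like the sketch below might work. The helper name `count_words` and the sample string are made up for illustration. The default pattern `[A-Za-z0-9]+` is an assumption based on the question's definition of a word as a run of alphanumeric characters, so the pieces on either side of a dash are counted separately.

```python
import re

def count_words(text, pattern=r'[A-Za-z0-9]+'):
    """Count the non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))

# Sample text invented for illustration; it is not taken from the file.
sample = 'a new nation--conceived in liberty. hyphenated-together'

# Runs of alphanumerics only: dashes act as separators.
print(count_words(sample))

# Hyphens kept as part of a word, as in the pattern above.
print(count_words(sample, r'[\w-]+'))
```

Which pattern is right depends on whether a dash-joined pair should count as one word or two.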
Hey, you are incredibly close.
The str.split() method takes a parameter sep, which defaults to whitespace. You can pass a different separator string to split on instead.
num_words = 0
with open('gettysburg.txt', 'r') as f:
    textfile = f.read()
words = textfile.split()
for word in words:
    # Split on "-" and drop the empty pieces that "--" produces
    count = len([part for part in word.split("-") if part])
    num_words += count
print(num_words)
Python Tutorials has a nice description about the function.
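One thing to watch for when splitting on "-": a double dash leaves an empty string between the two words, so counting the pieces directly over-counts. Here is a small sketch of that behavior. The sample tokens and the helper name `count_parts` are invented for illustration.

```python
# Sample tokens invented for illustration
single = 'hyphenated-together'
double = 'nation--conceived'

print(single.split('-'))   # ['hyphenated', 'together']
print(double.split('-'))   # ['nation', '', 'conceived']

def count_parts(token):
    """Count the non-empty pieces of token after splitting on '-'."""
    return len([part for part in token.split('-') if part])

print(count_parts(double))  # 2
```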