
Count total number of words in a file?

I want to find the total number of words in a file (text/string). My code produces an output, but I'm not sure whether it is correct. Here are some sample files for you to try and see what you get. Also note that the use of any modules/libraries is not permitted.

sample1 - https://www.dropbox.com/s/kqwvudflxnmldqr/sample1.txt?dl=0

sample2 - https://www.dropbox.com/s/7xph5pb9bdf551h/sample2.txt?dl=0

sample3 - https://www.dropbox.com/s/4mdb5hgnxyy5n2p/sample3.txt?dl=0

There are some things you must consider before counting the words.

  1. A sentence is a sequence of words followed by a full stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance) or by white space (space, tab or new-line character). If a full stop is not at the end of a sentence, it is to be regarded as white space, so it serves to end a word. For example, 3.42 would be two words, and P.yth.on would be three words.

  2. A double hyphen (--) is to be regarded as a space character.

That being said, I first opened and read the file to get all the text. I then replaced all the useless characters with blank spaces so that it is easier to count the words. This includes '--' as well.

Then I split the text into words and created a dictionary to store the count of each word. After completing the dictionary, I added up all the values to get the total number of words and printed it. See below for the code:

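def countwords():
    filename = input("Name of file? ")
    # Read the whole file; the with-block closes it afterwards
    with open(filename, "r") as f:
        text = f.read()
    text = text.lower()
    # Treat sentence punctuation and other symbols as white space
    for ch in '!.?"#$%&()*+/:<=>@[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    text = text.replace('--', ' ')      # a double hyphen counts as a space
    text = text.rstrip("\n")
    words = text.split()                # split on any run of white space
    count = {}
    for w in words:
        count[w] = count.get(w, 0) + 1  # occurrences of each distinct word
    wordcount = sum(count.values())     # total number of words
    print(wordcount)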
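
One way to check a function like this is to pull the counting out of the file handling, so it can be run on short strings whose word count is easy to work out by hand. The following is only a minimal sketch under that assumption: the name count_words_in_text and the constant SEPARATORS are made up for illustration, and the characters treated as separators are simply the ones from the code above. It also relies on the fact that summing the dictionary values gives the same total as the length of the word list, so the dictionary is skipped here.

```python
# Hypothetical helper: same replacement steps as countwords(), but it takes
# the text directly instead of asking for a file name. No modules are used.
SEPARATORS = '!.?"#$%&()*+/:<=>@[\\]^_`{|}~'

def count_words_in_text(text):
    text = text.lower()
    for ch in SEPARATORS:
        text = text.replace(ch, ' ')    # punctuation acts as white space
    text = text.replace('--', ' ')      # a double hyphen acts as a space
    return len(text.split())            # one word per white-space-separated piece

print(count_words_in_text("3.42"))                          # 2
print(count_words_in_text("P.yth.on"))                      # 3
print(count_words_in_text('He said "Stop!" -- and left.'))  # 5
```

The first two calls reproduce the examples from rule 1. If the hand count and the printed count agree on inputs like these, the totals for the sample files are at least consistent with the stated rules.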

So for the sample1 text file, my word count is 321; for sample2, 542; and for sample3, 139.
I was hoping to compare these answers with some Python pros here to see whether my results are correct and, if they are not, what I'm doing wrong.

You can try this solution using regex.

# word counter using regex
import re
while True:
    string = input("Enter the string: ")
    if string == "Done":  # command to terminate the loop
        break
    count = len(re.findall("[a-zA-Z_]+", string))
    print(count)
print("Terminated")
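
It may help to see how this pattern behaves on the examples from the question. The sketch below wraps the same pattern in a function (count_letter_runs is a made-up name, used here only for illustration) and runs it on a few strings. Keep in mind that it depends on the re module, which the question says is not permitted, and that the pattern only matches runs of ASCII letters and underscores, so digits are never counted and an apostrophe splits a word in two.

```python
import re

def count_letter_runs(text):
    # Pattern taken from the answer above: runs of ASCII letters and underscores
    return len(re.findall("[a-zA-Z_]+", text))

print(count_letter_runs("P.yth.on"))    # 3, matches rule 1 of the question
print(count_letter_runs("3.42"))        # 0, the question expects 2
print(count_letter_runs("don't stop"))  # 3, because "don't" becomes "don" and "t"
```

So on text containing numbers or apostrophes, this count can differ from the one the question's rules ask for.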
