
How do I count the number of sentences, words and characters in a file?

I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me find and print the number of sentences, words and characters in the file? I have used NLTK in Python for this.

>>> import nltk.data
>>> import nltk.tokenize
>>> f = open('samp.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     words = nltk.tokenize.word_tokenize(each_sentence)
...     print each_sentence   # prints tokenized sentences from samp.txt
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print each_word       # prints tokenized words from samp.txt

Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):

import nltk
folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print "The number of sentences =", len(corpusReader.sents())
print "The number of patagraphs =", len(corpusReader.paras())
print "The number of words =", len([word for sentence in corpusReader.sents() for word in sentence])
print "The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word])

Hope this helps

For what it's worth, if someone comes along here: I think this addresses everything the OP asked. With the textstat package, counting sentences and characters is very easy. Punctuation at the end of each sentence matters here.

import textstat

your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))

With NLTK, you can also use FreqDist (see chapter 3.1 of the O'Reilly NLTK book).

And in your case:

import nltk
raw = open('samp.txt').read()
raw = nltk.Text(nltk.word_tokenize(raw.decode('utf-8')))  # decode is needed on Python 2
fdist = nltk.FreqDist(raw)
print fdist.N()   # total number of tokens, i.e. the word count
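
Here fdist.N() is the total number of tokens and fdist.B() the number of distinct ones. A Python 3 sketch that also covers the sentence and character counts the OP asked for:

import nltk

with open('samp.txt', encoding='utf-8') as f:
    raw = f.read()

tokens = nltk.word_tokenize(raw)
fdist = nltk.FreqDist(tokens)

print("sentences:", len(nltk.sent_tokenize(raw)))
print("words (total tokens):", fdist.N())
print("distinct words:", fdist.B())
print("characters:", len(raw))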
  • Characters are easy to count.
  • Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though their entries can be delimited by two newlines each. A heading or a title can also be followed by two newlines, even though they're clearly not paragraphs. Also consider the case of a single paragraph in a file, with one newline or no newline following it.
  • Sentences are tricky. You might settle for a period, exclamation mark or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks the end of a sentence and sometimes it doesn't. Usually when it does, the next non-whitespace character is capitalized, in the case of English. But sometimes it isn't, for example if it's a digit. And sometimes an open parenthesis marks the end of a sentence (though that is arguable, as in this case).
  • Words are tricky too. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.

For words and sentences, you will probably need to clearly state your definition of a sentence and a word and program for that; a rough sketch of one such set of rules follows.
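
To make that concrete, here is a rough regex-based counter encoding one possible set of rules (purely an illustrative sketch; the abbreviation, colon and hyphen cases above will still trip it up):

import re

def rough_counts(text):
    # paragraphs: blocks of text separated by blank lines
    paragraphs = [p for p in re.split(r'\n\s*\n', text) if p.strip()]
    # sentences: anything ending in . ! or ? followed by whitespace or end of text
    sentences = re.findall(r'[^.!?]+[.!?]+(?=\s|$)', text)
    # words: runs of non-whitespace characters
    words = text.split()
    return {'paragraphs': len(paragraphs), 'sentences': len(sentences),
            'words': len(words), 'characters': len(text)}

with open('samp.txt') as f:
    print(rough_counts(f.read()))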

Not 100% correct, but I gave it a try. I have not taken all the points by @wilhelmtell into consideration. I'll try them once I have time...

# c, w and s hold the running character, word and "sentence" counts;
# a "sentence" here is really a blank-line-separated block of text.
if __name__ == "__main__":
    f = open("1.txt")
    c = w = 0
    s = 1
    prevIsSentence = False
    for x in f:
        x = x.strip()
        if x != "":
            words = x.split()
            w = w + len(words)
            c = c + sum([len(word) for word in words])
            prevIsSentence = True
        else:
            # a blank line closes the current block
            if prevIsSentence:
                s = s + 1
            prevIsSentence = False

    # if the file ended on blank lines, the last block was already counted
    if not prevIsSentence:
        s = s - 1
    print "%d:%d:%d" % (c, w, s)

Here 1.txt is the file name.

The only way you can solve this is by creating an AI program that uses Natural Language Processing (NLP), which is not very easy to do.

Input:

"This is a paragraph about the Turing machine. Dr. Allan Turing invented the Turing Machine. It solved a problem that has a .1% change of being solved."

Check out OpenNLP:

https://sourceforge.net/projects/opennlp/

http://opennlp.apache.org/
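
For comparison, the punkt model that ships with NLTK (which the OP is already calling through sent_tokenize) is a statistically trained sentence splitter of exactly this kind. A quick check on the example paragraph; how well it copes with "Dr." and ".1%" depends on the pretrained model:

import nltk

paragraph = ("This is a paragraph about the Turing machine. "
             "Dr. Allan Turing invented the Turing Machine. "
             "It solved a problem that has a .1% chance of being solved.")

# sent_tokenize uses the pretrained punkt sentence-boundary model
for sentence in nltk.sent_tokenize(paragraph):
    print(sentence)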

I believe this to be the right solution because it properly counts things like "..." and "??" as a single sentence:

len(re.findall(r"[^?!.][?!.]", paragraph))
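
A self-contained check of that pattern; note that it requires a non-terminator character before the punctuation, which is how "..." and "??" collapse into a single match (and why a sentence with no terminator at all would be missed):

import re

paragraph = "Wait... are you sure?? Yes. It works!"
# matches: "t.", "e?", "s.", "s!" -> 4 sentences
print(len(re.findall(r"[^?!.][?!.]", paragraph)))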

There is already a program that counts words and characters: wc (for example, wc samp.txt prints the file's line, word and byte counts).
