
How to calculate the number of characters in each sentence of a text file?

I want to split a text into sentences and then print the number of characters in each sentence, but my program does not compute those counts correctly.

I have tried to tokenize the file entered by the user into sentences and then loop through the sentences, counting and printing the number of characters in each. The code I've tried is:

from collections import defaultdict
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize,wordpunct_tokenize
import re
import os
import sys
from pathlib import Path

while True:
    try:
        file_to_open = Path(input("\nYOU SELECTED OPTION 8: CALCULATE SENTENCE LENGTH. Please, insert your file path: "))
        with open(file_to_open,'r', encoding="utf-8") as f:
            words = sent_tokenize(f.read())
            break
    except FileNotFoundError:
        print("\nFile not found. Better try again")
    except IsADirectoryError:
        print("\nIncorrect Directory path.Try again")


print('\n\n This file contains',len(words),'sentences in total')



wordcounts = []
caracter_count=0
sent_number=1
with open(file_to_open) as f:
    text = f.read()
    sentences = sent_tokenize(text)
    for sentence in sentences:
        if sentence.isspace() !=True:
            caracter_count = caracter_count + 1
            print("Sentence", sent_number,'contains',caracter_count, 
'characters')
            sent_number +=1
            caracter_count = caracter_count + 1

I want to print something like:

"SENTENCE 1 HAS 35 CHARACTERS"
"SENTENCE 2 HAS 45 CHARACTERS"

and so on.

The output that I'm getting with this program is:

This file contains 4 sentences in total
"Sentence 1 contains 0 characters"
"Sentence 2 contains 1 characters"
"Sentence 3 contains 2 characters"
"Sentence 4 contains 3 characters"

Could anyone help me do that?

You're not counting the characters in each sentence with caracter_count; the loop only increments it by 1 per sentence, so it never reflects the sentence length. I think that changing your for loop to:

sentence_number = 1
for sentence in sentences:
    if not sentence.isspace():
        print("Sentence {} contains {} characters".format(sentence_number, len(sentence)))
        sentence_number += 1

will work fine.
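As a minimal sketch of the same fix, the loop can be wrapped in a small helper. The function name `sentence_lengths` and the sample sentences are made up for illustration; in the real program the list would come from `sent_tokenize(text)`.

```python
def sentence_lengths(sentences):
    """Return (sentence_number, character_count) pairs, skipping blank entries."""
    results = []
    number = 1
    for sentence in sentences:
        # Skip empty or whitespace-only strings
        if sentence.strip():
            # len() counts every character, including spaces and punctuation
            results.append((number, len(sentence)))
            number += 1
    return results


# Hypothetical input; replace with sent_tokenize(text) in the real program
sample = ["Hello world.", "   ", "This is the second sentence."]
for number, count in sentence_lengths(sample):
    print(f"Sentence {number} contains {count} characters")
```

Returning the pairs instead of printing inside the helper keeps the counting separate from the output format, so the message text can be changed without touching the loop.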

Your question is interesting, and the problem has a simple solution. On the first run, call nltk.download('punkt') to fetch the tokenizer data; after that you can comment the line out.

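import nltk
# nltk.download('punkt')  # needed on the first run only
from nltk.tokenize import sent_tokenize

def count_lines(file):
    # Read the whole file; the with-block closes it automatically
    with open(file, "r") as myfile:
        string = myfile.read()
    print(string)

    # Split the text into a list of sentences
    sentences = sent_tokenize(string)

    # len(w) is the sentence length in characters
    for count, w in enumerate(sentences, start=1):
        print("Sentence ", count, "has ", len(w), "words")

# Raw string so the backslashes in the Windows path are not treated as escapes
count_lines(r"D:\Atharva\demo.txt")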

OUTPUT:

What is Python language?Python is a widely used high-level, general-purpose, interpreted, dynamic programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than possible in languages such as C++ or Java. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles. It features a dynamic type system and automatic memory management and has a large and comprehensive standard library. The best way we learn anything is by practice and exercise questions. We  have started this section for those (beginner to intermediate) who are familiar with Python.  
Sentence  1 has  119 words
Sentence  2 has  175 words
Sentence  3 has  134 words
Sentence  4 has  117 words
Sentence  5 has  69 words
Sentence  6 has  95 words
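
Note that `len(w)` counts characters, so the figures above are character counts even though the label says "words". A small sketch using the first sentence of the sample text shows the difference; splitting on whitespace is only one simple assumption about what counts as a word.

```python
first_sentence = (
    "What is Python language?Python is a widely used high-level, "
    "general-purpose, interpreted, dynamic programming language."
)

char_count = len(first_sentence)          # characters, spaces and punctuation included
word_count = len(first_sentence.split())  # whitespace-separated tokens

print("Characters:", char_count)
print("Words:", word_count)
```

The character count matches the value reported for sentence 1, which is why that figure answers the original question about characters.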
