
How to count the number of words in a sentence, ignoring numbers, punctuation and whitespace?

How would I go about counting the words in a sentence? I'm using Python.

For example, I might have the string:

string = "I     am having  a   very  nice  23!@$      day. "

That would be 7 words. I'm having trouble with the varying number of spaces before and after each word, and with chunks made up of numbers or symbols.

str.split() without any arguments splits on runs of whitespace characters:

>>> s = 'I am having a very nice day.'
>>> 
>>> len(s.split())
7

From the linked documentation:

If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
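
As a quick check of the behaviour quoted above, here is a minimal sketch that runs split() on the exact string from the question. The variable name chunks is only illustrative. It suggests that split() alone handles the irregular spacing but still counts the numeric/symbol chunk, so some extra filtering is needed to reach 7:

```python
s = "I     am having  a   very  nice  23!@$      day. "

# split() with no arguments collapses runs of whitespace and drops
# the empty strings that leading/trailing whitespace would produce.
chunks = s.split()

print(chunks)       # ['I', 'am', 'having', 'a', 'very', 'nice', '23!@$', 'day.']
print(len(chunks))  # 8, because '23!@$' is still counted as a chunk
```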

You can use re.findall():

import re
line = " I am having a very nice day."
count = len(re.findall(r'\w+', line))
print (count)

import string

s = "I     am having  a   very  nice  23!@$      day. "
sum(i.strip(string.punctuation).isalpha() for i in s.split())

The statement above goes through each whitespace-separated chunk of text and strips leading and trailing punctuation before checking whether the chunk consists only of letters.

This is a simple word counter using regex. The script runs in a loop that you can terminate by entering Done.
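
The same idea can be wrapped in a small function. This is only a sketch: the name count_alpha_words is made up for illustration, and the second call shows a limitation worth knowing about, since strip() only removes punctuation at the ends of a chunk, so a word with an internal apostrophe fails the isalpha() test.

```python
import string


def count_alpha_words(text):
    """Count whitespace-separated chunks that are purely alphabetic
    once leading and trailing punctuation has been stripped."""
    return sum(chunk.strip(string.punctuation).isalpha() for chunk in text.split())


print(count_alpha_words("I     am having  a   very  nice  23!@$      day. "))  # 7
print(count_alpha_words("don't stop"))  # 1, "don't" contains an apostrophe
```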

# Word counter using regex (Python 3)
import re

while True:
    line = input("Enter the string: ")
    if line == "Done":  # command to terminate the loop
        break
    count = len(re.findall("[a-zA-Z_]+", line))
    print(count)
print("Terminated")

def wordCount(mystring):
    tempcount = 0  # length of the current run of spaces
    count = 1

    try:
        for character in mystring:
            if character == " ":
                tempcount += 1
                if tempcount == 1:  # the first space of a run starts a new word
                    count += 1
            else:
                tempcount = 0

        return count

    except TypeError:  # raised when mystring is not iterable
        error = "Not a string"
        return error

mystring = "I   am having   a    very nice 23!@$      day."

print(wordCount(mystring))

The output is 8, because every space-separated chunk is counted, including 23!@$.

How about using a simple loop to count the number of spaces?

txt = "Just an example here move along"
count = 1
for i in txt:
    if i == " ":
        count += 1
print(count)

OK, here is my version. I noticed that you want your output to be 7, which means you don't want to count special characters and numbers. So here is the regex pattern:

re.findall("[a-zA-Z_]+", string)

Here [a-zA-Z_] matches any character between a and z (lowercase) or A and Z (uppercase), as well as the underscore; the + matches one or more of them in a row.
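
To make the difference between the patterns in this thread concrete, here is a small sketch comparing \w+ with [a-zA-Z_]+ on the string from the question. The last line uses a made-up input to show that the underscore in the character class lets a run of underscores count as a word too.

```python
import re

s = "I     am having  a   very  nice  23!@$      day. "

# \w also matches digits, so '23' is counted as a word.
print(re.findall(r"\w+", s))
# ['I', 'am', 'having', 'a', 'very', 'nice', '23', 'day']

# Letters and underscore only: digits and symbols are skipped.
print(re.findall(r"[a-zA-Z_]+", s))
# ['I', 'am', 'having', 'a', 'very', 'nice', 'day']

# The underscore is part of the class, so '__' matches on its own.
print(re.findall(r"[a-zA-Z_]+", "snake_case __ ok"))
# ['snake_case', '__', 'ok']
```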


About spaces: if you want to remove all extra spaces, just do:

string = string.strip()  # Remove all extra spaces at the start and at the end of the string
while "  " in string:  # While there are 2 spaces between words in our string...
    string = string.replace("  ", " ")  # ... replace them with one space!
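
A shorter way to get the same normalization, sketched here as an alternative rather than as what the answer above intended, is to split on whitespace and join the pieces back with single spaces. The variable names text and normalized are illustrative. Unlike the replace loop, this also collapses tabs and newlines.

```python
text = "  I     am having  a   very  nice  23!@$      day. "

# split() drops leading/trailing whitespace and collapses inner runs;
# join() then puts exactly one space between the remaining chunks.
normalized = " ".join(text.split())

print(repr(normalized))  # 'I am having a very nice 23!@$ day.'
```
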

import string

sentence = "I     am having  a   very  nice  23!@$      day. "
# Remove all punctuation
sentence = sentence.translate(str.maketrans('', '', string.punctuation))
# Remove all digits
sentence = ''.join([ch for ch in sentence if not ch.isdigit()])
# Count each transition from a non-space character to a space.
# This relies on the sentence ending in whitespace, as it does here;
# otherwise the last word would not be counted.
count = 0
for index in range(len(sentence) - 1):
    if sentence[index + 1].isspace() and not sentence[index].isspace():
        count += 1
print(count)
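
A more compact variant of this approach, offered as a sketch rather than a definitive implementation, deletes punctuation and digits in a single translate() call and lets split() do the counting, which also works when the sentence has no trailing space. The names _DROP and count_words are made up for illustration. Note that a mixed chunk such as "abc123" would be reduced to "abc" and counted as a word.

```python
import string

# Translation table that deletes every ASCII punctuation character and digit.
_DROP = str.maketrans("", "", string.punctuation + string.digits)


def count_words(sentence):
    """Count the whitespace-separated chunks left after removing
    punctuation and digits."""
    return len(sentence.translate(_DROP).split())


print(count_words("I     am having  a   very  nice  23!@$      day. "))  # 7
print(count_words("nice day"))  # 2, no trailing space needed
```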
