Python count words of split sentence?

Not sure how to remove the "\n" at the end of the output.

Basically, I have this txt file with sentences such as:

"What does Bessie say I have done?" I asked.

"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child 
 taking up her elders in that manner.
 
Be seated somewhere; and until you can speak pleasantly, remain silent."

I managed to split the sentences by semicolon with this code:

import re

with open("testing.txt") as file:
    read_file = file.readlines()

for i, word in enumerate(read_file):
    low = word.lower()
    print(re.split(';', low))

But I'm not sure how to count the words of the split sentences, since len() doesn't work. The output of the sentences:

['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a 
child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']

For the third sentence, for example, I am trying to count the 3 words on the left and the 8 words on the right.

Thanks for reading!

The number of words is the number of spaces plus one:

e.g. two spaces, three words:

World is wonderful

Code:

import re
import string

lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()

DELIMITER = ';'
word_count = []
for i, sentence in enumerate(lines):
    # Skip empty lines
    if not sentence.strip():
        continue
    # Remove punctuation, keeping our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))
    # Split by our delimiter
    splitted = re.split(DELIMITER, sentence)
    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in splitted])

# [[9], [7, 9], [7], [3, 8]]
print(word_count)
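One caveat with the spaces-plus-one approach: consecutive spaces each get counted, which inflates the result, while len(x.split()) does not. A quick illustration (the double space is deliberate):

```python
text = 'World  is wonderful'  # note the double space after 'World'

# spaces-plus-one counts every space character -> 4
print(1 + text.strip().count(' '))
# split() collapses runs of whitespace -> 3
print(len(text.split()))
```

For text with clean single spacing, as in the sample file above, both give the same counts.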

You'll need the nltk library:

from nltk import sent_tokenize, word_tokenize
# If the tokenizer data is missing, run nltk.download('punkt') once

mytext = """I have a dog.
The dog is called Bob."""

for sent in sent_tokenize(mytext):
    print(len(word_tokenize(sent)))

Output

5
6

Step-by-step explanation:

for sent in sent_tokenize(mytext): 
    print('Sentence >>>',sent) 
    print('List of words >>>',word_tokenize(sent)) 
    print('Count words per sentence>>>', len(word_tokenize(sent))) 

Output:

Sentence >>> I have a dog.
List of words >>> ['I', 'have', 'a', 'dog', '.']
Count words per sentence>>> 5
Sentence >>> The dog is called Bob.
List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
Count words per sentence>>> 6
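Note that word_tokenize treats punctuation as separate tokens, which is why the counts above are 5 and 6 rather than 4 and 5. If you only want actual words, you can filter out punctuation-only tokens before counting; a minimal sketch reusing the token lists shown above:

```python
# Token lists as produced by word_tokenize above
tokenized = [
    ['I', 'have', 'a', 'dog', '.'],
    ['The', 'dog', 'is', 'called', 'Bob', '.'],
]

for tokens in tokenized:
    # Keep only tokens containing at least one alphanumeric character
    words = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
    print(len(words))  # 4, then 5
```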

Use str.rstrip('\n') to remove the \n at the end of each sentence.

To count the words in a sentence, you can use len(sentence.split()) (splitting without an argument, so runs of whitespace don't produce empty strings).

To transform a list of sentences into a list of counts, you can use the map function.
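To see why split() is used without an argument, compare it with split(' ') on one of the split parts from the question: split(' ') yields an empty string for the leading space, inflating the count.

```python
part = ' and until you can speak pleasantly, remain silent."'

# split(' ') keeps an empty string for the leading space -> 9 elements
print(len(part.split(' ')))
# split() drops edge whitespace and collapses runs -> 8 words
print(len(part.split()))
```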

So here it is:

import re

with open("testing.txt") as file:
    for i, line in enumerate(file.readlines()):
        # Ignore empty lines
        if line.strip():
            # Remove the trailing newline and lowercase
            line = line.rstrip('\n').lower()
            # Split by semicolons
            parts = re.split(';', line)
            print("SENTENCES:", parts)
            # Count the words in each part
            counts = list(map(lambda part: len(part.split()), parts))
            print("COUNTS:", counts)

Output

SENTENCES: ['"what does bessie say i have done?" i asked.']
COUNTS: [9]
SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
COUNTS: [7, 9]
SENTENCES: [' taking up her elders in that manner.']
COUNTS: [7]
SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
COUNTS: [3, 8]


import re

sentences = []                                                   # empty list for storing the result
with open('testtext.txt') as fileObj:
    lines = [line.strip() for line in fileObj if line.strip()]   # make a list of lines, already stripped of '\n'
for line in lines:
    sentences += re.split(';', line)                             # split each line by ';' and collect the parts
for sentence in sentences:
    print(sentence + ' ' + str(len(sentence.split())))           # print each sentence with its word count

Try this one:

import re

with open("testing.txt") as file:
    read_file = file.readlines()
    for i, word in enumerate(read_file):
        low = word.lower()
        low = low.strip()  # strip() already removes the trailing '\n'
        print(re.split(';', low))
