
Finding the longest word in a .txt file without punctuation marks

I am doing Python file I/O exercises. I have made good progress on one in which I try to find the longest word in each line of a .txt file, but I can't get rid of the punctuation marks.

Here is the code I have:

with open("original-3.txt", 'r') as file1:
    lines = file1.readlines()
for line in lines:
    if not line == "\n":
        print(max(line.split(), key=len))

This is the output I get.

This is the original-3.txt file I am reading the data from:

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe.

"Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!"

He took his vorpal sword in hand:
Long time the manxome foe he sought,
So rested he by the Tumtum tree,
And stood a while in thought.

And, as in uffish thought he stood,
The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
And burbled as it came!

One two! One two! And through and through
The vorpal blade went snicker-snack!
He left it dead, and with its head
He went galumphing back.

"And hast thou slain the Jabberwock?
Come to my arms, my beamish boy!"
"Oh frabjous day! Callooh! Callay!"
He chortled in his joy.

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

As you can see, the punctuation marks (such as "," ";" "?" "!") are included in the output.

How can I get only the words themselves?

Thank you.

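Using a regex, it is very easy to get the length of the longest word in each line:

import re

# `lines` is the list returned by file1.readlines() in the question.
for line in lines:
    found_strings = re.findall(r'\w+', line)
    if found_strings:  # blank lines have no words; max() of an empty list raises ValueError
        print(max(len(txt) for txt in found_strings))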

You have to strip those characters from the words:

with open("original-3.txt", 'r') as file1:
    lines = file1.readlines()
for line in lines:
    if not line == "\n":
        print(max((word.strip(",?;!\"") for word in line.split()), key=len))
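
The character list above covers only the marks mentioned in the question, while the poem also contains periods and colons. A minimal sketch, assuming you want to strip every ASCII punctuation character from both ends of each word, could use string.punctuation from the standard library. The helper name longest_word and the sample lines are made up for illustration.

```python
import string


def longest_word(line):
    """Return the longest word in line after stripping surrounding punctuation.

    Hypothetical helper, not part of the original answer.
    Returns an empty string for a blank line.
    """
    words = (word.strip(string.punctuation) for word in line.split())
    return max(words, key=len, default="")


# Sample lines taken from the poem in the question.
samples = [
    "And the mome raths outgrabe.\n",
    "\n",
    "He took his vorpal sword in hand:\n",
]
for line in samples:
    print(repr(longest_word(line)))
```

Because strip() only touches the ends of a word, a hyphenated word such as "snicker-snack" keeps its inner hyphen with this approach.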

Or you can use regular expressions to extract everything that looks like a word (i.e. consists of letters, digits, or underscores, which is what \w matches):

import re

for line in lines:  # `lines` as read from the file in the snippet above
    words = re.findall(r"\w+", line)
    if words:  # blank lines produce no matches
        print(max(words, key=len))
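
Note that \w+ splits hyphenated words, so "snicker-snack" is seen as "snicker" and "snack". If you would rather keep such words whole, one possible pattern allows single hyphens or apostrophes between runs of letters. This is a sketch under an assumed definition of "word", not part of the original answer; WORD_PATTERN and longest_word are illustrative names.

```python
import re

# Assumed definition of a "word": runs of ASCII letters, optionally joined
# by single hyphens or apostrophes (e.g. "snicker-snack").
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")


def longest_word(line):
    """Return the longest match in line, or "" if there is none."""
    return max(WORD_PATTERN.findall(line), key=len, default="")


line = "The vorpal blade went snicker-snack!"
print(max(re.findall(r"\w+", line), key=len))  # hyphenated word is split
print(longest_word(line))                      # hyphenated word is kept whole
```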

This solution does not use regular expressions. It splits the line into words, and then sanitizes each word so that it only contains alphabetical characters.

with open("original-3.txt", 'r') as file1:
    lines = file1.readlines()
    for line in lines:
        if not line == "\n":
            words = line.split()
            for i, word in enumerate(words):
                words[i] = "".join([letter for letter in word if letter.isalpha()])
            print(max(words, key=len))
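
Another regex-free variant, offered as an alternative sketch rather than as this answer's method, deletes punctuation from the whole line with str.translate before splitting. It assumes only ASCII punctuation (string.punctuation) needs removing. Like the isalpha() approach, it turns "snicker-snack" into "snickersnack". The names REMOVE_PUNCTUATION and longest_word are illustrative.

```python
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def longest_word(line):
    """Return the longest word in line once punctuation is deleted ("" if none)."""
    cleaned = line.translate(REMOVE_PUNCTUATION)
    return max(cleaned.split(), key=len, default="")


print(longest_word('"Beware the Jabberwock, my son!'))
```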
