Print a list of unique words from a text file after removing punctuation, and find the longest word

Question

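The goal is to a) print a list of unique words from a text file and b) find the longest word.

I cannot use any imports in this challenge.

The file handling and the main functionality are what I want, but the list needs cleaning. As the output shows, words are still joined to punctuation, so maxLength is clearly incorrect.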

with open("doc.txt") as reader, open("unique.txt", "w") as writer:

    unwanted = "[],."
    unique = set(reader.read().split())
    unique = list(unique) 
    unique.sort(key=len)
    regex = [elem.strip(unwanted).split() for elem in unique]
    writer.write(str(regex))
    reader.close()

    maxLength = len(max(regex,key=len ))
    print(maxLength)
    res = [word for word in regex if len(word) == maxLength]
    print(res)



===========

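Sample:

The concept of the integrated placement year was pioneered over 50 years ago [7][8][9], with more than 70% of students taking a placement year, the highest in the UK. [10]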

Answer 1

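Here is a solution that uses str.translate() to discard all the bad characters (plus newlines) before we ever call split(). (Normally we would use a regex with re.sub(), but that is not allowed here.) This makes the cleaning a one-liner, which is really neat: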

bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
#    cleaned_input = reader.read().translate(bad_transtable)

# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))   

with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')

max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
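
To see what the translation table does without touching any files, here is a small sketch that applies it to an in-memory string. The name sample and its shortened text are assumptions made for illustration; bad and bad_transtable mirror the code above.

```python
# Illustrative only: `sample` stands in for the contents of doc.txt.
sample = "Pioneered over 50 years ago [7][8][9], the highest in the UK. [10]\n"

bad = "[],.\n"
# Map every unwanted character to a space so adjacent words stay separated.
bad_transtable = str.maketrans(bad, ' ' * len(bad))

cleaned = sample.translate(bad_transtable)
words = cleaned.split()
print(words)
```

Because each bad character becomes a space rather than being deleted, "[7][8][9]" splits into three separate tokens instead of collapsing into one.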

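Notes:

Since the input is already 100% cleaned and split, there is no need to append to a list or insert into a set as we go and then make a second cleaning pass later. We can create unique_words directly (using set() to keep only the unique words). While we are at it, we may as well use sorted(..., key=lambda w: -len(w)) to sort it by decreasing length. sort() only needs to be called once, and there is no iterative appending to a list.
Hence we are guaranteed that max_length = len(unique_words[0]) (provided the file contains at least one word).
This approach is also likely to be faster than nested loops of the form for line in <lines>: for word in line.split(): ...iterative append() to wordlist
There is no need for explicit writer/reader open()/.close() calls; that is what the with statement does for you. (It also handles IO more gracefully when an exception occurs.)
You could also merge the printing of the max_length words into the writer loop, but keeping them separate is cleaner code.
Note that when we write() the output lines, we use the f-string f'{word}\n' to add the newline.
In Python we use lower_case_with_underscores for variable names, hence max_length rather than maxLength. See PEP8.
Strictly speaking, we do not need a with statement for the reader here if all we do is slurp the entire contents in one go with open('doc.txt').read(). (This does not scale to large files, where you would have to read in chunks or n lines at a time.)
str.maketrans() is a builtin, but if your teacher objects to the module-style reference, you can also call it on a bound string, e.g. ' '.maketrans()
str.maketrans() is really a throwback to the days when we only had the 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory; regexes on Unicode are easier, because you can define entire character classes.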
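
For the large-file case mentioned above, one possible approach is to process the input line by line instead of calling read() on the whole file. This is only a sketch: the function name unique_words_from_lines and the demo_lines data are hypothetical, and the set of unique words still has to fit in memory.

```python
def unique_words_from_lines(lines, bad="[],.\n"):
    """Collect unique words from an iterable of lines, one line at a time."""
    table = str.maketrans(bad, ' ' * len(bad))
    unique = set()
    for line in lines:
        # Only the current line is cleaned and split at any moment.
        unique.update(line.translate(table).split())
    # Longest first; ties broken alphabetically so the order is deterministic.
    return sorted(unique, key=lambda w: (-len(w), w))


# A list of strings stands in for an open file here; with a real file you
# would pass the file object itself, e.g. inside `with open("doc.txt") as reader:`.
demo_lines = ["alpha, beta.\n", "gamma [1] beta\n"]
print(unique_words_from_lines(demo_lines))
```

A file object yields its lines lazily when iterated, so passing it in place of demo_lines avoids holding the whole text in memory at once.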

Alternative solution if you do not yet know `str.translate()`:

bad = "[],.\n"  # the same unwanted characters as in the solution above
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', you have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')

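And if you want to be ridiculously minimalist, you can write the entire output file in one line with a list comprehension. Do not do this: it would be terrible to debug, e.g. if you cannot open/write/overwrite the output file, or you get an IOError, or unique_words is not a list, etc.: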

open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
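
As a contrast to the one-liner, here is a rough sketch of a version that is easier to debug. The helper name write_words and the path used in the demo are hypothetical; the only library behaviour relied on is that open() raises OSError (of which IOError is an alias in Python 3) when the file cannot be opened.

```python
def write_words(words, path):
    """Write one word per line; return True on success, False on an OS error."""
    try:
        with open(path, "w") as writer:
            writer.writelines(f"{word}\n" for word in words)
    except OSError as err:  # IOError is an alias of OSError in Python 3
        print(f"could not write {path}: {err}")
        return False
    return True


# Writing into a directory that does not exist fails cleanly instead of crashing.
ok = write_words(["example"], "no_such_directory/unique.txt")
print(ok)
```

The with statement still closes the file if writing fails partway through, and the return value lets the caller decide what to do next.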

Answer 2

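Here is a solution. The trick is to use the Python str method .isalpha() to filter out non-alphabetic characters. (Note that this also drops digits, so tokens such as "50" or "[10]" disappear entirely.)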

with open("unique.txt", "w") as writer:
    with open("doc.txt") as reader:
        cleaned_words = []
        for line in reader.readlines():
            for word in line.split():
                cleaned_word = ''.join([c for c in word if c.isalpha()])
                if len(cleaned_word):
                    cleaned_words.append(cleaned_word)

        # print unique words
        unique_words = set(cleaned_words)
        print(unique_words)

        # write words to file? depends what you need here
        for word in unique_words:
            writer.write(str(word))
            writer.write('\n')

        # print length of longest
        print(len(sorted(unique_words, key=len, reverse=True)[0]))
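
To make the effect of the character filter concrete, here is a small sketch comparing str.isalpha with str.isalnum on a few tokens. The helper clean_token and the tokens list are made up for illustration.

```python
def clean_token(token, keep=str.isalpha):
    """Keep only the characters of `token` for which `keep` returns True."""
    return ''.join(c for c in token if keep(c))


tokens = ["years,", "[7][8][9],", "70%", "UK."]
letters_only = [clean_token(t) for t in tokens]
letters_and_digits = [clean_token(t, str.isalnum) for t in tokens]
print(letters_only)
print(letters_and_digits)
```

With isalnum the token "[7][8][9]," collapses into the single word "789", because characters are deleted rather than replaced by spaces. Whether that is acceptable depends on what should count as a word.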

Answer 3

Here is another solution, without any special functions.

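# Read the whole file; `a` was left undefined in the original snippet.
with open("doc.txt") as reader:
    a = reader.read()

# Punctuation to replace with spaces (the backslash is escaped explicitly).
bad = '`~@#$%^&*()-_=+[]{}\\|;\':".>?<,/'

clean = ''
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '

# split() with no argument splits on any whitespace, including newlines,
# and never produces empty strings.
cleans = clean.split()

clean_uniq = list(set(cleans))

clean_uniq.sort(key=len)

print(clean_uniq)
print(len(clean_uniq[-1]))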
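
The question also asks for the longest words themselves, and several words can share the maximum length. A minimal sketch of that step, assuming the words have already been cleaned (the function name longest_words is hypothetical):

```python
def longest_words(words):
    """Return (max_length, sorted list of all unique words with that length)."""
    unique = set(words)
    if not unique:
        # Indexing [0] or [-1] on an empty list would raise IndexError.
        return 0, []
    max_length = max(len(w) for w in unique)
    return max_length, sorted(w for w in unique if len(w) == max_length)


result = longest_words(["pear", "fig", "plum", "pear"])
print(result)
```

Using max() with a generator avoids sorting the whole list when only the longest words are needed.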


Answer 1
2 (accepted) 2020-05-12 18:29:31

Answer 2
1 2020-05-12 17:38:47

Answer 3
1 2020-05-12 17:40:51
