
How to remove all lines with caps AND digits AND special characters AND all the lines longer than 10 characters from a text file in Python

I have a text file with all existing words in the Dutch language, and I need only the words with a specific number of characters, without any digits, special characters, or capitals. I tried to do it by hand (which works), but it's about 400 thousand words :) So I wanted to use Python. I'm very new to Python and I can't find a good solution. With my code (which is far from optimal) I get results, but they are not good enough. Some words seem to be split halfway and concatenated, and in some lines two words are not put on separate lines (to name a few things that I don't want).

My question: is there simple code that can remove words longer than 10 characters, remove all words starting with or containing a capital, and remove all words with special characters? Thank you all in advance.

My code:

import re

input_file = open("basiswoorden-gekeurd.txt", "r+")
output_file = open("word_crumble_wordlist.txt", "w")
filetext = input_file.read()
res_caps = re.sub(r"\s*[A-Z]\w*\s*", " ", filetext).strip()
res_dig = re.sub(r"\s*\d\w*\s*", "", res_caps).strip()
res = re.sub(r"[^a-zA-Z0-9\n\.]\w*\s*", "", res_dig).strip()

for line in res:
    if len(line) < 10:
        output_file.write(line)

Original part of the word list: it contains numbers and special characters.

Resulting part: it looks OK, but the word "aaaaagje" seems to be a combination of other words :) How?

Also: in the original, "aanbevolencomité" and "aanbevolen" are two separate words on two separate lines, but the result contains "aanbevolencomitaanbevolen".

In this case it might be easier to find the matching words rather than delete the unwanted ones. Consider the following example. Let the content of file.txt be

Capital
okay
thisistoolong
okaytoo
d.o.t.s

then

import re
with open("file.txt","r") as f:
    text = f.read()
for i in re.findall(r'^[a-z]{1,10}$',text,re.MULTILINE):
    print(i)

gives output

okay
okaytoo

Explanation: I use MULTILINE mode so that ^ and $ match the start and end of each line, then I find the lines that consist entirely of 1 to 10 lowercase ASCII letters.
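The same idea can also be applied line by line and written straight to the output file from the question. The following is only a minimal sketch: the helper names filter_words and filter_file are hypothetical, the UTF-8 encoding is an assumption about the word list, and stripping surrounding whitespace from each line is an extra step that the example above does not perform.

```python
import re

# Same character class and length limit as the pattern above; fullmatch()
# makes the ^ and $ anchors unnecessary.
WORD_PATTERN = re.compile(r"[a-z]{1,10}")


def filter_words(lines):
    """Return the lines that consist solely of 1 to 10 lowercase ASCII letters."""
    kept = []
    for line in lines:
        word = line.strip()  # drop the trailing newline and surrounding spaces
        if WORD_PATTERN.fullmatch(word):
            kept.append(word)
    return kept


def filter_file(src_path, dst_path, encoding="utf-8"):
    """Copy the matching words from src_path to dst_path, one word per line."""
    # encoding="utf-8" is an assumption; adjust it to the actual file.
    with open(src_path, encoding=encoding) as src:
        words = filter_words(src)
    with open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(word + "\n" for word in words)
    return len(words)
```

With the file names from the question, a call would look like filter_file("basiswoorden-gekeurd.txt", "word_crumble_wordlist.txt"). Because every line is either kept whole or dropped, neighbouring words cannot be glued together. In the original code the `\s*` at the end of the patterns also consumed the newline after a removed fragment, which is how "aanbevolencomité" and "aanbevolen" became "aanbevolencomitaanbevolen". In addition, `for line in res` iterates over the individual characters of the string, not over lines, so the length check never rejected anything.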
