Python: creating a filtered list from a text document

Question

Whenever I try to run this program, Python IDLE tells me it is not responding and has to close. Any suggestions on how to improve this code so that it works the way I want?

#open text document
#filter out words in the document by appending to an empty list
#get rid of words that show up more than once
#get rid of words that aren't all lowercase
#get rid of words that end in substring 'xx'
#get rid of words that are less than 5 characters
#print list

fin = open('example.txt')
L = []
for word in fin:
    if len(word) >= 5:
        L.append(word)
    if word != word:
        L.append(word)
    if word[-2:-1] != 'xx':
        L.append(word)
    if word == word.lower():
        L.append(word)
print L

Answer 1

Some general help:

Instead of

fin = open('example.txt')

you should use

with open('example.txt', 'r') as fin:

and then indent the rest of the code under it, although your version will work.
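As a small illustration of why the with form is preferred, the sketch below shows that the file is closed as soon as the block ends. It uses Python 3 syntax and a throwaway temporary file in place of example.txt; the file contents are made up for the example.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("hello\nworld\n")
    path = tmp.name

with open(path, "r") as fin:
    lines = fin.readlines()

# The file is closed automatically when the with block ends,
# even if an exception had been raised inside it.
print(fin.closed)
print(lines)

os.remove(path)  # clean up the temporary file
```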

L = []
for word in fin:

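This does not iterate word by word; it iterates line by line. Even if each line contains only one word, every line still ends with a newline character, so you should use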

word = word.rstrip()

to strip the whitespace from the end of the word. If you really do want to process one word at a time, you need two for loops, for example:

for line in fin:
    for word in line.split():

and then put your logic in the inner loop.
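To see the difference between iterating over lines and iterating over words, here is a minimal sketch in Python 3 syntax. It uses io.StringIO as a stand-in for the opened file, so the sample text is an assumption rather than the asker's data.

```python
import io

# Stand-in for open('example.txt'); the contents are made up for illustration.
fin = io.StringIO("alpha beta\ngamma\n")

# Iterating over a file object yields whole lines, newline included.
lines = [line for line in fin]

fin.seek(0)  # rewind so the same data can be read a second time

# Splitting each line on whitespace yields individual words with no newline.
words = [word for line in fin for word in line.split()]

print(lines)
print(words)
```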

if len(word) >= 5:
    L.append(word)

Once the whitespace has been stripped, this adds any word that is five letters or longer to the list.

if word != word:
    L.append(word)

word will always be equal to word, so this does nothing. If you want to eliminate duplicates, make L a set() and add to it with L.add(word) instead of L.append(word) (assuming order does not matter).
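A short sketch of the set approach, in Python 3 syntax with a made-up word list. It also shows dict.fromkeys as one possible alternative for the case where order does matter; that alternative is an addition here, not part of the original advice.

```python
words = ["apple", "pear", "apple", "plum", "pear"]

# A set keeps one copy of each word but does not preserve order.
unique = set()
for word in words:
    unique.add(word)

# dict.fromkeys also drops duplicates and keeps first-seen order
# (dicts preserve insertion order in Python 3.7+).
unique_in_order = list(dict.fromkeys(words))

print(sorted(unique))
print(unique_in_order)
```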

if word[-2:-1] != 'xx':
    L.append(word)

If you want to check whether it ends with 'xx', use

if not word.endswith('xx'):

or, alternatively, use word[-2:] without the -1; otherwise you are comparing against only the second-to-last letter rather than the whole ending.
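The slicing difference is easy to check interactively. The following sketch (Python 3 syntax, with an arbitrary sample word) shows what each expression evaluates to.

```python
word = "approxx"

second_to_last = word[-2:-1]   # a single character: the second-to-last one
last_two = word[-2:]           # the last two characters
ends_with_xx = word.endswith("xx")

# A one-character slice can never equal 'xx',
# so the original test is true for every word.
always_true = word[-2:-1] != "xx"

print(second_to_last, last_two, ends_with_xx, always_true)
```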

if word == word.lower():
    L.append(word)

If the word is all lowercase, this adds it to the list.

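Keep in mind that all of these if tests are applied to every word, so the word is added to the list once for each test it passes. If you only want to add it once, you could use elif instead of if for every test except the first. Either way, a word is added as soon as it passes any one test, not only when it passes all of them.

Your comments also imply that you are "getting rid of" words by adding them to the list; you are not. You keep the words you add to the list, and the rest are discarded. You are not changing the file in any way.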
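One way to express "keep a word only if it passes every test" is a single condition joined with and. The sketch below is written in Python 3 syntax under a few assumptions: the helper names keep and filtered_words are hypothetical, io.StringIO stands in for the opened file, and duplicates are reduced to one copy in first-seen order.

```python
import io

def keep(word):
    """Return True if the word passes all three per-word tests."""
    return (
        len(word) >= 5                 # at least 5 characters
        and word == word.lower()       # no uppercase letters
        and not word.endswith("xx")    # does not end in 'xx'
    )

def filtered_words(lines):
    """Collect each qualifying word once, in first-seen order."""
    result = []
    seen = set()
    for line in lines:
        for word in line.split():
            if keep(word) and word not in seen:
                seen.add(word)
                result.append(word)
    return result

# Stand-in for open('example.txt'); the contents are made up.
sample = io.StringIO("bubble Spain moon\napproxx bubble orbital\n")
print(filtered_words(sample))
```

Here the words that survive are the ones the comments in the question describe; everything else is simply never appended.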

Answer 2

import re

def by_words(it):
    pat = re.compile(r'\w+')
    for line in it:
        for word in pat.findall(line):
            yield word

def keepers(it):
    words = set()
    for s in it:
        if len(s) >= 5 and s == s.lower() and not s.endswith('xx'):
            words.add(s)
    return list(words)

Getting 5 words from War and Peace:

from urllib import urlopen
source = urlopen('http://www.gutenberg.org/ebooks/2600.txt.utf8')
print keepers(by_words(source))[:5]

prints

['raining', 'divinely', 'hordes', 'nunnery', 'parallelogram']

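This does not take up much memory. War and Peace has only 14361 words that meet your criteria. The iterators only work on very small chunks at a time.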
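The point about iterators working on small chunks can be illustrated with a sketch in Python 3 syntax. The generator below mirrors by_words; numbered_lines is a hypothetical line source that records which lines were actually read, so you can see that only as much input is consumed as the caller asks for.

```python
import itertools
import re

WORD = re.compile(r"\w+")

def by_words(lines):
    """Yield words one at a time from an iterable of lines."""
    for line in lines:
        for word in WORD.findall(line):
            yield word

consumed = []

def numbered_lines():
    """Hypothetical line source that records which lines were read."""
    for i, line in enumerate(["first line here", "second line", "third line"]):
        consumed.append(i)
        yield line

# Ask for only the first four words; the third line is never read.
first_four = list(itertools.islice(by_words(numbered_lines()), 4))
print(first_four)
print(consumed)
```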

Answer 3

I did your homework for you; I was bored. There may be a bug.

homework_a_plus = []
#open text document
with open('example.txt', 'r') as fin:
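    for word in fin:
        word = word.strip()  # drop the trailing newline before testing
        #get rid of words that show up more than once
        if word in homework_a_plus:
            continue
        #get rid of words that aren't all lowercase
        #(a continue inside an inner for loop would not skip the word)
        if word != word.lower():
            continue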
        #get rid of words that end in substring 'xx'
        if word[-2:] == 'xx':
            continue
        #get rid of words that are less than 5 characters
        if len(word) < 5:
            continue
        homework_a_plus.append(word)
print homework_a_plus

Edit: as Wooble said, the logic in the code you provided is way off. Compare your code with mine and I think you will understand why yours has problems.

Answer 4

words = [inner for outer in [line.split() for line in open('example.txt')] for inner in outer]

for word in words[:]:
    if words.count(word) > 1 or word.lower() != word or word[-2:] == 'xx' or len(word) < 5:
        words.remove(word)
print words

Answer 5

If you want to write it more as a filter... I would take a slightly different approach.

fin = open('example.txt','r')
seenList = []
for line in fin:
    for word in line.split():
        if word in seenList: continue
        if word[-2:] == 'xx': continue
        if word.lower() != word: continue
        if len(word) < 5: continue
        seenList.append(word)
        print word

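A side benefit is that it shows you the output as each line is processed. If you want to output to a file, modify the print word line appropriately, or use shell redirection.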

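Edit: if you really do not want to print any word that is repeated (the code above just skips every instance after the first), then something like this would work...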

fin = open('example.txt','r')
seenList = []
for line in fin:
    for word in line.split():
        if word in seenList: 
            seenList.remove(word)
            continue
        if word[-2:] == 'xx': continue
        if word.lower() != word: continue
        if len(word) < 5: continue
        seenList.append(word)

print seenList
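Note that removing a word when it is seen a second time means a third occurrence passes the checks and is added back. A sketch that counts occurrences first avoids this. It uses collections.Counter from the standard library and Python 3 syntax; the function name unique_filtered and the sample text are illustrative assumptions.

```python
import io
from collections import Counter

def unique_filtered(lines):
    """Return qualifying words that occur exactly once, in first-seen order."""
    words = [word for line in lines for word in line.split()]
    counts = Counter(words)  # word -> number of occurrences
    return [
        word for word in words
        if counts[word] == 1
        and len(word) >= 5
        and word == word.lower()
        and not word.endswith("xx")
    ]

# Stand-in for open('example.txt'); 'bubble' appears three times.
sample = io.StringIO("bubble chime bubble\norbital bubble glomoxx\n")
print(unique_filtered(sample))
```

The trade-off is that the whole word list is held in memory before filtering, which is usually fine for a single text file.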

Answer 6

It is simple with a regular expression:

import re

li = ['bubble', 'iridescent', 'approxx', 'chime',
      'Azerbaidjan', 'moon', 'astronomer', 'glue', 'bird',
      'plan_ary', 'suxx', 'moon', 'iridescent', 'magnitude',
      'Spain', 'through', 'macGregor', 'iridescent', 'ben',
      'glomoxx', 'iridescent', 'orbital']

reg1 = re.compile(r'(?!\S*?[A-Z_]\S*(?=\Z))'
                  r'\w{5,}'
                  r'(?<!xx)\Z')

print set(filter(reg1.match,li))

# result:

set(['orbital', 'astronomer', 'magnitude', 'through', 'iridescent', 'chime', 'bubble'])

If the data is not in a list but in a string:

ss = '''bubble iridescent approxx chime
Azerbaidjan moon astronomer glue bird
plan_ary suxx moon iridescent magnitude
Spain through macGregor iridescent ben
glomoxx iridescent orbital'''

print set(filter(reg1.match,ss.split()))

or

reg2 = re.compile(r'(?:(?<=\s)|(?<=\A))'
                  r'(?!\S*?[A-Z_]\S*(?=\s|\Z))'
                  r'\w{5,}'
                  r'(?<!xx)'
                  r'(?=\s|\Z)')

print set(reg2.findall(ss))
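For comparison, here is a shorter pattern that should accept the same sample words, written in Python 3 syntax. It is a sketch under one stated assumption: the words are plain ASCII, so "no uppercase letters and no underscore" can be spelled as the character class [a-z0-9]. The name simple is hypothetical.

```python
import re

# Lowercase letters and digits only, at least 5 of them,
# with a lookbehind rejecting words whose last two characters are 'xx'.
simple = re.compile(r"[a-z0-9]{5,}(?<!xx)")

li = ['bubble', 'iridescent', 'approxx', 'chime',
      'Azerbaidjan', 'moon', 'astronomer', 'glue', 'bird',
      'plan_ary', 'suxx', 'moon', 'iridescent', 'magnitude',
      'Spain', 'through', 'macGregor', 'iridescent', 'ben',
      'glomoxx', 'iridescent', 'orbital']

# fullmatch requires the pattern to cover the entire word.
kept = {word for word in li if simple.fullmatch(word)}
print(sorted(kept))
```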

6 answers

Answer 1
4 2011-09-30 18:01:16

Answer 2
2 2011-09-30 18:34:02

Answer 3
0 2011-09-30 17:59:16

Answer 4
0 2011-09-30 18:04:34

Answer 5
0 2011-09-30 18:05:46

Answer 6
0 2011-10-01 11:20:09
