Python - Unable to split lines from a txt file into words

My goal is to open a file, split it into unique words, and display that list (along with a count). I think I have to split the file into lines, then split those lines into words, and add everything to a list.

The problem is that my program either runs in an infinite loop and never displays any results, or it only reads a single line and then stops. The file being read is the Gettysburg Address.

def uniquify( splitz, uniqueWords, lineNum ):
    for word in splitz:
        word = word.lower()
        if word not in uniqueWords:
            uniqueWords.append( word )

def conjunctionFunction():

    uniqueWords = []

    with open(r'C:\Users\Alex\Desktop\Address.txt') as f :
        getty = [line.rstrip('\n') for line in f]
    lineNum = 0
    lines = getty[lineNum]
    getty.append("\n")
    while lineNum < 20 :
        splitz = lines.split()
        lineNum += 1

        uniquify( splitz, uniqueWords, lineNum )
    print( uniqueWords )


conjunctionFunction()

With your current code, the line:

lines = getty[lineNum]

should be moved inside the while loop.
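Here is a minimal sketch of what that change looks like, assuming the rest of conjunctionFunction stays the same (the hard-coded 20 is also replaced with len(getty) so the loop stops at the end of the file):

lineNum = 0
while lineNum < len(getty):
    lines = getty[lineNum]   # fetch the current line inside the loop
    splitz = lines.split()
    lineNum += 1
    uniquify( splitz, uniqueWords, lineNum )
print( uniqueWords )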

You have found the problem with your code, but even so, I would do this slightly differently. Since you need to keep track of the unique words and their counts, you should use a dictionary for this task:

wordHash = {}

with open(r'C:\Users\Alex\Desktop\Address.txt', 'r') as f:
    for line in f:
        line = line.rstrip().lower()

        # iterate over the words on the line, not the characters
        for word in line.split():
            if word not in wordHash:
                wordHash[word] = 1
            else:
                wordHash[word] += 1

print(wordHash)

def splitData(filename):
    return [word for word in open(filename).read().split()]

The simplest way to split a file into words :)
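As a rough usage sketch (the path below is just a placeholder for wherever Address.txt lives), assuming splitData is defined as above:

words = splitData(r'C:\Users\Alex\Desktop\Address.txt')
print(len(words), "total words,", len(set(words)), "unique")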

Assume inp was retrieved from the file

inp = """Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense."""


data = inp.splitlines()

print(data)

_d = {}

for line in data:
    word_lst = line.split()
    for word in word_lst:
        if word in _d:
            _d[word] += 1
        else:
            _d[word] = 1

print(list(_d.keys()))

Output

['Beautiful', 'Flat', 'Simple', 'is', 'dense.', 'Explicit', 'better', 'nested.', 'Complex', 'ugly.', 'Sparse', 'implicit.', 'complex.', 'than', 'complicated.']

I suggest:

#!/usr/local/cpython-3.3/bin/python

import pprint
import collections

def genwords(file_):
    for line in file_:
        for word in line.split():
            yield word

def main():
    with open('gettysburg.txt', 'r') as file_:
        result = collections.Counter(genwords(file_))

    pprint.pprint(result)

main()

...but you could handle punctuation better by using re.findall instead of string.split.
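For example, here is a rough sketch of that idea, assuming you want lowercase words with punctuation stripped (the regex is just one possible choice):

import re
import collections

def genwords(file_):
    for line in file_:
        # findall() keeps only runs of letters/apostrophes, dropping '.', ',' etc.
        for word in re.findall(r"[a-z']+", line.lower()):
            yield word

with open('gettysburg.txt', 'r') as file_:
    result = collections.Counter(genwords(file_))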
