Special End-line characters/string from lines read from text file, using Python
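I need to read lines from a text file, but the "end of line" marker is not always \n or \x or a combination of them. It may be any combination of characters, such as 'xyz' or '|'. The "end of line" is, however, always the same within a file and known for each type of file.
Since the text file may be large, I have to keep performance and memory usage in mind. What would be the best solution? Today I use a combination of string.read(1000) and split(myendofline) or partition(myendofline), but I would like to know whether a more elegant and more standard solution exists.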
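Obviously, the simplest approach is to read the entire file and then call .split('|').
If that is undesirable because it requires reading the whole file into memory, you can read arbitrary chunks and split those instead. You could write a class that grabs another chunk whenever the current one runs out, so that the rest of your application does not need to know about it.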
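For files that comfortably fit in memory, the simple approach mentioned first can be sketched as follows. This is only an illustration for Python 3: the helper name `read_records` and the sample file are assumptions made for the demo, not part of the original answer.

```python
import os
import tempfile

def read_records(path, terminator):
    """Read the whole file into memory and split it on `terminator`."""
    # newline="" disables newline translation, so real \n characters are kept as-is
    with open(path, "r", newline="") as fh:
        return fh.read().split(terminator)

# Demo on a throwaway file that uses '|' as its record terminator
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", newline="") as fh:
        fh.write("first|second||fourth")
    records = read_records(path, "|")
    print(records)  # ['first', 'second', '', 'fourth']
```

Two consecutive terminators produce an empty record, exactly as str.split does.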
Here is the input, zen.txt:
The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!
Here is my little test case, which works for me. It does not handle many corner cases and is not particularly pretty, but it should get you started.
class SpecialDelimiters(object):
    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''
        self.lines = []
        self.done = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            # The lines list is empty, so let's read some more!
            while True:
                # Looping so even if our chunksize is smaller than one line we get at least one chunk
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    # we want to keep going until we have at least one block
                    # or reached the end of the file
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                # The end of the road, return last remaining stuff
                self.done = True
                return self.chunk

# text mode, so that the str terminator matches what read() returns
with open('zen.txt', 'r', newline='') as zenfh:
    zenBreaker = SpecialDelimiters(zenfh, '|')
    for line in zenBreaker:
        print(line)
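The same chunk-and-split idea can also be written as a generator function, which avoids the explicit iterator state. The following is a minimal Python 3 sketch, not the answerer's code: the name `iter_records` and the in-memory sample are assumptions for illustration.

```python
import io

def iter_records(fh, terminator, chunksize=65536):
    """Yield records from `fh`, split on `terminator`, reading `chunksize` characters at a time."""
    buffer = ""
    while True:
        chunk = fh.read(chunksize)
        if not chunk:
            break
        buffer += chunk
        pieces = buffer.split(terminator)
        # Every piece except the last one is a complete record
        for piece in pieces[:-1]:
            yield piece
        # The last piece may be incomplete, so keep it for the next round
        buffer = pieces[-1]
    # Whatever is left at end of file is the final record
    yield buffer

sample = io.StringIO(
    "Beautiful is better than ugly.|Explicit is better than implicit.|Readability counts."
)
records = list(iter_records(sample, "|", chunksize=10))
print(records)
```

Because the unfinished piece is carried into the next read, a terminator longer than one character that happens to be cut by a chunk boundary is still recognised.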
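Here is a generator function that acts as an iterator over a file, cutting the lines according to an exotic newline that is the same throughout the file.
It reads the file in chunks of lenchunk characters and yields the lines found in each chunk, chunk after chunk.
Since the newline in my example is 3 characters long (':;:'), a chunk may end in the middle of a newline. The generator function takes care of this possibility and still yields the correct lines.
If the newline were a single character, the function could be simplified. I wrote it only for the most delicate case.
Using this function, a file can be read one line at a time without reading the entire file into memory.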
from random import randrange, choice

# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 40)))
                for i in range(50))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)

# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines
def liner(filename, eol, lenchunk, nl=0):
    # nl = 0 or 1 acts as 0 or 1 in splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename, 'r', newline='') as f:
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            last = chunk.rfind(eol)
            if last == -1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]    # up to and including the last complete eol
                newtail = chunk[last+L:]  # whatever follows that eol
            chunk = tail + kept
            x = 0
            while True:
                y = chunk.find(eol, x)
                if y == -1:
                    break
                yield chunk[x:y+NL]
                x = y + L
            # carry over whatever was not yielded (e.g. a chunk without any eol)
            tail = chunk[x:] + newtail
            chunk = f.read(lenchunk)
        yield tail

# chunk size of 100 characters, chosen arbitrarily for this small file
for line in liner('fofo.txt', ':;:', 100):
    print(line)
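The "cut newline" situation described above is easy to reproduce on a tiny input. The sketch below is a simplified stand-in written for Python 3, not the liner() function itself. It reads a short string in chunks of 4 characters, so that every chunk boundary falls inside a ':;:' delimiter, and compares splitting each chunk on its own with carrying the unfinished tail over to the next chunk. The names `read_chunks`, `naive_split` and `carry_split` are assumptions for the demo.

```python
import io

EOL = ":;:"
DATA = "ab:;:cd:;:ef"

def read_chunks(text, size):
    """Return the successive chunks that fh.read(size) would deliver."""
    fh = io.StringIO(text)
    return list(iter(lambda: fh.read(size), ""))

def naive_split(chunks, eol):
    """Split every chunk independently: wrong when a delimiter is cut in two."""
    out = []
    for chunk in chunks:
        out.extend(chunk.split(eol))
    return out

def carry_split(chunks, eol):
    """Prepend the unfinished tail of the previous chunk before splitting."""
    out, tail = [], ""
    for chunk in chunks:
        pieces = (tail + chunk).split(eol)
        out.extend(pieces[:-1])
        tail = pieces[-1]
    out.append(tail)
    return out

chunks = read_chunks(DATA, 4)
print(chunks)                    # ['ab:;', ':cd:', ';:ef']
print(naive_split(chunks, EOL))  # ['ab:;', ':cd:', ';:ef']
print(carry_split(chunks, EOL))  # ['ab', 'cd', 'ef']
```

With this chunk size no chunk contains a complete delimiter, so the naive version never splits anything, while the tail-carrying version recovers the three lines.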
Here is the same code, with prints here and there that make it possible to follow the algorithm.
from random import randrange, choice

# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 40)))
                for i in range(50))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)

# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines
def liner(filename, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename, 'r', newline='') as f:
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0, 0)
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            # flag the chunks that end in the middle of a :;: newline
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                      '\nchunk== '+chunk+\
                      '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                      '\n---------------------------------------------------']
            last = chunk.rfind(eol)
            if last == -1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]    # up to and including the last complete eol
                newtail = chunk[last+L:]  # whatever follows that eol
            wr.append('\nkept== '+kept+\
                      '\n---------------------------------------------------'+\
                      '\nnewtail== '+newtail)
            chunk = tail + kept
            wr.append('\n---------------------------------------------------'+\
                      '\ntail + kept== '+chunk+\
                      '\n---------------------------------------------------')
            print(''.join(wr))
            x = 0
            while True:
                y = chunk.find(eol, x)
                if y == -1:
                    break
                yield chunk[x:y+NL]
                x = y + L
            # carry over whatever was not yielded (e.g. a chunk without any eol)
            tail = chunk[x:] + newtail
            print('\n\n===================================================')
            chunk = f.read(lenchunk)
        yield tail
        print(the_end)

# chunk size of 100 characters (arbitrary); nl=1 keeps the :;: at the end of each line
for line in liner('fofo.txt', ':;:', 100, 1):
    print('line== ' + line)
EDIT
I compared the execution times of my code and of chmullig's code.
With a 'fofo.txt' file of about 10 MB, created with
from random import randrange, choice

alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in range(randrange(0, 60)))
                for i in range(324000))
with open('fofo.txt', 'w', newline='') as g:
    g.write(ch)
and measuring the times like this:
from timeit import default_timer as clock  # portable high-resolution timer

te = clock()
for line in liner('fofo.txt', ':;:', 65536):
    pass
print(clock() - te)

fh = open('fofo.txt', 'r', newline='')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)
te = clock()
for line in zenBreaker:
    pass
print(clock() - te)
I obtained the following minimum times over several runs:
my code: 0.7067 seconds
chmullig's code: 0.8373 seconds
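Taking the minimum over several runs can be automated with the standard timeit module. The following Python 3 sketch only shows the measuring pattern: the `workload` function is a hypothetical stand-in (a plain str.split on in-memory data) for iterating over liner(...) or SpecialDelimiters(...), and the numbers it prints have nothing to do with the figures reported above.

```python
import timeit

# In-memory data using the same :;: terminator as the examples
text = ":;:".join("line %d" % i for i in range(10000))

def workload():
    # Stand-in for "iterate over every line of the file"
    return len(text.split(":;:"))

# Run the workload 5 times, once per measurement, and keep the best time
timings = timeit.repeat(workload, number=1, repeat=5)
best = min(timings)
print("best of %d runs: %.6f s" % (len(timings), best))
```

The minimum is usually the most stable figure, since the slower runs mostly reflect interference from other processes rather than the code being measured.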
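EDIT 2
I changed my generator function: liner2() takes a file handle instead of a file name. This way the opening of the file is left out of the measured time, as it already is in the measurement of chmullig's code.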
def liner2(fh, eol, lenchunk, nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    chunk = fh.read(lenchunk)
    tail = ''
    while chunk:
        last = chunk.rfind(eol)
        if last == -1:
            kept = chunk
            newtail = ''
        else:
            kept = chunk[0:last+L]    # up to and including the last complete eol
            newtail = chunk[last+L:]  # whatever follows that eol
        chunk = tail + kept
        x = 0
        while True:
            y = chunk.find(eol, x)
            if y == -1:
                break
            yield chunk[x:y+NL]
            x = y + L
        # carry over whatever was not yielded (e.g. a chunk without any eol)
        tail = chunk[x:] + newtail
        chunk = fh.read(lenchunk)
    yield tail

fh = open('fofo.txt', 'r', newline='')
te = clock()
for line in liner2(fh, ':;:', 65536):
    pass
print(clock() - te)
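The results, after numerous runs to find the minimum times, are:
with liner(): 0.7067 seconds
with liner2(): 0.7064 seconds
chmullig's code: 0.8373 seconds
In fact, opening the file accounts for only a tiny part of the total time.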
Given your constraints, it may be best to first convert the known unusual newline to a normal newline, and then use the usual:
for line in file:
    ...
TextFileData.split(EndOfLine_char)
seems to be your solution. If it does not run fast enough, you should consider programming at a lower level.