Special End-line characters/string from lines read from text file, using Python

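I need to read lines from a text file in which the "end of line" marker is not always \n or \x or a combination of them. It may be any combination of characters, such as 'xyz' or '|', but the "end of line" marker is always the same, and known, for each type of file.

Since the text file may be large, I have to keep performance and memory usage in mind. What seems to be the best solution? Today I use a combination of string.read(1000) and split(myendofline) or partition(myendofline), but I would like to know whether a more elegant and more standard solution exists.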

Obviously the easiest way is to read the whole file first and then call .split('|') on it.
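
A minimal sketch of that simplest approach, written for Python 3 (the answers on this page use Python 2). The helper name `read_records` and its parameters are made up for illustration:

```python
def read_records(path, terminator, encoding='utf-8'):
    """Return every record in the file at `path` as a list of strings.

    The whole file is loaded into memory at once, so this only suits
    files that comfortably fit in RAM.
    """
    # newline='' turns off newline translation, so any '\r' or '\n'
    # inside the data is passed through untouched.
    with open(path, 'r', encoding=encoding, newline='') as fh:
        return fh.read().split(terminator)
```

Calling `read_records('zen.txt', '|')` would return the list of all records in one go; memory use grows with the size of the file.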

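If that is undesirable because it requires reading the whole file into memory, you can read arbitrary chunks and split those instead. You can write a class that fetches another chunk whenever the current one runs out, so that the rest of your application does not need to know about it.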

Here is the input, zen.txt:

The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!

Here is my little test case, which works for me. It does not handle many corner cases and is not particularly pretty, but it should get you started.

class SpecialDelimiters(object):
    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''
        self.lines = []
        self.done = False

    def __iter__(self):
        return self

    def next(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            #The lines list is empty, so let's read some more!
            while True:
                #Looping so even if our chunksize is smaller than one line we get at least one chunk
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    #we want to keep going until we have at least one block
                    #or reached the end of the file
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                #The end of the road, return last remaining stuff
                self.done = True
                return self.chunk               

zenfh = open('zen.txt', 'rb')
zenBreaker = SpecialDelimiters(zenfh, '|')
for line in zenBreaker:
    print line  
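
The same chunk-and-split idea can also be written as a generator. The following is only a sketch for Python 3, not the answerer's code; the name `split_records` and the in-memory demo input are assumptions made for illustration:

```python
import io


def split_records(fh, terminator, chunksize=65536):
    """Yield the records of the text file object `fh`, one at a time."""
    buf = ''
    while True:
        chunk = fh.read(chunksize)
        if not chunk:
            break
        buf += chunk
        pieces = buf.split(terminator)
        # The last piece may be an incomplete record, or may end with the
        # start of a terminator cut by the chunk edge, so carry it over.
        buf = pieces.pop()
        for piece in pieces:
            yield piece
    # Whatever follows the final terminator is the last record.
    yield buf


demo = io.StringIO('Beautiful is better than ugly.|'
                   'Explicit is better than implicit.|'
                   'Readability counts.')
for record in split_records(demo, '|', chunksize=10):
    print(record)
```

Like the class above, it yields a trailing empty string when the input ends with the terminator.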

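Here is a generator function that acts as an iterator over a file, cutting the lines according to an exotic newline that is the same throughout the file.

It reads the file in chunks of lenchunk characters and yields the lines found in each chunk, chunk after chunk.

Since the newline in my example is 3 characters long (':;:'), a chunk may end in the middle of a newline. The generator function takes care of this possibility and still yields the correct lines.

If the newline were only one character, the function could be simplified. I wrote the function only for the most delicate case.

Using this function lets you read a file one line at a time, without reading the entire file into memory.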

from random import randrange, choice


# this part is to create an example file with newline being :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines

def liner(filename,eol,lenchunk,nl=0):
    # nl = 0 or 1 acts as 0 or 1 in splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            last = chunk.rfind(eol)
            if last==-1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]   # here: L
                newtail = chunk[last+L:] # here: L
            chunk = tail + kept
            tail = newtail
            x = y = 0
            while y+1:
                y = chunk.find(eol,x)
                if y+1: yield chunk[x:y+NL] # here: NL
                else: break
                x = y+L # here: L
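            # carry over the text not yet terminated by eol, so that a
            # line longer than lenchunk is not dropped
            tail = chunk[x:] + tail
            chunk = f.read(lenchunk)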
        yield tail



# 100 is an arbitrary chunk size for this demo
for line in liner('fofo.txt',':;:',100):
    print line

Here is the same code, with prints here and there to let you follow the algorithm.

from random import randrange, choice


# this part is to create an example file with newline being :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines

def liner(filename,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0,0)
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                     '\nchunk== '+chunk+\
                     '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                     '\n---------------------------------------------------']
            last = chunk.rfind(eol)
            if last==-1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]   # here: L
                newtail = chunk[last+L:] # here: L
            wr.append('\nkept== '+kept+\
                      '\n---------------------------------------------------'+\
                      '\nnewtail== '+newtail)
            chunk = tail + kept
            tail = newtail
            wr.append('\n---------------------------------------------------'+\
                      '\ntail + kept== '+chunk+\
                      '\n---------------------------------------------------')
            print ''.join(wr)
            x = y = 0
            while y+1:
                y = chunk.find(eol,x)
                if y+1: yield chunk[x:y+NL] # here: NL
                else: break
                x = y+L # here: L
            print '\n\n==================================================='
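            # carry over the text not yet terminated by eol, so that a
            # line longer than lenchunk is not dropped
            tail = chunk[x:] + tail
            chunk = f.read(lenchunk)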
        yield tail
        print the_end



# 100 is an arbitrary chunk size for this demo
for line in liner('fofo.txt',':;:',100):
    print 'line== '+line
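
The answer mentions that the function could be simplified when the newline is a single character. One possible simplification is sketched below for Python 3. It is an assumption about what that simplification might look like, not the answerer's code, and the name `liner_single` is hypothetical. A one-character newline can never be cut by a chunk edge, so each chunk can be split on its own and only the unfinished last piece needs to be carried over:

```python
import io


def liner_single(fh, eol, lenchunk=65536):
    """Yield lines from `fh` when `eol` is exactly one character long."""
    if len(eol) != 1:
        raise ValueError('eol must be a single character')
    tail = ''
    while True:
        chunk = fh.read(lenchunk)
        if not chunk:
            break
        pieces = chunk.split(eol)
        # Only the first piece can continue the previous chunk's last line.
        pieces[0] = tail + pieces[0]
        # The last piece is unfinished until the next eol or end of file.
        tail = pieces.pop()
        for piece in pieces:
            yield piece
    yield tail


for line in liner_single(io.StringIO('ab|cd|e'), '|', lenchunk=3):
    print(line)
```

This avoids joining the carried-over tail to the whole chunk, which the multi-character version has to do in order to find a newline that straddles two chunks.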

EDIT

I compared the execution times of my code and chmullig's code.

With a 'fofo.txt' file of about 10 MB, created with

alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,60)))
                for i in xrange(324000))
with open('fofo.txt','wb') as g:
    g.write(ch)

and measuring the times like this:

from time import clock

te = clock()
for line in liner('fofo.txt',':;:', 65536):
    pass
print clock()-te


fh = open('fofo.txt', 'rb')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)

te = clock()
for line in zenBreaker:
    pass
print clock()-te

Over several runs, the minimum times I observed were:

my code: 0.7067 seconds

chmullig's code: 0.8373 seconds
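
Taking the minimum over several runs can be automated with the standard `timeit` module. This is a sketch for Python 3, where `time.clock` no longer exists. The helper name `best_time` and the stand-in workload are assumptions; the figures above were not produced this way:

```python
import timeit


def best_time(func, repeat=5):
    """Return the smallest time, in seconds, over `repeat` calls of func()."""
    # number=1 runs func once per measurement; min() keeps the best run.
    return min(timeit.repeat(func, number=1, repeat=repeat))


def consume_demo():
    # Stand-in workload: replace the body with a loop over liner(...) or
    # SpecialDelimiters(...) to time the real code.
    for _ in 'a:;:b:;:c'.split(':;:'):
        pass


print(best_time(consume_demo))
```

Note that `timeit` turns garbage collection off during the measurement by default, so its numbers may differ slightly from a hand-rolled timer.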

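EDIT 2

I changed my generator function: liner2() takes a file handle instead of a file name. That way the opening of the file can be left out of the timing, as it is in the measurement of chmullig's code.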

def liner2(fh,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    chunk = fh.read(lenchunk)
    tail = ''
    while chunk:
        last = chunk.rfind(eol)
        if last==-1:
            kept = chunk
            newtail = ''
        else:
            kept = chunk[0:last+L]   # here: L
            newtail = chunk[last+L:] # here: L
        chunk = tail + kept
        tail = newtail
        x = y = 0
        while y+1:
            y = chunk.find(eol,x)
            if y+1: yield chunk[x:y+NL] # here: NL
            else: break
            x = y+L # here: L
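        # carry over the text not yet terminated by eol, so that a
        # line longer than lenchunk is not dropped
        tail = chunk[x:] + tail
        chunk = fh.read(lenchunk)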
    yield tail

fh = open('fofo.txt', 'rb')
te = clock()
for line in liner2(fh,':;:', 65536):
    pass
print clock()-te

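The results, after numerous runs to find the minimum times, are:

with liner(): 0.7067 seconds

with liner2(): 0.7064 seconds

chmullig's code: 0.8373 seconds

In fact, opening the file accounts for only a tiny part of the total time.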

Given your constraints, it would be better to first convert the known unusual line breaks to normal newlines, and then use the usual:

for line in file:
    ...
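
One way to do that conversion without loading the whole file is sketched below for Python 3. The function name `convert_newlines` and its parameters are hypothetical, and the sketch assumes that `eol` is non-empty, does not contain '\n', and that the data itself contains no '\n' characters:

```python
import io


def convert_newlines(src, dst, eol, chunksize=65536):
    """Copy text from `src` to `dst`, replacing each `eol` with '\n'."""
    keep = len(eol) - 1  # longest piece of eol that a chunk edge can cut off
    carry = ''
    while True:
        chunk = src.read(chunksize)
        if not chunk:
            break
        data = (carry + chunk).replace(eol, '\n')
        # Hold back the last `keep` characters: they may be the beginning
        # of an eol that is completed by the next chunk.
        cut = max(len(data) - keep, 0)
        dst.write(data[:cut])
        carry = data[cut:]
    dst.write(carry)


src = io.StringIO('first:;:second:;:third')
dst = io.StringIO()
convert_newlines(src, dst, ':;:', chunksize=4)
dst.seek(0)
for line in dst:
    print(line.rstrip('\n'))
```

With real files, `src` and `dst` would be two file objects opened in text mode. The conversion costs one extra pass over the data and a second file on disk.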

TextFileData.split(EndOfLine_char) seems to be your solution. If it does not run fast enough, you should consider using a lower-level programming language.
