Special End-line characters/string from lines read from text file, using Python

I need to read lines from a text file where the 'end of line' character is not always \n or \x or a combination of them. It may be any sequence of characters, such as 'xyz' or '|', but the 'end of line' is always the same and is known in advance for each type of file.

As the text file may be large, I have to keep performance and memory usage in mind. What is the best solution? Today I use a combination of file.read(1000) and split(myendofline) or partition(myendofline), but I would like to know whether a more elegant and standard solution exists.

Obviously the simplest approach would be to read the whole thing and then call .split('|').

However if that's undesirable because it requires you to read the whole thing into memory you might read in arbitrary chunks and perform the split on them. You could write a class that grabs another arbitrary chunk when the current one runs out, and the rest of your application doesn't need to know about it.
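The chunked approach described above can be sketched as a small generator. This is only an illustration written for Python 3 (the code in the rest of this thread is Python 2); the name iter_records and its parameters are hypothetical, and it assumes a file object opened in text mode.

```python
import io


def iter_records(fh, terminator, chunksize=65536):
    """Yield the records of text file object fh, split on terminator.

    Hypothetical helper: reads chunksize characters at a time so the
    whole file is never held in memory at once.
    """
    buf = ''
    while True:
        chunk = fh.read(chunksize)
        if not chunk:
            break  # end of file
        buf += chunk
        parts = buf.split(terminator)
        # The last piece may be an incomplete record, or may end with the
        # first characters of a terminator cut by the chunk boundary, so
        # it is carried over to the next round instead of being yielded.
        buf = parts.pop()
        for part in parts:
            yield part
    # Whatever follows the last terminator is the final record.
    yield buf


# Small demonstration on an in-memory file with a 3-character terminator.
demo = io.StringIO('alpha:;:beta:;:gamma')
records = list(iter_records(demo, ':;:', chunksize=4))
```

The result matches what 'alpha:;:beta:;:gamma'.split(':;:') returns. Note that a record much longer than chunksize is re-split on every read, so a tiny chunksize is fine for a demonstration but a poor choice for real files.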

Here's the input, zen.txt

The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!

Here's my little test case, which works for me. It doesn't handle a whole bunch of corner cases, nor is it particularly pretty, but it should get you started.

class SpecialDelimiters(object):
    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''
        self.lines = []
        self.done = False

    def __iter__(self):
        return self

    def next(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            #The lines list is empty, so let's read some more!
            while True:
                #Looping so even if our chunksize is smaller than one line we get at least one chunk
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    #we want to keep going until we have at least one block
                    #or reached the end of the file
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                #The end of the road, return last remaining stuff
                self.done = True
                return self.chunk               

zenfh = open('zen.txt', 'rb')
zenBreaker = SpecialDelimiters(zenfh, '|')
for line in zenBreaker:
    print line  

Here's a generator function that acts as an iterator over a file, cutting it into lines according to an exotic newline that is identical throughout the file.

It reads the file in chunks of lenchunk characters and yields the lines found in each chunk, chunk after chunk.

Since the newline is 3 characters long in my example (':;:'), a chunk may end in the middle of a newline. The generator function takes care of this possibility and still produces the correct lines.

If the newline were only one character, the function could be simplified. I wrote the function only for the most delicate case.

Using this function lets you read a file one line at a time, without reading the entire file into memory.

from random import randrange, choice


# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines

def liner(filename,eol,lenchunk,nl=0):
    # nl = 0 or 1 acts as 0 or 1 in splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            last = chunk.rfind(eol)
            if last==-1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]   # here: L
                newtail = chunk[last+L:] # here: L
            chunk = tail + kept
            tail = newtail
            x = y = 0
            while y+1:
                y = chunk.find(eol,x)
                if y+1: yield chunk[x:y+NL] # here: NL
                else: break
                x = y+L # here: L
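            # carry over any text left after the last eol found above
            # (otherwise a chunk containing no complete eol would be lost)
            tail = chunk[x:] + tail
            chunk = f.read(lenchunk)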
        yield tail



# lenchunk is a required argument; 100 is an arbitrary chunk size
for line in liner('fofo.txt',':;:',100):
    print line

Here's the same function with print statements added here and there, so that you can follow the algorithm.

from random import randrange, choice


# this part creates an example file whose newline is :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines

def liner(filename,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0,0)
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                     '\nchunk== '+chunk+\
                     '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                     '\n---------------------------------------------------']
            last = chunk.rfind(eol)
            if last==-1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]   # here: L
                newtail = chunk[last+L:] # here: L
            wr.append('\nkept== '+kept+\
                      '\n---------------------------------------------------'+\
                      '\nnewtail== '+newtail)
            chunk = tail + kept
            tail = newtail
            wr.append('\n---------------------------------------------------'+\
                      '\ntail + kept== '+chunk+\
                      '\n---------------------------------------------------')
            print ''.join(wr)
            x = y = 0
            while y+1:
                y = chunk.find(eol,x)
                if y+1: yield chunk[x:y+NL] # here: NL
                else: break
                x = y+L # here: L
            print '\n\n==================================================='
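            # carry over any text left after the last eol found above
            # (otherwise a chunk containing no complete eol would be lost)
            tail = chunk[x:] + tail
            chunk = f.read(lenchunk)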
        yield tail
        print the_end



# the third positional argument is lenchunk; 100 is an arbitrary chunk size
for line in liner('fofo.txt',':;:',100):
    print 'line== '+line

EDIT

I compared the execution times of my code and of chmullig's code.

With a 'fofo.txt' file of about 10 MB, created with

alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,60)))
                for i in xrange(324000))
with open('fofo.txt','wb') as g:
    g.write(ch)

and measuring the times like this:

from time import clock

te = clock()
for line in liner('fofo.txt',':;:', 65536):
    pass
print clock()-te


fh = open('fofo.txt', 'rb')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)

te = clock()
for line in zenBreaker:
    pass
print clock()-te

I obtained the following minimum times, observed over several runs:

my code: 0.7067 seconds

chmullig's code: 0.8373 seconds

EDIT 2

I changed my generator function: liner2() takes a file handle instead of the file's name. That way the opening of the file can be kept out of the timed section, as it is when measuring chmullig's code.

def liner2(fh,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    chunk = fh.read(lenchunk)
    tail = ''
    while chunk:
        last = chunk.rfind(eol)
        if last==-1:
            kept = chunk
            newtail = ''
        else:
            kept = chunk[0:last+L]   # here: L
            newtail = chunk[last+L:] # here: L
        chunk = tail + kept
        tail = newtail
        x = y = 0
        while y+1:
            y = chunk.find(eol,x)
            if y+1: yield chunk[x:y+NL] # here: NL
            else: break
            x = y+L # here: L
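        # carry over any text left after the last eol found above
        # (otherwise a chunk containing no complete eol would be lost)
        tail = chunk[x:] + tail
        chunk = fh.read(lenchunk)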
    yield tail

fh = open('fofo.txt', 'rb')
te = clock()
for line in liner2(fh,':;:', 65536):
    pass
print clock()-te

The results, taking the minimum times over numerous runs, are

with liner(): 0.7067 seconds

with liner2(): 0.7064 seconds

chmullig's code: 0.8373 seconds

In fact, opening the file accounts for a negligible part of the total time.
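For readers on Python 3, where time.clock() no longer exists (it was removed in Python 3.8), the same "minimum over several runs" measurement might be sketched with time.perf_counter(). The helper name best_time and its parameters are hypothetical and not part of the code above.

```python
import time


def best_time(make_iterable, repeats=5):
    """Return the minimum time in seconds needed to exhaust make_iterable().

    make_iterable is called once per run so that every run gets a fresh
    iterator; only the iteration itself is timed.
    """
    best = float('inf')
    for _ in range(repeats):
        iterable = make_iterable()
        start = time.perf_counter()
        for _item in iterable:
            pass
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


# Stand-in workload; replace the lambda with a call that builds the
# line iterator you want to measure.
fastest = best_time(lambda: iter(range(100000)), repeats=3)
```

Taking the minimum rather than the mean follows the approach used above: it reduces the influence of other processes running on the machine.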

Given your constraints, it might be best to convert the known unusual newlines to normal newlines first and then use the usual:

for line in file:
    ...
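A minimal sketch of that conversion step, written for Python 3 and streaming chunk by chunk so the file is never fully loaded. The function name convert_terminators is hypothetical, and the sketch assumes that neither the data nor the terminator contains '\n'; otherwise the converted file would be ambiguous.

```python
import io


def convert_terminators(src, dst, terminator, chunksize=65536):
    """Copy text from src to dst, replacing each terminator with a newline.

    Assumption: the input contains no newline characters of its own.
    """
    # A terminator cut by a chunk boundary leaves at most
    # len(terminator) - 1 of its characters at the end of the buffer.
    keep = len(terminator) - 1
    buf = ''
    while True:
        chunk = src.read(chunksize)
        if not chunk:
            break  # end of input
        buf = (buf + chunk).replace(terminator, '\n')
        # Write everything except the last `keep` characters, which are
        # held back in case they are the start of a terminator.
        cut = max(len(buf) - keep, 0)
        dst.write(buf[:cut])
        buf = buf[cut:]
    dst.write(buf)  # flush what was held back


# Demonstration with in-memory files instead of real ones.
source = io.StringIO('first:;:second:;::;:last')
target = io.StringIO()
convert_terminators(source, target, ':;:', chunksize=5)
target.seek(0)
converted_lines = [line.rstrip('\n') for line in target]
```

With real files, src and dst would be two files opened in text mode, and the converted copy could then be read with the ordinary line iteration shown above.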

TextFileData.split(EndOfLine_char) seems to be your solution. If that is not fast enough, you should consider using a lower-level programming language.
