
python - split file into multiple txt files at highest line number without going over a max filesize based on character count

I am new to Python and can't find a clear answer to this problem. I need to split a large text file into chunks smaller than 1MB (I use 500000 characters to be safe, since characters take 1-2 bytes), and each chunk must end at the last line break that keeps it under the limit. Since I did not see a simple way to measure the output size directly, I took the following approach to find the last line before the character limit is reached (not perfect, but it is safe on the assumption that most characters are 1 byte):

chars = words = lines = 0

with open('rawfile.txt', 'r') as in_file:

    for line in in_file:
        # stop before the line that would push the count over the limit
        if chars + len(line) > 500000:
            break
        lines += 1
        words += len(line.split())
        chars += len(line)

# number of the last line that still fits under the limit
linebreak = lines
print(linebreak)

This returns the line before the character count exceeds the 500000 character limit.

I am struggling to do the following:

Set the start_line to 0, end_line to linebreak
save start_line to end_line to a new file
start function again from line linebreak

Any suggestions? Open to a better method as well.

Don't do it that way; instead, write the lines while you're reading them the first time. When you hit a line that is about to take you over the limit, close off the current file and start a new one.

chars = words = lines = fnum = 0
limit = 500000

# the output file needs its own name and must be opened for writing
out_file = open('newfile_' + str(fnum) + '.txt', 'w')
with open('rawfile.txt', 'r') as in_file:

    for line in in_file:
        # chars > 0 avoids creating an empty file for a single oversized line
        if chars > 0 and chars + len(line) > limit:
            # close out_file and open the next one
            out_file.close()
            fnum += 1
            chars = words = lines = 0
            out_file = open('newfile_' + str(fnum) + '.txt', 'w')

        lines += 1
        words += len(line.split())
        out_file.write(line)
        chars = chars + len(line)

out_file.close()
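
Since the real constraint in the question is a size in bytes rather than a number of characters, the same write-as-you-read idea could measure the encoded length of each line instead of `len(line)`. The following is only a sketch under some assumptions: Python 3, UTF-8 text, and a hypothetical helper name `split_by_bytes` with an `out_dir` parameter that is not part of the original answer. A single line longer than the limit is still written whole, to a file of its own.

```python
import os


def split_by_bytes(src_path, out_dir, limit=1000000, encoding='utf-8'):
    """Split src_path into files of at most `limit` encoded bytes each,
    breaking only at line boundaries. Returns the list of paths written."""
    paths = []
    out_file = None
    size = 0  # bytes written to the current output file

    # newline='' keeps the original line endings untouched
    with open(src_path, 'r', encoding=encoding, newline='') as in_file:
        for line in in_file:
            data = line.encode(encoding)
            # start a new file if this line would not fit in the current one
            if out_file is None or (size > 0 and size + len(data) > limit):
                if out_file is not None:
                    out_file.close()
                path = os.path.join(out_dir, 'newfile_%d.txt' % len(paths))
                out_file = open(path, 'wb')
                paths.append(path)
                size = 0
            out_file.write(data)
            size += len(data)

    if out_file is not None:
        out_file.close()
    return paths
```

Called as `split_by_bytes('rawfile.txt', '.')`, this would write `newfile_0.txt`, `newfile_1.txt`, and so on into the current directory.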

Something like that?

# open file for reading
anin = open('temp.txt')

# set the char limit
charlimit = 100

# index of line being processed
anindex = 0

# output text buffer
anout = ''

# index of file to output
acount = 1

def wrapFile():
    # acount is reassigned below, so it must be declared global as well
    global anout, acount

    if anout == '': return

    achunk = 'chunk.' + str(acount) + '.txt' 
    achunk = open(achunk, 'w')
    achunk.write(anout)
    achunk.close()
    acount += 1
    anout = ''

while True:
    anindex += 1
    aline = anin.readline()

    # EOF case
    if aline == '':
        wrapFile()
        anin.close()
        break

    # next line within limit case
    if len(anout + aline) <= charlimit:
        anout += aline
        continue

    # next line out of limit cases
    if len(anout) > 0:
        wrapFile()

    anout = aline

    # new line is within the char limit itself
    if len(anout) <= charlimit:
        continue

    # new line exceeds char limit
    print('Line %d alone exceeds the given char limit!' % anindex)
    wrapFile()
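
The buffering logic above could also be written without module-level globals, for instance as a generator that yields one chunk of text at a time. This is a sketch rather than the answer's own code: it assumes Python 3, and the names `chunk_lines` and `write_chunks` are made up for illustration. As in the loop above, a line that is longer than the limit by itself ends up in a chunk of its own.

```python
def chunk_lines(lines, limit):
    """Yield strings built from consecutive lines, each at most `limit`
    characters long unless a single line is itself longer than `limit`."""
    buf = []
    size = 0
    for line in lines:
        # flush the buffer before it would grow past the limit
        if buf and size + len(line) > limit:
            yield ''.join(buf)
            buf, size = [], 0
        buf.append(line)
        size += len(line)
    if buf:
        yield ''.join(buf)


def write_chunks(src_path, limit, pattern='chunk.%d.txt'):
    """Write each chunk to its own numbered file; returns the file count."""
    count = 0
    with open(src_path) as src:
        for count, chunk in enumerate(chunk_lines(src, limit), start=1):
            with open(pattern % count, 'w') as out:
                out.write(chunk)
    return count
```

For example, `write_chunks('temp.txt', 100)` would produce `chunk.1.txt`, `chunk.2.txt`, and so on, matching the file names used above.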

