I am new to Python and cannot find a clear answer to this problem. I need to split a large text file into chunks smaller than 1 MB (500000 characters, to be safe with 1-2 byte characters), but each chunk must end at the closest line break without going over the limit. Since I found no clear way to determine the file size, I took the following approach to find the last line before the character limit is reached (not perfect, but safe on the assumption that most characters are 1 byte):
    chars = words = lines = 0
    with open('rawfile.txt', 'r') as in_file:
        for line in in_file:
            # stop at the first line that would push the count over the limit
            if chars + len(line) > 500000:
                break
            lines += 1
            words += len(line.split())
            chars += len(line)
            #print lines, words, chars
    linebreak = lines - 1
    print(linebreak)
    chars = words = lines = 0
This returns the line before the character count exceeds the 500000 character limit.
I am struggling to do the following:
Set start_line to 0 and end_line to linebreak.
Save lines start_line through end_line to a new file.
Run the function again starting from line linebreak.
Any suggestions? Open to a better method as well.
Don't do it that way; instead, write the lines while you're reading them the first time. When you hit a line that is about to take you over the limit, close off the current file and start a new one.
    chars = words = lines = fnum = 0
    limit = 500000
    # the output file needs its own name and must be opened for writing
    out_file = open('newfile_' + str(fnum) + '.txt', 'w')
    with open('rawfile.txt', 'r') as in_file:
        for line in in_file:
            if chars + len(line) > limit:
                # close out_file and open the next one
                out_file.close()
                fnum += 1
                # reset the per-file counters, but keep fnum counting up
                chars = words = lines = 0
                out_file = open('newfile_' + str(fnum) + '.txt', 'w')
            out_file.write(line)
            lines += 1
            words += len(line.split())
            chars = chars + len(line)
    # close the last chunk
    out_file.close()
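If the limit is really about file size rather than character count, one option is to read the file in binary mode, where the length of each line is its size in bytes. The following is only a sketch of that idea in Python 3, not part of the answer above: the function name `split_by_bytes` and its parameters (`out_dir`, `limit`, `prefix`) are made up for illustration, and it assumes an ASCII-compatible encoding such as UTF-8, so that a newline byte never falls inside a multi-byte character.

```python
import os


def split_by_bytes(src_path, out_dir, limit=1000000, prefix='newfile_'):
    """Copy src_path into numbered files of at most `limit` bytes each,
    breaking only at line ends. Returns the list of paths written.
    (Hypothetical helper; names and defaults are assumptions.)"""
    paths = []
    out_file = None
    size = 0
    # binary mode: len(line) is the exact number of bytes on disk
    with open(src_path, 'rb') as in_file:
        for line in in_file:
            # start a new chunk when this line would push the current one over
            if out_file is None or (size > 0 and size + len(line) > limit):
                if out_file is not None:
                    out_file.close()
                path = os.path.join(out_dir, '%s%d.txt' % (prefix, len(paths)))
                out_file = open(path, 'wb')
                paths.append(path)
                size = 0
            out_file.write(line)
            size += len(line)
    if out_file is not None:
        out_file.close()
    return paths
```

Calling `split_by_bytes('rawfile.txt', '.')` would write `newfile_0.txt`, `newfile_1.txt`, and so on into the current directory. A single line longer than `limit` still ends up in a chunk of its own that exceeds the limit, since the sketch never splits inside a line.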
Something like that?
    # open file for reading
    anin = open('temp.txt')
    # set the char limit
    charlimit = 100
    # index of line being processed
    anindex = 0
    # output text buffer
    anout = ''
    # index of file to output
    acount = 1

    def wrapFile():
        # both names are reassigned here, so both must be declared global
        global anout, acount
        if anout == '': return
        achunk = 'chunk.' + str(acount) + '.txt'
        achunk = open(achunk, 'w')
        achunk.write(anout)
        achunk.close()
        acount += 1
        anout = ''

    while True:
        anindex += 1
        aline = anin.readline()
        # EOF case
        if aline == '':
            wrapFile()
            anin.close()
            break
        # next line within limit case
        if len(anout + aline) <= charlimit:
            anout += aline
            continue
        # next line out of limit cases
        if len(anout) > 0:
            wrapFile()
        anout = aline
        # new line fits within the char limit itself
        if len(anout) <= charlimit:
            continue
        # new line exceeds char limit
        print('Line %d alone exceeds the given char limit!' % anindex)
        wrapFile()
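The same buffering idea can also be written without globals, which makes it easier to test. This is a rough Python 3 sketch, not the code above: `chunk_lines` and `write_chunks` are hypothetical names, and the `chunk.%d.txt` pattern simply mirrors the file names used above. The grouping logic is kept separate from the file writing, so it works on any iterable of lines.

```python
def chunk_lines(lines, limit):
    """Yield strings made of whole lines, each at most `limit` characters,
    except when a single line is longer than `limit` by itself."""
    buf = []
    size = 0
    for line in lines:
        # flush the buffer before a line that would not fit
        if buf and size + len(line) > limit:
            yield ''.join(buf)
            buf, size = [], 0
        buf.append(line)
        size += len(line)
    # flush whatever is left at end of input
    if buf:
        yield ''.join(buf)


def write_chunks(src_path, limit, pattern='chunk.%d.txt'):
    """Write each chunk of src_path to its own numbered file."""
    with open(src_path) as in_file:
        for count, chunk in enumerate(chunk_lines(in_file, limit), 1):
            with open(pattern % count, 'w') as out_file:
                out_file.write(chunk)
```

For example, `write_chunks('temp.txt', 100)` would produce `chunk.1.txt`, `chunk.2.txt`, and so on. Unlike the version above, this sketch does not print a warning for an oversized line; it just emits that line as a chunk of its own.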