Splitting large text file into smaller text files by line numbers using Python
I have a text file really_big_file.txt that contains:
line 1
line 2
line 3
line 4
...
line 99999
line 100000
I would like to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt would have lines 1-300, small_file_600 would have lines 301-600, and so on until there are enough small files to contain all the lines from the big file.
I would appreciate any suggestions on the easiest way to accomplish this using Python.
lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()
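As a quick sanity check (not part of the original answer), the same splitting logic can be run on a hypothetical 10-line input with 3 lines per chunk; the file names follow the answer's small_file_<last line number>.txt scheme:

```python
import os

# Build a small stand-in for really_big_file.txt: lines "line 1" .. "line 10".
with open('really_big_file.txt', 'w') as f:
    for i in range(1, 11):
        f.write('line {}\n'.format(i))

lines_per_file = 3
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            # Starting a new chunk: close the previous file and open the next one.
            if smallfile:
                smallfile.close()
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, 'w')
        smallfile.write(line)
if smallfile:
    smallfile.close()

created = sorted(f for f in os.listdir('.') if f.startswith('small_file_'))
print(len(created), 'files created')  # 4 files: chunks of 3, 3, 3 and 1 lines
```

Note the last file is named small_file_12.txt even though it only holds line 10, the same naming quirk the later answers discuss.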
Using the itertools grouper recipe:
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300
with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)
The advantage of this method as opposed to storing each line in a list is that it works with iterables, line by line, so it doesn't have to store each small_file in memory at once.
Note that the last file in this case will be small_file_100200 but will only go until line 100000. This happens because fillvalue='', meaning I write nothing out to the file when I don't have any more lines left to write, because a group size doesn't divide equally. You can fix this by writing to a temp file and then renaming it afterwards instead of naming it first like I have. Here's how that can be done.
import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1):  # count number of lines in group
                if line is None:
                    j -= 1  # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))
This time fillvalue=None and I check each line for None; when it occurs, I know the process has finished, so I subtract 1 from j to not count the filler and then write the file.
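A tiny standalone run of the grouper recipe (with a visible fillvalue instead of the empty string) makes the padding behaviour described above concrete:

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # Three references to the SAME iterator, so zip_longest pulls
    # n consecutive items per output tuple.
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

# Seven items in groups of three: the last group is padded with the fillvalue.
groups = list(grouper(3, 'ABCDEFG', fillvalue='x'))
print(groups)
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
```

With fillvalue='' (as in the answer above) the padding writes nothing to the output file, which is why the last small file simply ends early.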
I do this in a more understandable way, using fewer shortcuts, in order to give you a further understanding of how and why this works. Previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.
Since you posted no code, I decided to do it this way, seeing as you could be unfamiliar with things other than basic python syntax, given that the way you phrased the question made it seem as though you had not tried nor had any clue as to how to approach the problem.
Here are the steps to do this in basic python:
First you should read your file into a list for safekeeping:
my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
Second, you need to set up a way of creating a new file by name! I would suggest a loop along with a couple of counters:
outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
Third, inside that loop you need some nested loops that will save the correct rows into an array:
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
Last thing, again in your first loop, you need to write the new file and add your last counter increment so your loop will go through again and write a new file:
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
Note: if the number of lines is not divisible by 300, the name of the last file will not correspond to its last line number.
It is important to understand why these loops work. You have it set so that on the next loop, the name of the file that you write to changes, because you have the name dependent on a changing variable. This is a very useful scripting tool for file accessing, opening, writing, organizing etc.
In case you could not follow what was in which loop, here is the entire function:
my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file, 'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count - 1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
lines_per_file = 300  # Lines on each small file
lines = []            # Stores lines not yet written on a small file
lines_counter = 0     # Same as len(lines)
created_files = 0     # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all lines on small file (each line keeps its own '\n')
                small_file.write(''.join(lines))
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created

# After for-loop has finished
if lines_counter:  # There are still some lines not written on a file?
    idx = lines_per_file * (created_files + 1)
    with open('small_file_%s.txt' % idx, 'w') as small_file:
        # Write them on a last small file
        small_file.write(''.join(lines))
    created_files += 1

print('%s small files (with %s lines each) were created.' % (created_files,
                                                             lines_per_file))
import csv
import os
import re

MAX_CHUNKS = 300

def writeRow(idr, row):
    # Append the row to the chunk file with the given index
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # Remove chunk files left over from a previous run
    for f in os.listdir("."):
        if re.search("file_.*", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'r', newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='"')
        for i, row in enumerate(r):
            # Row i belongs to chunk i // MAX_CHUNKS, so each file
            # gets exactly MAX_CHUNKS rows
            writeRow(i // MAX_CHUNKS + 1, row)

if __name__ == "__main__":
    main()
If you want to split it into 2 files, there is a very simple way, for example:
with open("myInputFile.txt", 'r') as file:
    lines = file.readlines()

with open("OutputFile1.txt", 'w') as file:
    for line in lines[:int(len(lines)/2)]:
        file.write(line)

with open("OutputFile2.txt", 'w') as file:
    for line in lines[int(len(lines)/2):]:
        file.write(line)
Making this dynamic would be:
with open("inputFile.txt", 'r') as file:
    lines = file.readlines()

Batch = 10
increase = len(lines) // Batch
start = 0
end = 0
for i in range(1, Batch + 1):
    end = end + increase
    if i == Batch:
        end = len(lines)  # put any remainder lines into the last file
    with open("splitText_" + str(i) + ".txt", 'w') as file:
        for line in lines[start:end]:
            file.write(line)
    start = end
with open('/really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile.readlines():
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
    # close the last file
    if outfile is not None:
        outfile.close()
I had to do the same for a 650000-line file.
Use the enumerate index and integer division (//) with the chunk size. When that number changes, close the current file and open a new one.
This is a python3 solution using format strings.
chunk = 50000  # number of lines from the big file to put in each small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read.readlines()):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a little
        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            # the chunk index changed: close the old file, start a new one
            this_small_file.close()
            this_small_file = open(f'{file_name}', 'a')
            this_small_file.write(line)
this_small_file.close()
Set files to the number of files that you want to split the master file into; in my example, I want to get 10 files from my master file:
files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
batchs = int(len(emails) / files)
for id, log in enumerate(emails):
    fileid = id / batchs
    file = open("minifile{file}.txt".format(file=int(fileid) + 1), 'a+')
    file.write(log)
    file.close()
In Python, files are simple iterators. That gives the option to iterate over them multiple times and always continue from the last place the previous iterator got to. Keeping this in mind, we can use islice to get the next 300 lines of the file each time in a continuous loop. The tricky part is knowing when to stop. For this we will "sample" the file for the next line, and once it is exhausted we can break the loop:
from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)
        except StopIteration:
            break
        with open(f"small_file_{i*lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            for line in islice(file, lines_per_file - 1):
                out_file.write(line)
        i += 1
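A self-contained check of this islice approach (not part of the original answer), run on a hypothetical 10-line file with 3 lines per chunk, with names following the same small_file_<i*chunk>.txt scheme:

```python
from itertools import islice

# Build a small stand-in input: lines "line 1" .. "line 10".
with open("really_big_file.txt", "w") as f:
    f.writelines("line {}\n".format(i) for i in range(1, 11))

lines_per_file = 3
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)  # sample one line; StopIteration means we are done
        except StopIteration:
            break
        with open(f"small_file_{i*lines_per_file}.txt", "w") as out_file:
            out_file.write(checker)
            # islice continues from where the file iterator stopped,
            # so this takes the remaining lines of the current chunk.
            for line in islice(file, lines_per_file - 1):
                out_file.write(line)
        i += 1

print(open("small_file_12.txt").read())  # the last chunk holds only "line 10"
```

Because the chunk boundary is checked by sampling, no empty trailing file is created, which is the main advantage over the fillvalue-based grouper version.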