
Splitting a large text file into smaller text files by line numbers using Python

I have a text file really_big_file.txt that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I want to write a Python script that will split really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt would have lines 1-300, small_file_600 would have lines 301-600, and so on until there are enough small files to contain all the lines of the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:  # time to roll over to a new small file
            if smallfile:
                smallfile.close()
            # name the file after the last line number it will contain
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

Using the itertools grouper recipe:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

The advantage of this method, as opposed to storing each line in a list, is that it works with iterables line by line, so it doesn't have to store each small_file in memory all at once.
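
To see what the recipe yields on its own, here is a quick demonstration using the grouper defined above (mirroring its docstring comment):

# Groups of 3 from a 7-item iterable; the final group is padded with 'x'.
print(list(grouper(3, 'ABCDEFG', 'x')))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]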

Note that in this case the last file will be small_file_100200, but it will only go until line 100000. This happens because fillvalue='', meaning I write nothing out to the file when I don't have any more lines left to write, because the group size doesn't divide evenly. You can fix this by writing to a temp file and renaming it afterwards, instead of naming it first like I have. Here's how that can be done.

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None and I check every line for None; when it occurs, I know the process has finished, so I subtract 1 from j so as not to count the filler, and then write the file.
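
On Python 3.12 and newer, itertools.batched does the same grouping with no fill value, so the padding issue never arises; a minimal sketch along the lines of the code above (the filename convention still rounds the last chunk up to a multiple of 300):

from itertools import batched  # Python 3.12+

lines_per_file = 300
with open('really_big_file.txt') as f:
    # batched() yields tuples of up to lines_per_file lines with no padding
    # on the final group, so every line written came from the input file.
    for i, group in enumerate(batched(f, lines_per_file), 1):
        with open('small_file_{0}.txt'.format(i * lines_per_file), 'w') as fout:
            fout.writelines(group)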

I did this in a more understandable way, using fewer shortcuts, in order to give you a further understanding of how and why this works. Previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.

Since you posted no code, I decided to do it this way, seeing as you could be unfamiliar with things other than basic python syntax, given that the way you phrased the question made it seem as though you had not tried nor had any clue as to how to approach the problem.

Here are the steps to do this in basic python:

First, you should read your file into a list for safekeeping:

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need to set up a way of creating a new file by name! I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that will save the correct rows into an array:

    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

Last thing: again inside your first loop, you need to write the new file and add your last counter increment so your loop will go through again and write a new file:

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

NOTE: if the number of lines is not divisible by 300, the name of the last file will not correspond to its actual last line.
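
One possible fix, reusing the loop's own counters: build file_name after the chunk has been collected, when line_count already equals the number of the last line copied. A minimal sketch of just that change:

    # After the inner while loops have filled hold_new_lines, line_count is
    # the 1-based number of the last line copied, so use it for the name.
    file_name = "small_file_" + str(line_count) + ".txt"
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)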

了解為什么這些循環起作用很重要。 您設置了它,以便在下一個循環中,您寫入的文件的名稱會發生​​變化,因為您的名稱依賴於一個不斷變化的變量。 這是一個非常有用的腳本工具,用於文件訪問、打開、寫入、組織等。

In case you could not follow what was in which loop, here is the entirety of the function:

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all buffered lines to the small file (each line
                # already ends with a newline)
                small_file.writelines(lines)
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.writelines(lines)
        created_files += 1

print('%s small files (with %s lines each) were created.' % (created_files,
                                                              lines_per_file))
import csv
import os
import re

MAX_CHUNKS = 300


def writeRow(idr, row):
    # text mode with newline='' is required by the csv module on Python 3
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # Remove output files left over from a previous run
    for f in os.listdir("."):
        if re.match(r"file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'r', newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='"')
        for i, x in enumerate(r):
            # Integer division maps rows 0..299 to file 1, 300..599 to file 2, ...
            idr = i // MAX_CHUNKS + 1
            writeRow(idr, x)

if __name__ == "__main__":
    main()

If you want to split it into 2 files, there is a really easy way to do it, for example:

with open("myInputFile.txt",'r') as file:
    lines = file.readlines()

with open("OutputFile1.txt",'w') as file:
    for line in lines[:int(len(lines)/2)]:
        file.write(line)

with open("OutputFile2.txt",'w') as file:
    for line in lines[int(len(lines)/2):]:
        file.write(line)

Making this dynamic would be:

with open("inputFile.txt",'r') as file:
    lines = file.readlines()

Batch = 10
end = 0
for i in range(1,Batch + 1):
    if i == 1:
        start = 0
    increase = int(len(lines)/Batch)
    end = end + increase
    with open("splitText_" + str(i) + ".txt",'w') as file:
        for line in lines[start:end]:
            file.write(line)
    
    start = end

with open('really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile:  # iterate lazily instead of reading all lines at once
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
    # close the last file
    if outfile is not None:
        outfile.close()

I had to do the same with a 650000 line file.

Use the enumerate index and integer division (//) by the chunk size

When that number changes, close the current file and open a new one

This is a python3 solution using format strings.

chunk = 50000  # number of lines from the big file to put in small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a lot

        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            # the chunk index changed: switch to a new file first, then write
            this_small_file.close()
            this_small_file = open(file_name, 'a')
            this_small_file.write(line)

this_small_file.close()  # close the last file

files is set to the number of files you want to split the main file into; in my example, I want to get 10 files from my main file

files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
    batchs = len(emails) // files  # lines per output file
    for id, log in enumerate(emails):
        # integer division picks the output file; cap it so any remainder
        # lines land in the last file instead of an extra one
        fileid = min(id // batchs, files - 1)
        with open("minifile{file}.txt".format(file=fileid + 1), 'a+') as file:
            file.write(log)

In Python, files are simple iterators. That gives the option to iterate over them multiple times, always continuing from the last place the previous iterator got to. Keeping this in mind, we can use islice to get the next 300 lines of the file each time in a continuous loop. The tricky part is knowing when to stop. For this we will "sample" the file for the next line and, once it is exhausted, we can break the loop:

from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)
        except StopIteration:
            break
        with open(f"small_file_{i*lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            for line in islice(file, lines_per_file-1):
                out_file.write(line)
        i += 1
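
The same idea can be written without the explicit sampling by using the two-argument form of iter(), which keeps calling islice until it comes back empty; a minimal equivalent sketch:

from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    # iter(callable, sentinel) calls the lambda until it returns the
    # sentinel [] - i.e. until islice finds no more lines to take.
    chunks = iter(lambda: list(islice(file, lines_per_file)), [])
    for i, chunk in enumerate(chunks, 1):
        with open(f"small_file_{i * lines_per_file}.txt", 'w') as out_file:
            out_file.writelines(chunk)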
