
Splitting a large text file into smaller text files by line numbers using Python

I have a text file really_big_file.txt that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I want to write a Python script that will split really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt would have lines 1-300, small_file_600 would have lines 301-600, and so on until there are enough small files to contain all the lines of the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:  # time to roll over to a new small file
            if smallfile:
                smallfile.close()
            # name the file after the last line number it will contain
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()

Using the itertools grouper recipe:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

The advantage of this method, as opposed to storing each line in a list, is that it works with iterables line by line, so it doesn't have to store each small_file in memory all at once.
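
To see what the recipe yields on its own, here is a quick demonstration using the grouper defined above (mirroring its docstring comment):

# Groups of 3 from a 7-item iterable; the final group is padded with 'x'.
print(list(grouper(3, 'ABCDEFG', 'x')))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]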

Note that in this case the last file will be small_file_100200, but it will only go until line 100000. This happens because fillvalue='', meaning I write nothing out to the file when I don't have any more lines left to write, because the group size doesn't divide evenly. You can fix this by writing to a temp file and renaming it afterwards, instead of naming it first like I have. Here's how that can be done.

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        with tempfile.NamedTemporaryFile('w', delete=False) as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None and I check every line for None; when it occurs, I know the process has finished, so I subtract 1 from j so as not to count the filler, and then write the file.
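
On Python 3.12 and newer, itertools.batched does the same grouping with no fill value, so the padding issue never arises; a minimal sketch along the lines of the code above (the filename convention still rounds the last chunk up to a multiple of 300):

from itertools import batched  # Python 3.12+

lines_per_file = 300
with open('really_big_file.txt') as f:
    # batched() yields tuples of up to lines_per_file lines with no padding
    # on the final group, so every line written came from the input file.
    for i, group in enumerate(batched(f, lines_per_file), 1):
        with open('small_file_{0}.txt'.format(i * lines_per_file), 'w') as fout:
            fout.writelines(group)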

I did this in a more understandable way, using fewer shortcuts, in order to give you a further understanding of how and why this works. Previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.

Since you posted no code, I decided to do it this way, seeing as you could be unfamiliar with things other than basic python syntax, given that the way you phrased the question made it seem as though you had not tried nor had any clue as to how to approach the problem.

Here are the steps to do this in basic python:

First, you should read your file into a list for safekeeping:

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need to set up a way of creating a new file by name! I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that will save the correct rows into an array:

    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

Last thing: again inside your first loop, you need to write the new file and add your last counter increment so your loop will go through again and write a new file:

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

NOTE: if the number of lines is not divisible by 300, the name of the last file will not correspond to its actual last line.
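
One possible fix, reusing the loop's own counters: build file_name after the chunk has been collected, when line_count already equals the number of the last line copied. A minimal sketch of just that change:

    # After the inner while loops have filled hold_new_lines, line_count is
    # the 1-based number of the last line copied, so use it for the name.
    file_name = "small_file_" + str(line_count) + ".txt"
    with open(file_name, 'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)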

了解為什么這些循環起作用很重要。 您設置了它,以便在下一個循環中,您寫入的文件的名稱會發生​​變化,因為您的名稱依賴於一個不斷變化的變量。 這是一個非常有用的腳本工具,用於文件訪問、打開、寫入、組織等。

In case you could not follow what was in which loop, here is the entirety of the function:

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left < 300:
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

lines_per_file = 300  # Lines on each small file
lines = []  # Stores lines not yet written on a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counting how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all buffered lines to the small file (each line
                # already ends with a newline)
                small_file.writelines(lines)
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After for-loop has finished
    if lines_counter:  # There are still some lines not written on a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them on a last small file
            small_file.writelines(lines)
        created_files += 1

print('%s small files (with %s lines each) were created.' % (created_files,
                                                              lines_per_file))
import csv
import os
import re

MAX_CHUNKS = 300


def writeRow(idr, row):
    # text mode with newline='' is required by the csv module on Python 3
    with open("file_%d.csv" % idr, 'a', newline='') as file:
        writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # Remove output files left over from a previous run
    for f in os.listdir("."):
        if re.match(r"file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", 'r', newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='"')
        for i, x in enumerate(r):
            # Integer division maps rows 0..299 to file 1, 300..599 to file 2, ...
            idr = i // MAX_CHUNKS + 1
            writeRow(idr, x)

if __name__ == "__main__":
    main()

If you want to split it into 2 files, there is a really easy way to do it, for example:

with open("myInputFile.txt",'r') as file:
    lines = file.readlines()

with open("OutputFile1.txt",'w') as file:
    for line in lines[:int(len(lines)/2)]:
        file.write(line)

with open("OutputFile2.txt",'w') as file:
    for line in lines[int(len(lines)/2):]:
        file.write(line)

Making this dynamic would be:

with open("inputFile.txt",'r') as file:
    lines = file.readlines()

Batch = 10
end = 0
for i in range(1,Batch + 1):
    if i == 1:
        start = 0
    increase = int(len(lines)/Batch)
    end = end + increase
    with open("splitText_" + str(i) + ".txt",'w') as file:
        for line in lines[start:end]:
            file.write(line)
    
    start = end

with open('really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile:  # iterate lazily instead of reading all lines at once
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
    # close the last file
    if outfile is not None:
        outfile.close()

I had to do the same with a 650000 line file.

Use the enumerate index and integer division (//) by the chunk size

When that number changes, close the current file and open a new one

This is a python3 solution using format strings.

chunk = 50000  # number of lines from the big file to put in small file
this_small_file = open('./a_folder/0', 'a')

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down a lot

        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            # the chunk index changed: switch to a new file first, then write
            this_small_file.close()
            this_small_file = open(file_name, 'a')
            this_small_file.write(line)

this_small_file.close()  # close the last file

files is set to the number of files you want to split the main file into; in my example, I want to get 10 files from my main file

files = 10
with open("data.txt", "r") as data:
    emails = data.readlines()
    batchs = len(emails) // files  # lines per output file
    for id, log in enumerate(emails):
        # integer division picks the output file; cap it so any remainder
        # lines land in the last file instead of an extra one
        fileid = min(id // batchs, files - 1)
        with open("minifile{file}.txt".format(file=fileid + 1), 'a+') as file:
            file.write(log)

In Python, files are simple iterators. That gives the option to iterate over them multiple times, always continuing from the last place the previous iterator got to. Keeping this in mind, we can use islice to get the next 300 lines of the file each time in a continuous loop. The tricky part is knowing when to stop. For this we will "sample" the file for the next line and, once it is exhausted, we can break the loop:

from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)
        except StopIteration:
            break
        with open(f"small_file_{i*lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            for line in islice(file, lines_per_file-1):
                out_file.write(line)
        i += 1
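
The same idea can be written without the explicit sampling by using the two-argument form of iter(), which keeps calling islice until it comes back empty; a minimal equivalent sketch:

from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    # iter(callable, sentinel) calls the lambda until it returns the
    # sentinel [] - i.e. until islice finds no more lines to take.
    chunks = iter(lambda: list(islice(file, lines_per_file)), [])
    for i, chunk in enumerate(chunks, 1):
        with open(f"small_file_{i * lines_per_file}.txt", 'w') as out_file:
            out_file.writelines(chunk)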
