
Splitting large text file into smaller text files by line numbers using Python

I have a text file, say really_big_file.txt, that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I would like to write a Python script that divides really_big_file.txt into smaller files with 300 lines each. For example, small_file_300.txt to have lines 1-300, small_file_600 to have lines 301-600, and so on until there are enough small files made to contain all the lines from the big file.

I would appreciate any suggestions on the easiest way to accomplish this using Python.

lines_per_file = 300
smallfile = None
with open('really_big_file.txt') as bigfile:
    for lineno, line in enumerate(bigfile):
        if lineno % lines_per_file == 0:
            if smallfile:
                smallfile.close()  # finish the previous chunk
            # name each file after the highest line number it can hold
            small_filename = 'small_file_{}.txt'.format(lineno + lines_per_file)
            smallfile = open(small_filename, "w")
        smallfile.write(line)
    if smallfile:
        smallfile.close()  # close the final chunk
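If you want to double-check the result, here is a hypothetical sanity check (not part of the original answer; it assumes the small_file_*.txt outputs landed in the current working directory):

import glob

# The small files together should contain exactly as many lines
# as really_big_file.txt did.
total = 0
for name in glob.glob('small_file_*.txt'):
    with open(name) as f:
        total += sum(1 for _ in f)
print(total)  # should match the line count of really_big_file.txt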

Using the itertools grouper recipe:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

n = 300

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
        with open('small_file_{0}'.format(i * n), 'w') as fout:
            fout.writelines(g)

The advantage of this method, as opposed to storing each line in a list, is that it works with iterables, line by line, so it never has to hold the whole big file in memory at once (only one group of 300 lines is materialized at a time).
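To make that concrete, here is a small illustrative snippet (not from the original answer) showing that grouper is lazy and only pulls items from its source as each group is requested:

from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)

nums = iter(range(10))
groups = grouper(3, nums)   # nothing has been consumed yet
print(next(groups))         # (0, 1, 2) -- only now are 3 items pulled
print(next(nums))           # 3 -- the source has only advanced past the first group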

Note that the last file in this case will be small_file_100200 but will only go until line 100000. This happens because fillvalue='': I write nothing to the file when there are no more lines left to write, because the group size doesn't divide the line count evenly. You can fix this by writing to a temp file and renaming it afterwards, instead of naming it first like I have. Here's how that can be done.

import os, tempfile

with open('really_big_file.txt') as f:
    for i, g in enumerate(grouper(n, f, fillvalue=None)):
        # dir='.' keeps the temp file on the same filesystem, so the
        # os.rename below cannot fail with a cross-device error
        with tempfile.NamedTemporaryFile('w', delete=False, dir='.') as fout:
            for j, line in enumerate(g, 1): # count number of lines in group
                if line is None:
                    j -= 1 # don't count this line
                    break
                fout.write(line)
        os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))

This time fillvalue=None, and I go through each line checking for None. When it occurs, I know the process has finished, so I subtract 1 from j to not count the filler, and then write the file.
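As a quick illustration (reusing the grouper recipe defined above), this is what the padded last group looks like when the line count doesn't divide evenly:

# With 5 items and groups of 3, the final group is padded with None,
# which is exactly what the `if line is None` check detects.
print(list(grouper(3, ['a\n', 'b\n', 'c\n', 'd\n', 'e\n'])))
# [('a\n', 'b\n', 'c\n'), ('d\n', 'e\n', None)]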

I do this in a more understandable way, using fewer shortcuts, in order to give you a further understanding of how and why this works. Previous answers work, but if you are not familiar with certain built-in functions, you will not understand what the function is doing.

Because you posted no code, I decided to do it this way, since you may be unfamiliar with anything beyond basic Python syntax, given that the way you phrased the question made it seem as though you had not tried, nor had any clue how to approach it.

Here are the steps to do this in basic Python:

First you should read your file into a list for safekeeping:

my_file = 'really_big_file.txt'
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)

Second, you need to set up a way of creating the new files by name! I would suggest a loop along with a couple of counters:

outer_count = 1
line_count = 0
sorting = True
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"

Third, inside that loop you need some nested loops that will save the correct rows into an array:

    hold_new_lines = []
    if left <= 300:  # <= so an extra empty file isn't created when the
                     # line count divides evenly by 300
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1

Lastly, back in your first loop, you need to write the new file and increment your last counter so that the loop runs again and writes a new file:

    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)

Note: if the number of lines is not divisible by 300, the last file will have a name that does not correspond to its last line number.

It is important to understand why these loops work. You have it set up so that on the next loop, the name of the file that you write changes, because the name depends on a changing variable. This is a very useful scripting tool for file accessing, opening, writing, organizing, etc.
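A tiny standalone illustration of that naming idea (hypothetical, just to show the pattern):

# The file name is rebuilt from the counter on every pass,
# so each loop iteration targets a different file.
for outer_count in range(1, 4):
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    print(file_name)
# small_file_300.txt
# small_file_600.txt
# small_file_900.txt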

In case you could not follow what was in which loop, here is the entirety of the function:

my_file = 'really_big_file.txt'
sorting = True
hold_lines = []
with open(my_file,'r') as text_file:
    for row in text_file:
        hold_lines.append(row)
outer_count = 1
line_count = 0
while sorting:
    count = 0
    increment = (outer_count-1) * 300
    left = len(hold_lines) - increment
    file_name = "small_file_" + str(outer_count * 300) + ".txt"
    hold_new_lines = []
    if left <= 300:  # <= so an extra empty file isn't created when the
                     # line count divides evenly by 300
        while count < left:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
        sorting = False
    else:
        while count < 300:
            hold_new_lines.append(hold_lines[line_count])
            count += 1
            line_count += 1
    outer_count += 1
    with open(file_name,'w') as next_file:
        for row in hold_new_lines:
            next_file.write(row)
Another option is to buffer the lines and write them out every 300:

lines_per_file = 300  # Lines in each small file
lines = []  # Stores lines not yet written to a small file
lines_counter = 0  # Same as len(lines)
created_files = 0  # Counts how many small files have been created

with open('really_big_file.txt') as big_file:
    for line in big_file:  # Go through the whole big file
        lines.append(line)
        lines_counter += 1
        if lines_counter == lines_per_file:
            idx = lines_per_file * (created_files + 1)
            with open('small_file_%s.txt' % idx, 'w') as small_file:
                # Write all buffered lines to the small file
                # (each line already ends with a newline)
                small_file.writelines(lines)
            lines = []  # Reset variables
            lines_counter = 0
            created_files += 1  # One more small file has been created
    # After the for-loop has finished
    if lines_counter:  # There are still some lines not written to a file?
        idx = lines_per_file * (created_files + 1)
        with open('small_file_%s.txt' % idx, 'w') as small_file:
            # Write them to a last small file
            small_file.writelines(lines)
        created_files += 1

print('%s small files (with %s lines each) were created.'
      % (created_files, lines_per_file))
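A note on this design: buffering lines_per_file lines and writing them in one go keeps file opens and writes to a minimum, while never holding more than one small file's worth of lines in memory.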
For a CSV input, the csv module can do the same chunking row by row (Python 3):

import csv
import os
import re

MAX_CHUNKS = 300


def writeRow(idr, row):
    # newline='' is required when the csv module reads or writes text files
    with open("file_%d.csv" % idr, 'a', newline='') as f:
        writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_ALL)
        writer.writerow(row)

def cleanup():
    # remove output files left over from a previous run
    for f in os.listdir("."):
        if re.match(r"file_\d+\.csv$", f):
            os.remove(os.path.join(".", f))

def main():
    cleanup()
    with open("large_file.csv", newline='') as results:
        r = csv.reader(results, delimiter=',', quotechar='"')
        idr = 1
        for i, x in enumerate(r):
            # move to the next output file every MAX_CHUNKS rows
            if i and i % MAX_CHUNKS == 0:
                idr += 1
            writeRow(idr, x)

if __name__ == "__main__":
    main()

A very easy way, if you want to split it into 2 files for example:

with open("myInputFile.txt",'r') as file:
    lines = file.readlines()

with open("OutputFile1.txt",'w') as file:
    for line in lines[:int(len(lines)/2)]:
        file.write(line)

with open("OutputFile2.txt",'w') as file:
    for line in lines[int(len(lines)/2):]:
        file.write(line)

Making that dynamic would be:

with open("inputFile.txt",'r') as file:
    lines = file.readlines()

Batch = 10
end = 0
for i in range(1,Batch + 1):
    if i == 1:
        start = 0
    increase = int(len(lines)/Batch)
    end = end + increase
    with open("splitText_" + str(i) + ".txt",'w') as file:
        for line in lines[start:end]:
            file.write(line)
    
    start = end
Or, with a plain counter and explicit file handles:

with open('really_big_file.txt') as infile:
    file_line_limit = 300
    counter = -1
    file_index = 0
    outfile = None
    for line in infile:
        counter += 1
        if counter % file_line_limit == 0:
            # close old file
            if outfile is not None:
                outfile.close()
            # create new file
            file_index += 1
            outfile = open('small_file_%03d.txt' % file_index, 'w')
        # write to file
        outfile.write(line)
    # close the last file
    if outfile is not None:
        outfile.close()

I had to do the same with 650,000-line files.

Use the enumerate index and integer-divide it (//) by the chunk size.

When that number changes, close the current file and open a new one.

This is a Python 3 solution using f-strings (formatted string literals).

chunk = 50000  # number of lines from the big file to put in each small file
this_small_file = open('./a_folder/0', 'a')  # assumes ./a_folder already exists

with open('massive_web_log_file') as file_to_read:
    for i, line in enumerate(file_to_read):
        file_name = f'./a_folder/{i // chunk}'
        print(i, file_name)  # a bit of feedback that slows the process down

        if file_name == this_small_file.name:
            this_small_file.write(line)
        else:
            # the chunk boundary was crossed: switch files first,
            # then write the line into the new file
            this_small_file.close()
            this_small_file = open(file_name, 'a')
            this_small_file.write(line)

this_small_file.close()  # close the final chunk

Set files to the number of files you want to split the master file into; in my example I want to get 10 files from my master file.

files = 10
with open("data.txt", "r") as data:
    # assumes data.txt has at least `files` lines
    emails = data.readlines()
    batchs = len(emails) // files  # lines per mini file (integer division)
    for idx, log in enumerate(emails):
        fileid = idx // batchs
        # 'a+' appends, so each mini file grows as its lines come up
        with open("minifile{file}.txt".format(file=fileid + 1), 'a+') as f:
            f.write(log)
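Note that this reopens the output file for every single line, which works but is slow on large inputs; keeping the current handle open until fileid changes (as the earlier answers do) avoids that overhead.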

In Python, files are simple iterators. That gives the option to iterate over them multiple times and always continue from the last place the previous iterator got to. Keeping this in mind, we can use islice to get the next 300 lines of the file each time in a continuous loop. The tricky part is knowing when to stop. For this we will "sample" the file for the next line and, once it is exhausted, we can break the loop:

from itertools import islice

lines_per_file = 300
with open("really_big_file.txt") as file:
    i = 1
    while True:
        try:
            checker = next(file)  # sample one line to see if any are left
        except StopIteration:
            break  # the big file is exhausted
        with open(f"small_file_{i*lines_per_file}.txt", 'w') as out_file:
            out_file.write(checker)
            # islice continues from where the file iterator currently is
            for line in islice(file, lines_per_file-1):
                out_file.write(line)
        i += 1
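If you want to try this snippet end to end, here is a hypothetical smoke test that builds a small input first:

# Build a 1000-line test input, then run the splitter above on it.
# Expected outputs: small_file_300.txt, small_file_600.txt,
# small_file_900.txt and small_file_1200.txt (the last with 100 lines).
with open("really_big_file.txt", "w") as f:
    f.writelines(f"line {k}\n" for k in range(1, 1001))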
