
Python3 How to split a large text file into smaller files based on line content

I have a file with the data

# FULL_ID BJD MAG UNCERT FLAG

and nearly 12,000 rows. This table contains data for 32 objects, each identified by a unique FULL_ID. So for instance it may say

# FULL_ID   BJD        MAG      UNCERT      FLAG
  2_543     3215.52    19.78    0.02937     OO
  2_543     3215.84    19.42    0.02231     OO
  3_522     3215.52    15.43    0.01122     OO
  3_522     3222.22    16.12    0.01223     OO

What I want is to run this file BigData.dat through the code and end up with multiple files, e.g. 2_543.dat, 3_522.dat etc., each containing:

# BJD    MAG    UNCERT    FLAG

for all rows of BigData.dat that belong to that FULL_ID.

Currently I'm doing this:

with open(path, 'r') as BigFile:
    line = BigFile.readline()
    for line in BigFile:
        fields = line.split(None)
        id = fields[0]
        output = open(id+".dat", 'a')
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
    output.close()

which does produce the correct outputs, but they don't have the header line: # BJD MAG UNCERT FLAG

How can I ensure this line is at the top of each file?

You are overwriting the header line in the for loop; keep it in a separate variable. Additionally, you could remember whether the header was already written to a given file:

path = 'big.dat'
header_written = []

with open(path, 'r') as BigFile:
    header = BigFile.readline()  # keep header separately!
    for line in BigFile:
        fields = line.split(None)
        _id = fields[0]
        output = open(_id+".dat", 'a')
        if _id not in header_written:  # check and save the ID to keep track if header was written
            output.write(header)
            header_written.append(_id)
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()

Resulting file:

# FULL_ID   BJD        MAG      UNCERT      FLAG
3215.52 19.78 0.02937 OO
3215.84 19.42 0.02231 OO
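
Note that with this version the header written to each output file is still the full input header, including the FULL_ID column, rather than the # BJD MAG UNCERT FLAG header asked for. A minimal adjustment, assuming the column layout shown in the question (the short_header name is just for illustration), is to trim the header before writing it:

# Build the reduced header once, before the for loop
header = BigFile.readline()
short_header = '# ' + ' '.join(header.split()[2:]) + '\n'   # -> "# BJD MAG UNCERT FLAG"
# ...then write short_header instead of header when a new FULL_ID is first seen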

Opening a file is an expensive operation, and repeatedly doing so for each input line is not efficient. I would instead keep a mapping of seen FULL_ID values to file objects. If a FULL_ID is not present, the file has to be opened in "w" mode and the header written immediately. This way:

  1. the header is correctly written to the output files
  2. if the script is run more than once, the old values in the output files are correctly erased

Code could be:

with open(path) as bigFile:
    outfiles = {}         # mapping FULL_ID -> output file
    header = ' '.join(['#'] + next(bigFile).split()[2:])   # compute output header
    for line in bigFile:
        row = line.split()
        try:
            output = outfiles[row[0]]
        except KeyError:
            output = open(f'{row[0]}.dat', 'w')
            print(header, file=output)
            outfiles[row[0]] = output
        print(' '.join(row[1:]), file=output)
    for output in outfiles.values():               # close all files before exiting
        output.close()
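
With the sample rows from the question, the resulting 2_543.dat would then contain:

# BJD MAG UNCERT FLAG
3215.52 19.78 0.02937 OO
3215.84 19.42 0.02231 OO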

The limitation is that you have to keep all the files open until the end of the input file. That should work for 32 objects, but it would break for larger numbers. The efficient way would be to change the simple dict into a more sophisticated cache, able to close an open file when capacity is exhausted and reopen it (in append mode) when needed.


Here is a possible cache implementation:

class FileCache:
    """Caches a number of open files referenced by string Ids.
    (by default the id is the name)"""
    def __init__(self, size, namemapping=None, header=None):
        """Creates a new cache of size size.
        namemapping is a function that gives the filename from an ID
        header is an optional header that will be written at creation
        time
        """
        self.size = size
        self.namemapping = namemapping if namemapping is not None \
            else lambda x: x
        self.header = header
        self.map = {}             # dict id -> slot number
        self.slots = [(None, None)] * size   # list of pairs (id, file object)
        self.curslot = 0          # next slot to be used

    def getFile(self, id):
        """Gets an open file from the cache.
        Returns it directly if it is already open, reopening it in
        append mode if it had been evicted. If the id is new, the file
        is created (truncate mode) and added to the cache."""
        try:
            slot = self.map[id]
            if slot != -1:
                return self.slots[slot][1]   # found and active
            mode = 'a'                       # need re-opening
        except KeyError:
            mode = 'w'                       # new id: create file
        slot = self.curslot
        self.curslot = (slot + 1) % self.size
        if self.slots[slot][0] is not None:  # close the file currently occupying this slot, if any
            self.slots[slot][1].close()
            self.map[self.slots[slot][0]] = -1
        fd = open(self.namemapping(id), mode)
        # if file is new, write the optional header
        if (mode == 'w') and self.header is not None:
            print(self.header, file=fd)
        self.slots[slot] = (id, fd)
        self.map[id] = slot
        return fd

    def close(self):
        """Closes any cached file."""
        for id_, fd in self.slots:
            if fd is not None:           # skip slots that were never used
                fd.close()
                self.map[id_] = -1
        self.slots = [(None, None)] * self.size

The code above would then become:

with open(path) as bigFile:
    header = ' '.join(['#'] + next(bigFile).split()[2:])   # compute output header
    outfiles = FileCache(10, lambda x: x+'.dat', header) # cache FULL_ID -> file
    for line in bigFile:
        row = line.split()
        output = outfiles.getFile(row[0])
        print(' '.join(row[1:]), file=output)
    outfiles.close()               # close all files before exiting
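
As a quick sanity check of the eviction behaviour (a sketch only, using one hypothetical ID and a deliberately tiny cache of size 2):

# Hypothetical demo: the third ID evicts the first; the last call reopens 2_543.dat in append mode
cache = FileCache(2, lambda x: x + '.dat', header='# BJD MAG UNCERT FLAG')
for _id in ['2_543', '3_522', '9_999', '2_543']:   # 9_999 is a made-up ID
    out = cache.getFile(_id)
    print('dummy row for', _id, file=out)
cache.close()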
