
Python3 How to split a large text file into smaller files based on line content

I have a file containing data of the form

# FULL_ID BJD MAG UNCERT FLAG

and nearly 12,000 lines. The table holds data for 32 objects, each identified by a unique FULL_ID. For example, it might read:

# FULL_ID   BJD        MAG      UNCERT      FLAG
  2_543     3215.52    19.78    0.02937     OO
  2_543     3215.84    19.42    0.02231     OO
  3_522     3215.52    15.43    0.01122     OO
  3_522     3222.22    16.12    0.01223     OO

What I want is to run this file, BigData.dat, through some code and end up with multiple files, e.g. 2_543.dat, 3_522.dat, etc., each containing:

# BJD    MAG    UNCERT    FLAG

for all the lines of BigData.dat belonging to that FULL_ID.

Currently I am doing this:

with open(path, 'r') as BigFile:
    line = BigFile.readline()
    for line in BigFile:
        fields = line.split(None)
        id = fields[0]
        output = open(id+".dat", 'a')
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
    output.close()

This does produce the correct output files, but they are missing the header line: # BJD MAG UNCERT FLAG

How can I make sure this line is at the top of each file?

You are overwriting the header line inside the for loop; save it in a separate variable instead. In addition, you can keep track of whether the header has already been written to each file:

path = 'big.dat'
header_written = []

with open(path, 'r') as BigFile:
    header = BigFile.readline()  # keep header separately!
    for line in BigFile:
        fields = line.split(None)
        _id = fields[0]
        output = open(_id+".dat", 'a')
        if _id not in header_written:  # check and save the ID to keep track if header was written
            output.write(header)
            header_written.append(_id)
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()

Resulting file:

# FULL_ID   BJD        MAG      UNCERT      FLAG
3215.52 19.78 0.02937 OO
3215.84 19.42 0.02231 OO
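
Note that the header saved this way still contains the FULL_ID column, as the resulting file above shows, while the data lines do not. A minimal sketch of a fix, assuming the header line always starts with # FULL_ID, is to rebuild it from its fields, i.e. replace the header = BigFile.readline() line above with:

header = ' '.join(['#'] + BigFile.readline().split()[2:]) + '\n'  # keeps only '# BJD MAG UNCERT FLAG\n'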

Opening a file is an expensive operation, and doing it again for every input line is inefficient. Instead, I would map each FULL_ID seen so far to its open file object. If a FULL_ID is not yet present, the file has to be opened in "w" mode and the header written immediately. That way:

  1. the header is correctly written to the output files
  2. if the script is run several times, old values in the output files are correctly erased

The code could be:

with open(path) as bigFile:
    outfiles = {}         # mapping FULL_ID -> output file
    header = ' '.join(['#'] + next(bigFile).split()[2:])   # compute output header
    for line in bigFile:
        row = line.split()
        try:
            output = outfiles[row[0]]
        except KeyError:
            output = open(f'{row[0]}.dat', 'w')
            print(header, file=output)
            outfiles[row[0]] = output
        print(' '.join(row[1:]), file=output)
    for output in outfiles.values():               # close all files before exiting
        output.close()

The limitation is that you have to keep all the output files open until the end of the input file. That is fine for 32 objects, but would break for much larger numbers. The efficient way is then to replace the simple dict with a smarter cache that can close an open file when its capacity is exhausted and re-open it (in append mode) when it is needed again.


Here is a possible cache implementation:

class FileCache:
    """Caches a number of open files referenced by string Ids.
    (by default the id is the name)"""
    def __init__(self, size, namemapping=None, header=None):
        """Creates a new cache of size size.
        namemapping is a function that gives the filename from an ID
        header is an optional header that will be written at creation
        time
        """
        self.size = size
        self.namemapping = namemapping if namemapping is not None \
            else lambda x: x
        self.header = header
        self.map = {}             # dict id -> slot number
        self.slots = [(None, None)] * size   # list of pairs (id, file object)
        self.curslot = 0          # next slot to be used

    def getFile(self, id):
        """Gets an open file from the cache.
        Directly gets it if it is already present, eventually reopen
        it in append mode. Adds it to the cache if absent and open it
        in truncate mode."""
        try:
            slot = self.map[id]
            if slot != -1:
                return self.slots[slot][1]   # found and active
            mode = 'a'                       # need re-opening
        except KeyError:
            mode = 'w'                       # new id: create file
        slot = self.curslot
        self.curslot = (slot + 1) % self.size
        if self.slots[slot][0] is not None:  # close the slot's previous file, if any
            self.slots[slot][1].close()
            self.map[self.slots[slot][0]] = -1
        fd = open(self.namemapping(id), mode)
        # if file is new, write the optional header
        if (mode == 'w') and self.header is not None:
            print(self.header, file=fd)
        self.slots[slot] = (id, fd)
        self.map[id] = slot
        return fd

    def close(self):
        """Closes all cached files."""
        for id_, fd in self.slots:
            if fd is not None:               # skip slots that were never filled
                fd.close()
                self.map[id_] = -1
        self.slots = [(None, None)] * self.size

The code above would then become:

with open(path) as bigFile:
    header = ' '.join(['#'] + next(bigFile).split()[2:])   # compute output header
    outfiles = FileCache(10, lambda x: x+'.dat', header) # cache FULL_ID -> file
    for line in bigFile:
        row = line.split()
        output = outfiles.getFile(row[0])
        print(' '.join(row[1:]), file=output)
    outfiles.close()               # close all files before exiting
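
For completeness, here is a minimal end-to-end sketch showing the cache in use. The three data rows and the cache size of 4 are made up for the demo; the layout matches the question:

path = 'BigData.dat'

# Create a tiny input file with the layout from the question (demo data).
with open(path, 'w') as f:
    f.write('# FULL_ID BJD MAG UNCERT FLAG\n')
    f.write('2_543 3215.52 19.78 0.02937 OO\n')
    f.write('2_543 3215.84 19.42 0.02231 OO\n')
    f.write('3_522 3215.52 15.43 0.01122 OO\n')

with open(path) as bigFile:
    header = ' '.join(['#'] + next(bigFile).split()[2:])
    outfiles = FileCache(4, lambda x: x + '.dat', header)  # demo size, far below 32
    for line in bigFile:
        row = line.split()
        print(' '.join(row[1:]), file=outfiles.getFile(row[0]))
    outfiles.close()

print(open('2_543.dat').read())  # header line followed by the two 2_543 rows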
