Python3: How to split a large text file into smaller files based on line content
I have a file containing data in the format
# FULL_ID BJD MAG UNCERT FLAG
with nearly 12,000 lines. The table holds data for 32 objects, each identified by a unique FULL_ID. For example, it might read
# FULL_ID BJD MAG UNCERT FLAG
2_543 3215.52 19.78 0.02937 OO
2_543 3215.84 19.42 0.02231 OO
3_522 3215.52 15.43 0.01122 OO
3_522 3222.22 16.12 0.01223 OO
What I want is to run this file, BigData.dat, through some code and end up with several files, e.g. 2_543.dat, 3_522.dat, etc., each containing:
# BJD MAG UNCERT FLAG
followed by all the lines of BigData.dat that belong to that FULL_ID.
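For the sample rows above, 2_543.dat would then contain:
# BJD MAG UNCERT FLAG
3215.52 19.78 0.02937 OO
3215.84 19.42 0.02231 OO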
Currently I am doing this:
with open(path, 'r') as BigFile:
    line = BigFile.readline()  # skip the header line
    for line in BigFile:
        fields = line.split(None)
        id = fields[0]
        output = open(id+".dat", 'a')
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()
This does produce the correct output, but the files are missing the header line: # BJD MAG UNCERT FLAG
How do I make sure this line is at the top of each file?
You are overwriting the header line inside the for loop; keep it in a separate variable instead. Additionally, you can remember whether the header has already been written to a given file:
path = 'big.dat'
header_written = []

with open(path, 'r') as BigFile:
    header = BigFile.readline()  # keep header separately!
    for line in BigFile:
        fields = line.split(None)
        _id = fields[0]
        output = open(_id+".dat", 'a')
        if _id not in header_written:  # check and save the ID to keep track if header was written
            output.write(header)
            header_written.append(_id)
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()
The resulting file then looks like:
# FULL_ID BJD MAG UNCERT FLAG
3215.52 19.78 0.02937 OO
3215.84 19.42 0.02231 OO
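Note that written this way the header still carries the FULL_ID column name, while the data rows have that column stripped. If you want the header to match the trimmed rows, a minimal tweak (assuming the header format shown in the question) is to rebuild it instead of keeping the raw line, i.e. instead of header = BigFile.readline() use:

    header = '# ' + ' '.join(BigFile.readline().split()[2:]) + '\n'  # keeps only '# BJD MAG UNCERT FLAG'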
Opening a file is an expensive operation, and repeating it for every input line is not efficient. Instead, I would map each FULL_ID value already seen to its open file object. If a FULL_ID is not present yet, the file has to be opened in "w" mode and the header should be written immediately. The code could be:
with open(path) as bigFile:
    outfiles = {}  # mapping FULL_ID -> output file
    header = ' '.join(['#'] + next(bigFile).split()[2:])  # compute output header
    for line in bigFile:
        row = line.split()
        try:
            output = outfiles[row[0]]
        except KeyError:
            output = open(f'{row[0]}.dat', 'w')
            print(header, file=output)
            outfiles[row[0]] = output
        print(' '.join(row[1:]), file=output)

for output in outfiles.values():  # close all files before exiting
    output.close()
The limitation is that you must keep all the output files open until the end of the input file. That works for 32 objects, but it would break for much larger numbers, because operating systems cap the number of file descriptors a process may hold open at once. The efficient way is to replace the plain dict with a smarter cache that can close an open file when its capacity is exhausted and re-open it (in append mode) when it is needed again.
Here is a possible cache implementation:
class FileCache:
    """Caches a number of open files referenced by string ids.
    (by default the id is the file name)"""

    def __init__(self, size, namemapping=None, header=None):
        """Creates a new cache of size size.

        namemapping is a function that gives the filename from an id;
        header is an optional header that will be written when a file
        is created.
        """
        self.size = size
        self.namemapping = namemapping if namemapping is not None \
            else lambda x: x
        self.header = header
        self.map = {}                       # dict id -> slot number
        self.slots = [(None, None)] * size  # list of pairs (id, file object)
        self.curslot = 0                    # next slot to be used

    def getFile(self, id):
        """Gets an open file from the cache.

        Returns it directly if it is already present, or re-opens it
        in append mode if it was evicted earlier. Adds it to the cache
        if absent, opening it in truncate ('w') mode."""
        try:
            slot = self.map[id]
            if slot != -1:
                return self.slots[slot][1]  # found and still open
            mode = 'a'                      # known id: needs re-opening
        except KeyError:
            mode = 'w'                      # new id: create the file
        slot = self.curslot
        self.curslot = (slot + 1) % self.size
        if self.slots[slot][0] is not None:  # close the slot's previous file, if any
            self.slots[slot][1].close()
            self.map[self.slots[slot][0]] = -1
        fd = open(self.namemapping(id), mode)
        # if the file is new, write the optional header
        if (mode == 'w') and self.header is not None:
            print(self.header, file=fd)
        self.slots[slot] = (id, fd)
        self.map[id] = slot
        return fd

    def close(self):
        """Closes all cached files."""
        for slot_id, fd in self.slots:
            if fd is not None:  # skip slots that were never used
                fd.close()
                self.map[slot_id] = -1
        self.slots = [(None, None)] * self.size
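Note that eviction here is round-robin over the slots: curslot simply cycles, so the file opened (or re-opened) longest ago is closed first, rather than a true least-recently-used policy; for this task that simpler policy is enough.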
The code above would then become:
with open(path) as bigFile:
    header = ' '.join(['#'] + next(bigFile).split()[2:])    # compute output header
    outfiles = FileCache(10, lambda x: x + '.dat', header)  # cache FULL_ID -> file
    for line in bigFile:
        row = line.split()
        output = outfiles.getFile(row[0])
        print(' '.join(row[1:]), file=output)

outfiles.close()  # close all files before exiting
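Since the cache size here (10) is smaller than the 32 objects, files will be evicted and re-opened along the way; because re-opening uses append mode and the header is only written on creation, every output file still ends up with exactly one header. As a quick sanity check, here is a minimal self-contained sketch that uses the sample rows from the question as input; the file name BigData.dat and the cache size of 4 are arbitrary choices for this demonstration:

path = 'BigData.dat'

with open(path, 'w') as f:  # recreate the sample input from the question
    f.write('# FULL_ID BJD MAG UNCERT FLAG\n')
    f.write('2_543 3215.52 19.78 0.02937 OO\n')
    f.write('2_543 3215.84 19.42 0.02231 OO\n')
    f.write('3_522 3215.52 15.43 0.01122 OO\n')
    f.write('3_522 3222.22 16.12 0.01223 OO\n')

with open(path) as bigFile:
    header = ' '.join(['#'] + next(bigFile).split()[2:])
    outfiles = FileCache(4, lambda x: x + '.dat', header)  # size 4 is arbitrary here
    for line in bigFile:
        row = line.split()
        print(' '.join(row[1:]), file=outfiles.getFile(row[0]))

outfiles.close()

print(open('2_543.dat').read())
# # BJD MAG UNCERT FLAG
# 3215.52 19.78 0.02937 OO
# 3215.84 19.42 0.02231 OO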