Python3 How to split a large text file into smaller files based on line content
I have a file with the data

# FULL_ID BJD MAG UNCERT FLAG

and nearly 12,000 rows. This table contains data for 32 objects, each identified by a unique FULL_ID. So for instance it may say
# FULL_ID BJD MAG UNCERT FLAG
2_543 3215.52 19.78 0.02937 OO
2_543 3215.84 19.42 0.02231 OO
3_522 3215.52 15.43 0.01122 OO
3_522 3222.22 16.12 0.01223 OO
What I want is to run this file BigData.dat through the code, and end up with multiple files, e.g. 2_543.dat, 3_522.dat etc., each containing:

# BJD MAG UNCERT FLAG

for all rows of BigData.dat that belonged to that FULL_ID.
Currently I'm doing this:
with open(path, 'r') as BigFile:
    line = BigFile.readline()
    for line in BigFile:
        fields = line.split(None)
        id = fields[0]
        output = open(id+".dat", 'a')
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()
which does produce the correct outputs, but they don't have the header line:

# BJD MAG UNCERT FLAG

How can I ensure this line is at the top of each file?
You are overwriting the header line in the for loop; keep it in a separate variable. Additionally, you could remember whether the header was already written to a given file:
path = 'big.dat'
header_written = []

with open(path, 'r') as BigFile:
    header = BigFile.readline()  # keep the header separately!
    for line in BigFile:
        fields = line.split(None)
        _id = fields[0]
        output = open(_id + ".dat", 'a')
        if _id not in header_written:  # track whether the header was already written for this ID
            output.write(header)
            header_written.append(_id)
        writeline = str(fields[1])+' '+str(fields[2])+' '+str(fields[3])+' '+str(fields[4])+'\n'
        output.write(writeline)
        output.close()
File:
# FULL_ID BJD MAG UNCERT FLAG
3215.52 19.78 0.02937 OO
3215.84 19.42 0.02231 OO
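Note that this snippet writes the original header verbatim, so each per-object file still names the FULL_ID column even though that column is absent from its data rows. A minimal tweak (assuming the whitespace-separated header shown in the question) rebuilds the header without it:

```python
# Rebuild the header without the FULL_ID column, assuming the
# whitespace-separated layout from the question.
header = "# FULL_ID BJD MAG UNCERT FLAG\n"   # as read by BigFile.readline()
fields = header.split()                      # ['#', 'FULL_ID', 'BJD', ...]
header = ' '.join(['#'] + fields[2:]) + '\n'
print(header, end='')                        # -> "# BJD MAG UNCERT FLAG"
```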
Opening a file is an expensive operation, and repeatedly doing so for each input line is not efficient. I would instead keep a mapping of seen FULL_ID values to file objects. If a FULL_ID is not yet present, the file has to be opened in "w" mode and the header immediately added. Code could be:
with open(path) as bigFile:
    outfiles = {}  # mapping FULL_ID -> output file
    header = ' '.join(['#'] + next(bigFile).split()[2:])  # compute output header
    for line in bigFile:
        row = line.split()
        try:
            output = outfiles[row[0]]
        except KeyError:
            output = open(f'{row[0]}.dat', 'w')
            print(header, file=output)
            outfiles[row[0]] = output
        print(' '.join(row[1:]), file=output)
    for output in outfiles.values():  # close all files before exiting
        output.close()
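As a quick sanity check, the dict-based approach above can be exercised end to end in a temporary directory (the sample rows below are made up to mirror the question's format):

```python
import os
import tempfile

# Hypothetical miniature BigData.dat mirroring the question's layout.
sample = """\
# FULL_ID BJD MAG UNCERT FLAG
2_543 3215.52 19.78 0.02937 OO
2_543 3215.84 19.42 0.02231 OO
3_522 3215.52 15.43 0.01122 OO
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'BigData.dat')
    with open(path, 'w') as f:
        f.write(sample)

    with open(path) as bigFile:
        outfiles = {}                                         # FULL_ID -> open file
        header = ' '.join(['#'] + next(bigFile).split()[2:])  # drop FULL_ID column
        for line in bigFile:
            row = line.split()
            try:
                output = outfiles[row[0]]
            except KeyError:                                  # first row for this id
                output = open(os.path.join(tmp, row[0] + '.dat'), 'w')
                print(header, file=output)
                outfiles[row[0]] = output
            print(' '.join(row[1:]), file=output)
        for output in outfiles.values():
            output.close()

    created = sorted(f for f in os.listdir(tmp) if f != 'BigData.dat')
    with open(os.path.join(tmp, '2_543.dat')) as f:
        content = f.read()

print(created)   # one output file per FULL_ID
print(content)   # header plus the two 2_543 rows
```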
The limit is that you have to keep all files open until the end of the input file. It should work for 32 objects, but would break for larger numbers. The efficient way would be to change the simple dict into a more sophisticated cache, able to close an open file when capacity is exhausted and to reopen it (in append mode) if needed.
Here is a possible cache implementation:
class FileCache:
    """Caches a number of open files referenced by string ids
    (by default the id is the name)."""

    def __init__(self, size, namemapping=None, header=None):
        """Creates a new cache of size size.

        namemapping is a function that gives the filename from an id;
        header is an optional header that will be written at creation time.
        """
        self.size = size
        self.namemapping = namemapping if namemapping is not None \
            else lambda x: x
        self.header = header
        self.map = {}                       # dict id -> slot number (-1 if inactive)
        self.slots = [(None, None)] * size  # list of pairs (id, file object)
        self.curslot = 0                    # next slot to be used

    def getFile(self, id):
        """Gets an open file from the cache.

        Directly returns it if it is already present, eventually reopening
        it in append mode. Adds it to the cache if absent and opens it
        in truncate mode."""
        try:
            slot = self.map[id]
            if slot != -1:
                return self.slots[slot][1]   # found and active
            mode = 'a'                       # known id: needs re-opening
        except KeyError:
            mode = 'w'                       # new id: create the file
        slot = self.curslot
        self.curslot = (slot + 1) % self.size
        if self.slots[slot][0] is not None:  # eventually close the previous occupant
            self.slots[slot][1].close()
            self.map[self.slots[slot][0]] = -1
        fd = open(self.namemapping(id), mode)
        if mode == 'w' and self.header is not None:  # if the file is new, write the optional header
            print(self.header, file=fd)
        self.slots[slot] = (id, fd)
        self.map[id] = slot
        return fd

    def close(self):
        """Closes any cached file."""
        for id_, fd in self.slots:
            if fd is not None:
                fd.close()
                self.map[id_] = -1
        self.slots = [(None, None)] * self.size
The above code would become:
with open(path) as bigFile:
    header = ' '.join(['#'] + next(bigFile).split()[2:])    # compute output header
    outfiles = FileCache(10, lambda x: x + '.dat', header)  # cache FULL_ID -> file
    for line in bigFile:
        row = line.split()
        output = outfiles.getFile(row[0])
        print(' '.join(row[1:]), file=output)
    outfiles.close()  # close all files before exiting