Split a txt file into two files using first column value in Python
I would like to split an INPUT.txt file into two .txt files (header and data) based on the value of the first column. Lines before "H1000" should be saved in header.txt, and lines from "H1000" onward should be saved in data.txt.
INPUT.txt
H0002 Version 78
H0003 Date_generated 5-Aug-81
H0004 Reporting_period_end_date 09-Jun-81
H1000 State WAAAA
H1002 Teno/Combno Z70/4000
H1003 Tener Magn Reso NL
H1004 LLD
D AC056SCO1 NRM 11 12 6483516 25.98 0.4 1.35 0.25 0.51 0.01 0.06 0.1 56.23 2.29
With the output files being:
header.txt
H0002 Version 78
H0003 Date_generated 5-Aug-81
H0004 Reporting_period_end_date 09-Jun-81
data.txt
H1000 State WAAAA
H1002 Teno/Combno Z70/4000
H1003 Tener Magn Reso NL
H1004 LLD
D AC056SCO1 NRM 11 12 6483516 25.98 0.4 1.35 0.25 0.51 0.01 0.06 0.1 56.23 2.29
A couple of problems that I am facing:
The position of "H1000" is dynamic across different txt files. In another input file (see Input File 2 below), "H1000" is at a different position. So my Python code first finds the position of H1000.
I am using the position of H1000 to separate the header and data files, but the logic does not separate the files correctly.
My Python code:
if path_txt.is_file():
    txt_files = [Path(path_txt)]
else:
    txt_files = list(Path(path_txt).glob("*.txt"))
for fn in txt_files:
    with open(fn) as fd_read:
        for line in fd_read:
            h_value = line.split(maxsplit=1)[0]
            value = int(h_value[1:]) #Finding the position of H1000
    splitLen = 5 # Position of H1000
    HeaderBase = 'Header.txt' # Header.txt
    DataBase = 'Data.txt' # Data.txt
    with open(fn, 'r') as fp:
        input_list = fp.readlines()
        # to skip empties: input_list = [l for l in fp if l.strip()]
    for i in range(0, len(input_list), splitLen):
        with open(HeaderBase, 'w') as fp:
            fp.write(''.join(input_list[0:(i-1)])) #Header.txt
        with open(DataBase, 'w') as fp:
            fp.write(''.join(input_list[i:])) #Data.txt
None of my logic is working. Any help would be appreciated, as I am stuck on how to make this logic work.
Input File 2
H0002 Version 9
H0003 Date_generated 5-Aug-81
H0004 Reporting_period_end_date 09-Jun-99
H0005 State WAAAAA
H1000 Tene_no/Combined_rept_no E79/38975
H1001 Tene_holder Magne Resources NL
D abc3SCO1 NORM 26 27 9483531 4.15 0.05 0.65 0.02 0.15 0 0.04 0.09 87.51 0.29
Your code suffers from numerous issues:
- You never actually find the position of H1000. I don't see it written in the code.
- You hard-code the split length to 5, disregarding the position of H1000.
- You misuse the range() function. You're hopping from start to end in 5-line jumps?
- For every i, you write everything from the start of the document up to i to header.txt and the rest to data.txt. That means you're writing the entire document multiple times.
- You convert path_txt to a Path object, but then use it like a regular string.
I couldn't figure out what to do in case a directory is passed, since I believe having all headers in the same file and all data in the same file is not what you want.
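To see concretely what the question's loop does, the following sketch replays its range() stepping and slicing on a stand-in list of 8 lines (the length of INPUT.txt). The function name simulate_slices is hypothetical and only serves this illustration; it writes no files.

```python
def simulate_slices(n_lines, split_len=5):
    """Replay the question's loop and report (i, header_len, data_len) per iteration."""
    lines = [f"line{n}\n" for n in range(n_lines)]  # stand-in for fp.readlines()
    results = []
    for i in range(0, n_lines, split_len):
        header_part = lines[0:(i - 1)]  # same slice as the question's Header.txt write
        data_part = lines[i:]           # same slice as the question's Data.txt write
        results.append((i, len(header_part), len(data_part)))
    return results


print(simulate_slices(8))  # [(0, 7, 8), (5, 4, 3)]
```

On the last iteration (i = 5) the header slice holds 4 lines and the data slice 3, so the line at index 4 ends up in neither file. Because both files are reopened with 'w' on every iteration, only that last iteration survives on disk.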
Fixed code for a single file:
SPLIT_TOKEN = "H1000"


def split_file(path, header_path="header.txt", data_path="data.txt"):
    """Split a file into a header file and a data file upon encountering a token."""
    header = []
    data = []
    with open(path, "r") as f:
        for line in f:
            if line.startswith(SPLIT_TOKEN):
                data.append(line)  # The token line is the first data line
                break
            header.append(line)
        data.extend(f)  # The rest of the file follows the token line
    with open(header_path, "w") as f:
        f.writelines(header)
    with open(data_path, "w") as f:
        f.writelines(data)
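For the directory case, one possible approach is to write a separate header/data pair per input file. The sketch below is only an assumption about what is wanted: the names split_lines and split_path, the out_dir parameter, and the <stem>_header.txt / <stem>_data.txt naming scheme are all hypothetical. It also compares the whole first column against the token rather than using startswith.

```python
from pathlib import Path

SPLIT_TOKEN = "H1000"  # first-column value that starts the data section


def split_lines(lines, token=SPLIT_TOKEN):
    """Return (header, data); data starts at the first line whose first column equals token."""
    for index, line in enumerate(lines):
        fields = line.split(maxsplit=1)
        if fields and fields[0] == token:
            return lines[:index], lines[index:]
    return lines, []  # token not found: everything is treated as header


def split_path(path_txt, out_dir="."):
    """Split one file, or every *.txt file in a directory, into per-input output files."""
    path_txt = Path(path_txt)
    out_dir = Path(out_dir)
    txt_files = [path_txt] if path_txt.is_file() else sorted(path_txt.glob("*.txt"))
    for fn in txt_files:
        with open(fn) as f:
            header, data = split_lines(f.readlines())
        # Hypothetical naming scheme: <input stem>_header.txt and <input stem>_data.txt
        (out_dir / f"{fn.stem}_header.txt").write_text("".join(header))
        (out_dir / f"{fn.stem}_data.txt").write_text("".join(data))
```

Calling split_path("inputs", "outputs") would then process every .txt file in a directory named inputs, assuming outputs already exists. Pointing out_dir at a different directory than the inputs avoids picking up the generated files on a later run.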