
Split a txt file into two files using first column value in Python

I would like to split an INPUT.txt file into two .txt files (header and data) by the value of the first column. Lines before "H1000" should be saved to header.txt, and lines from "H1000" onward should be saved to data.txt.

INPUT.txt

H0002   Version 78                                                                                                                      
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-81                                                                                                                       
H1000   State   WAAAA                                                                                                                       
H1002   Teno/Combno Z70/4000                                                                                                                        
H1003   Tener   Magn Reso NL    
H1004   LLD                                                                                     
D   AC056SCO1   NRM 11  12  6483516 25.98   0.4 1.35    0.25    0.51    0.01    0.06    0.1 56.23   2.29

With the output files being:

header.txt

H0002   Version 78                                                                                                                      
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-81

data.txt

H1000   State   WAAAA                                                                                                                       
H1002   Teno/Combno Z70/4000                                                                                                                        
H1003   Tener   Magn Reso NL    
H1004   LLD                                                                                     
D   AC056SCO1   NRM 11  12  6483516 25.98   0.4 1.35    0.25    0.51    0.01    0.06    0.1 56.23   2.29

A couple of problems I am facing:

  1. The position of "H1000" varies between files. In the second input file (see InputFile2 below), "H1000" is on a different line, so my Python code first tries to find the position of H1000.

  2. I am using the position of H1000 to separate the header and data files, but the logic does not split the files correctly.

My Python code:

if path_txt.is_file():
        txt_files = [Path(path_txt)] 
    else:
        txt_files = list(Path(path_txt).glob("*.txt"))
    
    for fn in txt_files:
       with open(fn) as fd_read:
            for line in fd_read:
               h_value = line.split(maxsplit=1)[0]
               value = int(h_value[1:]) #Finding the position of H1000
                   
            splitLen = 5  # Position of H1000
            HeaderBase = 'Header.txt'  # Header.txt
            DataBase = 'Data.txt'  # Data.txt

            with open(fn, 'r') as fp:
                input_list = fp.readlines()
                # to skip empties: input_list = [l for l in fp if l.strip()]

            for i in range(0, len(input_list), splitLen):
                with open(HeaderBase, 'w') as fp:
                    fp.write(''.join(input_list[0:(i-1)])) #Header.txt
                with open(DataBase, 'w') as fp:
                    fp.write(''.join(input_list[i:]))   #Data.txt  

None of my logic is working. I am stuck on how to make this work, so any help would be appreciated.

InputFile2

H0002   Version 9                                                                                                                       
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-99                                                                                                                       
H0005   State   WAAAAA                                                                                                                      
H1000   Tene_no/Combined_rept_no    E79/38975                                                                                                                       
H1001   Tene_holder Magne Resources NL  
D   abc3SCO1    NORM    26  27  9483531 4.15    0.05    0.65    0.02    0.15    0   0.04    0.09    87.51   0.29

Python code and txt file attached here

Your code suffers from numerous issues:

  1. You never actually find the position of H1000. The first loop parses a number out of each line into value, but never compares it to 1000 or records a line index.
  2. You set the split length to 5, disregarding the position of H1000.
  3. I don't understand your range() call. Are you stepping from the start to the end in jumps of 5 lines?
  4. For every step i, you write everything from the start of the document up to i to header.txt and the rest to data.txt. That means you are writing the entire document multiple times.
  5. You convert path_txt to a Path object, but then use it as if it were a plain string.
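
Points 1 and 2 come down to locating the token line instead of hard-coding 5. A minimal sketch of that step, assuming the token is always the first whitespace-separated field of its line; find_split_index is a hypothetical helper name, and the sample lines are abbreviated from INPUT.txt:

```python
def find_split_index(lines, token="H1000"):
    """Return the index of the first line whose first column equals token.

    Returns len(lines) when the token is absent, so every line counts as header.
    """
    for index, line in enumerate(lines):
        fields = line.split(maxsplit=1)
        # Blank lines produce an empty list, so guard before indexing.
        if fields and fields[0] == token:
            return index
    return len(lines)


sample = [
    "H0002   Version 78\n",
    "H0003   Date_generated  5-Aug-81\n",
    "H0004   Reporting_period_end_date   09-Jun-81\n",
    "H1000   State   WAAAA\n",
    "H1002   Teno/Combno Z70/4000\n",
]
split_at = find_split_index(sample)
header_lines = sample[:split_at]   # lines before H1000
data_lines = sample[split_at:]     # the H1000 line and everything after it
```

Slicing at the returned index keeps the H1000 line in the data part, which matches the expected output in the question.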

I couldn't figure out what should happen when a directory is passed, since I don't think you want all the headers written to one file and all the data to another.

Fixed code for a single file:

SPLIT_TOKEN = "H1000"

def split_file(path, header_path="header.txt", data_path="data.txt"):
    """Split a file into a header and a data file upon encountering a token."""
    header = []
    data = []
    with open(path, "r") as f:
        for line in f:
            if line.startswith(SPLIT_TOKEN):
                data.append(line)  # The token line is the first data line
                data.extend(f)     # Everything after it is data as well
                break
            header.append(line)
        # If the token never appears, data stays empty and all lines are header.

    with open(header_path, "w") as f:
        f.writelines(header)
    with open(data_path, "w") as f:
        f.writelines(data)
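
For the directory case, one possible convention is to write a separate pair of output files per input, named after the input's stem. This is an assumption about the desired behavior, not something stated in the question; split_one, split_path, out_dir and the _header.txt / _data.txt suffixes are hypothetical names chosen for illustration:

```python
from pathlib import Path

SPLIT_TOKEN = "H1000"


def split_one(src, header_path, data_path, token=SPLIT_TOKEN):
    """Write lines before the token line to header_path, the rest to data_path."""
    with open(src, "r") as f_in, \
            open(header_path, "w") as f_header, \
            open(data_path, "w") as f_data:
        target = f_header
        for line in f_in:
            if target is f_header and line.startswith(token):
                target = f_data  # switch once, at the first token line
            target.write(line)


def split_path(path_txt, out_dir="split_output"):
    """Split a single file, or every *.txt file directly inside a directory."""
    path_txt = Path(path_txt)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    if path_txt.is_file():
        txt_files = [path_txt]
    else:
        # sorted() materializes the list before any output file is written.
        txt_files = sorted(path_txt.glob("*.txt"))
    for src in txt_files:
        split_one(
            src,
            out_dir / f"{src.stem}_header.txt",
            out_dir / f"{src.stem}_data.txt",
        )
```

Writing into a separate output directory keeps the generated files from being picked up as inputs on a later run, since glob("*.txt") here does not descend into subdirectories.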
