Split a txt file into two files using first column value in Python

I would like to split an INPUT.txt file into two .txt files (header and data) by the value of the first column. Lines before "H1000" should be saved in header.txt, and the "H1000" line and everything after it should be saved in data.txt.

INPUT.txt

H0002   Version 78                                                                                                                      
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-81                                                                                                                       
H1000   State   WAAAA                                                                                                                       
H1002   Teno/Combno Z70/4000                                                                                                                        
H1003   Tener   Magn Reso NL    
H1004   LLD                                                                                     
D   AC056SCO1   NRM 11  12  6483516 25.98   0.4 1.35    0.25    0.51    0.01    0.06    0.1 56.23   2.29

With the output files being:

header.txt

H0002   Version 78                                                                                                                      
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-81

data.txt

H1000   State   WAAAA                                                                                                                       
H1002   Teno/Combno Z70/4000                                                                                                                        
H1003   Tener   Magn Reso NL    
H1004   LLD                                                                                     
D   AC056SCO1   NRM 11  12  6483516 25.98   0.4 1.35    0.25    0.51    0.01    0.06    0.1 56.23   2.29

A couple of problems I am facing:

  1. The position of "H1000" varies between txt files. In the other input file (see InputFile2), "H1000" is on a different line. So my Python code first finds the position of H1000.

  2. I am using the position of H1000 to separate the header and data files, but the logic does not split the files correctly.

My Python code:

    if path_txt.is_file():
        txt_files = [Path(path_txt)] 
    else:
        txt_files = list(Path(path_txt).glob("*.txt"))
    
    for fn in txt_files:
       with open(fn) as fd_read:
            for line in fd_read:
               h_value = line.split(maxsplit=1)[0]
               value = int(h_value[1:]) #Finding the position of H1000
                   
            splitLen = 5  # Position of H1000
            HeaderBase = 'Header.txt'  # Header.txt
            DataBase = 'Data.txt'  # Data.txt

            with open(fn, 'r') as fp:
                input_list = fp.readlines()
                # to skip empties: input_list = [l for l in fp if l.strip()]

            for i in range(0, len(input_list), splitLen):
                with open(HeaderBase, 'w') as fp:
                    fp.write(''.join(input_list[0:(i-1)])) #Header.txt
                with open(DataBase, 'w') as fp:
                    fp.write(''.join(input_list[i:]))   #Data.txt  

None of my logic is working. Any help would be appreciated, as I am stuck on how to make this logic work.

InputFile2

H0002   Version 9                                                                                                                       
H0003   Date_generated  5-Aug-81                                                                                                                        
H0004   Reporting_period_end_date   09-Jun-99                                                                                                                       
H0005   State   WAAAAA                                                                                                                      
H1000   Tene_no/Combined_rept_no    E79/38975                                                                                                                       
H1001   Tene_holder Magne Resources NL  
D   abc3SCO1    NORM    26  27  9483531 4.15    0.05    0.65    0.02    0.15    0   0.04    0.09    87.51   0.29

Python code and txt file attached here

Your code suffers from numerous issues:

  1. You don't actually find the position of H1000. Nothing in the code records it.
  2. You hard-code the split length as 5, disregarding the position of H1000.
  3. I don't understand your range() call. You're hopping from start to end in jumps of 5 lines?
  4. For every jump i, you write everything from the start of the document up to i to header.txt and the rest to data.txt. That means you're writing the entire document multiple times.
  5. You convert path_txt to a Path object, but then use it as if it were a regular string.
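To make points 1 and 2 concrete, here is a minimal sketch of locating the split position instead of hard-coding it. The helper name find_split_index and the shortened sample lists are illustrative assumptions, not part of the original code. It compares the first whitespace-separated column of each line with the token and returns that line's index.

```python
def find_split_index(lines, token="H1000"):
    """Return the index of the first line whose first column equals token.

    Returns len(lines) when the token never appears, so that slicing
    puts every line in the header part and leaves the data part empty.
    """
    for index, line in enumerate(lines):
        fields = line.split(maxsplit=1)
        if fields and fields[0] == token:
            return index
    return len(lines)


# Shortened versions of the two sample inputs from the question.
input_1 = [
    "H0002   Version 78\n",
    "H0003   Date_generated  5-Aug-81\n",
    "H0004   Reporting_period_end_date   09-Jun-81\n",
    "H1000   State   WAAAA\n",
    "H1002   Teno/Combno Z70/4000\n",
]
input_2 = [
    "H0002   Version 9\n",
    "H0003   Date_generated  5-Aug-81\n",
    "H0004   Reporting_period_end_date   09-Jun-99\n",
    "H0005   State   WAAAAA\n",
    "H1000   Tene_no/Combined_rept_no    E79/38975\n",
]

for lines in (input_1, input_2):
    split_at = find_split_index(lines)
    header, data = lines[:split_at], lines[split_at:]
    print(split_at, len(header), len(data))
```

With these shortened inputs the split index is 3 for the first list and 4 for the second, which is why a fixed value such as 5 cannot work for both files.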

I couldn't figure out what should happen when a directory is passed, since I believe writing all headers to one file and all data to another is not what you want.

Fixed code for a single file:

SPLIT_TOKEN = "H1000"


def split_file(path, header_path="header.txt", data_path="data.txt"):
    """Split a file into a header file and a data file at the first token line."""
    header = []
    data = []
    with open(path, "r") as f:
        for line in f:
            if line.startswith(SPLIT_TOKEN):
                data.append(line)  # The token line is the first data line
                data.extend(f)     # Everything after it is data as well
                break
            header.append(line)

    # If the token never appears (or the file is empty), data stays empty
    with open(header_path, "w") as f:
        f.writelines(header)
    with open(data_path, "w") as f:
        f.writelines(data)
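For the directory case, one possible convention is to write a separate pair of output files per input file, so that headers and data from different inputs never end up mixed. This is only a sketch under that assumption: the function names split_lines and split_path, the out_dir parameter, and the <stem>_header.txt / <stem>_data.txt naming scheme are all hypothetical and not something the question specifies.

```python
from pathlib import Path


def split_lines(lines, token="H1000"):
    """Return (header, data), splitting at the first line starting with token."""
    for index, line in enumerate(lines):
        if line.startswith(token):
            return lines[:index], lines[index:]
    return lines, []  # Token not found: everything counts as header


def split_path(path_txt, out_dir, token="H1000"):
    """Split one .txt file, or every .txt file directly inside a directory."""
    path_txt = Path(path_txt)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    if path_txt.is_file():
        txt_files = [path_txt]
    else:
        # Build the list up front so files written below are not picked up
        txt_files = sorted(path_txt.glob("*.txt"))

    for fn in txt_files:
        with open(fn) as f:
            header, data = split_lines(f.readlines(), token)
        # Hypothetical naming scheme: <stem>_header.txt and <stem>_data.txt
        (out_dir / f"{fn.stem}_header.txt").write_text("".join(header))
        (out_dir / f"{fn.stem}_data.txt").write_text("".join(data))
```

Calling split_path("INPUT.txt", "out") would then produce out/INPUT_header.txt and out/INPUT_data.txt. Pointing out_dir at a folder other than the input directory avoids re-splitting the generated files on a later run.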
