How to split a file by using string as identifier with python?

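I have a huge text file that I need to split into several files. The text file contains an identifier that marks where each split should happen. Here is part of the text file: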

Comp MOFVersion 10.1
Copyright 1997-2006. All rights reserved.
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...

exit 
--------------------- 
list volume 
list partition 
exit
--------------------- 

Volume 0 is the selected volume.

Disk ###  Status         Size     Free     Dyn  Gpt
--------  -------------  -------  -------  ---  ---
* Disk 0    Online          238 GB   136 GB        *

-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....
SERVICE_NAME: vds 
    TYPE               : 10  WIN32_OWN_PROCESS  
    STATE              : 1  STOPPED 
    WIN32_EXIT_CODE    : 0  (0x0)
    SERVICE_EXIT_CODE  : 0  (0x0)
    CHECKPOINT         : 0x0
    WAIT_HINT          : 0x0
--------------------- 
*exit /b 0 
File not found - *.*
0 File(s) copied

-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
--------------------- 
*exit /b 0 

11/19/2021  08:34 AM    <DIR>          .
11/19/2021  08:34 AM    <DIR>          ..
11/19/2021  08:34 AM                 0 SL
               1 File(s)              0 bytes
               2 Dir(s)  80,160,923,648 bytes free

I expect to split the file wherever the string "Starting The Process" appears. So a text file like the example above would be split into 3 files, each with different content. For example:

file1
-------------------------------------------------- 
Mon 11/19/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
...

exit 
--------------------- 
list volume 
list partition 
exit
--------------------- 

Volume 0 is the selected volume.

Disk ###  Status         Size     Free     Dyn  Gpt
--------  -------------  -------  -------  ---  ---
* Disk 0    Online          238 GB   136 GB        *


file2
-------------------------------------------------- 
Tue 11/20/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here
....
SERVICE_NAME: vds 
    TYPE               : 10  WIN32_OWN_PROCESS  
    STATE              : 1  STOPPED 
    WIN32_EXIT_CODE    : 0  (0x0)
    SERVICE_EXIT_CODE  : 0  (0x0)
    CHECKPOINT         : 0x0
    WAIT_HINT          : 0x0
--------------------- 
*exit /b 0 
File not found - *.*
0 File(s) copied

file3
-------------------------------------------------- 
Wed 11/21/2022 8:34:22.35 - Starting The Process... 
-------------------------------------------------- 

There are a lot of content here

==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
--------------------- 
*exit /b 0 

11/19/2021  08:34 AM    <DIR>          .
11/19/2021  08:34 AM    <DIR>          ..
11/19/2021  08:34 AM                 0 SL
               1 File(s)              0 bytes
               2 Dir(s)  80,160,923,648 bytes free

This is what I tried:

logfile = "E:/DATA/result.txt"
with open(logfile, 'r') as text_file:
    lines = text_file.readlines()
    for line in lines:
        if "Starting The Process..." in line:
            print(line)

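This only finds the lines that contain the string. I don't know how to get the content of each of the 3 parts after splitting and write each part to a new file.

Is this possible with Python? Thanks for any advice.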

Well, if the file is small enough to fit comfortably in memory (say 1 GB or less), you can read the whole file into one string and then use re.findall:

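import re

# A section starts at a long dash rule (30+ dashes) that is immediately
# followed by a "Mon 11/19/2022 ..." timestamp line. Shorter dash runs
# inside a section (e.g. the 21-dash separators) do not end the match.
header = r'-{30,}[ \t]*\n\w{3} \d{2}/\d{2}/\d{4}'

with open('data.txt', 'r') as file:
    data = file.read()

# Each match runs lazily up to the next header, or to the end of the file.
parts = re.findall(header + r'.*?(?=' + header + r'|\Z)', data, flags=re.S)

for cnt, part in enumerate(parts, start=1):
    with open(f'file{cnt}.txt', 'w') as output:
        output.write(part)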
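
If the file is too large to read into one string, a line-by-line variant of the asker's loop may be an option. The following is only a sketch under assumptions: the names `split_sections`, `write_sections`, `MARKER` and `prefix` are hypothetical, and it assumes the dash rule directly above each "Starting The Process" line belongs to the new section. Everything before the first marker (the copyright preamble) is dropped.

```python
MARKER = "Starting The Process"  # identifier taken from the question


def split_sections(lines, marker=MARKER):
    """Yield each section as a list of lines, consuming one line at a time.

    The line just before a marker line (the dash rule) is treated as the
    first line of the new section. Lines before the first marker are dropped.
    """
    section = None  # lines of the current section; None until the first marker
    held = None     # latest line, held back until we know which section owns it
    for line in lines:
        if marker in line:
            if section is not None:
                yield section
            # The held line is the rule above the timestamp: it opens the new section.
            section = [held] if held is not None else []
        elif held is not None and section is not None:
            section.append(held)
        held = line
    if section is not None:
        if held is not None:
            section.append(held)
        yield section


def write_sections(path, prefix="file"):
    """Write each section of `path` to prefix1.txt, prefix2.txt, ... (hypothetical naming)."""
    with open(path, "r") as src:
        for n, section in enumerate(split_sections(src), start=1):
            with open(f"{prefix}{n}.txt", "w") as out:
                out.writelines(section)
```

Calling `write_sections("E:/DATA/result.txt")` would then produce one output file per section while holding only a single section in memory at a time.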

If the dash rule in the file has a fixed length, another solution might be:

with open('file.txt', 'r') as f:
    split_text = f.read().split('--------------------------------------------------')
split_text.pop(0)  # Remove the copyright message at the start

# The remaining pieces alternate: timestamp line, section body, timestamp line, ...
for i in range(0, len(split_text) - 1, 2):
    with open(f'file{i // 2}.txt', 'w') as temp:
        temp.write(''.join(split_text[i:i + 2]))

Essentially, I just split on those dashes and join each consecutive pair of elements. This way, the timestamp information is kept in the contents of each file.
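
To make the pairing concrete, here is a small in-memory sketch of what splitting on the rule returns. The sample text, the shortened 10-dash `RULE`, and the helper `pair_sections` are all made up for illustration. Note that the pairing relies on the long rule appearing exactly twice per section; a stray rule of the same length inside a body would shift every later pair.

```python
RULE = "-" * 10  # shortened stand-in for the long dash rule in the real log

sample = (
    "Copyright notice\n"
    f"{RULE}\nMon - Starting The Process...\n{RULE}\nbody one\n"
    f"{RULE}\nTue - Starting The Process...\n{RULE}\nbody two\n"
)

pieces = sample.split(RULE)
preamble = pieces.pop(0)  # text before the first rule (the copyright message)


def pair_sections(pieces, rule=""):
    """Join pieces two at a time: (timestamp piece, body piece).

    With rule="" this matches the ''.join(...) used above; passing the
    rule string puts the dash lines back into each section.
    """
    return [
        rule + pieces[i] + rule + pieces[i + 1]
        for i in range(0, len(pieces) - 1, 2)
    ]


sections = pair_sections(pieces, RULE)
```

Here `sections` holds two strings, one per timestamp, each starting with the rule, the timestamp line, the rule again, and then the body.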
