How to split a file using a string as an identifier with Python?
I have a huge text file that I need to split into several files. The text file contains an identifier that marks where to split. Here is part of the text file:
Comp MOFVersion 10.1
Copyright 1997-2006. All rights reserved.
--------------------------------------------------
Mon 11/19/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
...
exit
---------------------
list volume
list partition
exit
---------------------
Volume 0 is the selected volume.
Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
* Disk 0 Online 238 GB 136 GB *
--------------------------------------------------
Tue 11/20/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
....
SERVICE_NAME: vds
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
---------------------
*exit /b 0
File not found - *.*
0 File(s) copied
--------------------------------------------------
Wed 11/21/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
---------------------
*exit /b 0
11/19/2021 08:34 AM <DIR> .
11/19/2021 08:34 AM <DIR> ..
11/19/2021 08:34 AM 0 SL
1 File(s) 0 bytes
2 Dir(s) 80,160,923,648 bytes free
What I expect is to split the file wherever the string "Starting The Process" appears. So a text file like the sample above would be split into 3 files, each with different content. For example:
file1
--------------------------------------------------
Mon 11/19/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
...
exit
---------------------
list volume
list partition
exit
---------------------
Volume 0 is the selected volume.
Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
* Disk 0 Online 238 GB 136 GB *
file2
--------------------------------------------------
Tue 11/20/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
....
SERVICE_NAME: vds
TYPE : 10 WIN32_OWN_PROCESS
STATE : 1 STOPPED
WIN32_EXIT_CODE : 0 (0x0)
SERVICE_EXIT_CODE : 0 (0x0)
CHECKPOINT : 0x0
WAIT_HINT : 0x0
---------------------
*exit /b 0
File not found - *.*
0 File(s) copied
file3
--------------------------------------------------
Wed 11/21/2022 8:34:22.35 - Starting The Process...
--------------------------------------------------
There are a lot of content here
==========================================
Computer: .
==========================================
Active: True
DmiRevision: 0
list disk
exit
---------------------
*exit /b 0
11/19/2021 08:34 AM <DIR> .
11/19/2021 08:34 AM <DIR> ..
11/19/2021 08:34 AM 0 SL
1 File(s) 0 bytes
2 Dir(s) 80,160,923,648 bytes free
This is what I tried:
logfile = "E:/DATA/result.txt"
with open(logfile, 'r') as text_file:
    lines = text_file.readlines()
for line in lines:
    if "Starting The Process..." in line:
        print(line)
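I can find the lines that contain the string, but I don't know how to collect the content of each of the 3 parts and write each part to a new file.
Is this possible in Python? Thanks for any advice.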
Well, if the file is small enough to fit comfortably in memory (say 1 GB or less), you can read the whole file into a string and then use re.findall:
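import re

with open('data.txt', 'r') as file:
    data = file.read()

# A section header is a dashed line followed by a line such as "Mon 11/19/2022 ...".
header = r'-{10,}\n\w{3} \d{2}/\d{2}/\d{4}'
# Match from one header up to (not including) the next header or the end of the data,
# so shorter dashed lines inside a section do not end the match early.
parts = re.findall(header + r'.*?(?=' + header + r'|\Z)', data, flags=re.S)

for cnt, part in enumerate(parts, start=1):
    with open('file ' + str(cnt), 'w') as output:
        output.write(part)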
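If the file is too large to read in one go, a line-by-line variant may be more practical. The following is only a minimal sketch and not part of the answer above: the function name split_on_marker, the out_dir parameter and the file1.txt naming scheme are illustrative assumptions, and it assumes every marker line is immediately preceded by its dashed line, as in the sample log.

```python
from pathlib import Path


def split_on_marker(src, out_dir, marker="Starting The Process"):
    """Stream src line by line and write one file per marker section.

    Returns the list of paths written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None    # output file of the current section, None before the first marker
    held = None   # previous line, held back until we know which file it belongs to
    try:
        with open(src, "r") as f:
            for line in f:
                if marker in line:
                    # Start a new section; the held line is its opening dashed line.
                    if out is not None:
                        out.close()
                    paths.append(out_dir / f"file{len(paths) + 1}.txt")
                    out = open(paths[-1], "w")
                    if held is not None:
                        out.write(held)
                elif out is not None and held is not None:
                    out.write(held)
                held = line
        # Flush the last held line into the final section.
        if out is not None and held is not None:
            out.write(held)
    finally:
        if out is not None:
            out.close()
    return paths
```

Something like split_on_marker("E:/DATA/result.txt", "E:/DATA/parts") would then write one file per section while keeping only the current and the previous line in memory. Anything before the first marker (the version and copyright lines) is skipped.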
If the dashed lines in the file have a fixed length, another solution could be:
with open('file.txt', 'r') as f:
    split_text = f.read().split('--------------------------------------------------')
split_text.pop(0)  # To remove the Copyright message at the start
for i in range(0, len(split_text) - 1, 2):
    with open(f'file{i // 2}.txt', 'w') as temp:
        temp_txt = ''.join(split_text[i:i+2])
        temp.write(temp_txt)
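Essentially, I just split on those dashed lines and join each consecutive pair of elements (the timestamp header and the body that follows it). This way the timestamp information is kept in each file's content, although the 50-dash separator lines themselves are consumed by split and are not written back.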
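To make the pairing step easier to follow, here is a small in-memory sketch of the same idea. The helper name pair_sections and the sample string are made up for illustration; unlike the snippet above, this version puts the dashed lines back around each header, which is an assumption about the desired output based on the expected files shown in the question.

```python
SEP = "-" * 50  # assumed fixed-length separator, as in the sample log


def pair_sections(text, sep=SEP):
    """Split text on sep and rebuild each header/body pair as one section."""
    pieces = text.split(sep)[1:]  # drop everything before the first separator
    sections = []
    # pieces alternate: header, body, header, body, ...
    for i in range(0, len(pieces) - 1, 2):
        header, body = pieces[i], pieces[i + 1]
        sections.append(sep + header + sep + body)
    return sections


sample = (
    "preamble\n"
    + SEP + "\nMon 11/19/2022 8:34:22.35 - Starting The Process...\n"
    + SEP + "\nfirst body\n"
    + SEP + "\nTue 11/20/2022 8:34:22.35 - Starting The Process...\n"
    + SEP + "\nsecond body\n"
)
sections = pair_sections(sample)
```

Shorter dashed lines inside a section (such as the 21-dash lines in the sample) are left alone, because split only matches the exact 50-character run. A body that itself contains a run of 50 or more dashes would break the header/body alternation, so this only suits logs where that cannot happen.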