How to extract columns (almost the same) between two strings, using Python
I have a very large text file containing 1,339,018 lines, and I want to extract three sections:
My FILE.txt:
.
.
.
-----------------------
first ATOMIC CHARGES
-----------------------
0 C : -0.157853
1 C : -0.156875
2 C : -0.143714
3 C : -0.140489
4 S : 0.058926
5 H : 0.128758
6 H : 0.128814
7 H : 0.142420
8 H : 0.140013
My charges : -0.0000000
------------------------
.
..
.
-----------------------
first ATOMIC CHARGES AND SPIN
-----------------------
0 C : -0.208137 0.054313
1 C : -0.206691 0.053890
2 C : -0.266791 0.395830
3 C : -0.262729 0.395691
4 S : -0.184730 0.179002
5 H : 0.023341 -0.009535
6 H : 0.023405 -0.009489
7 H : 0.042728 -0.029862
8 H : 0.039605 -0.029841
My charges : -1.0000000
------------------------
.
.
.
.
-----------------------
first ATOMIC CHARGES AND SPIN
-----------------------
0 C : -0.086045 0.075562
1 C : -0.085256 0.075871
2 C : 0.022683 0.483590
3 C : 0.025286 0.483583
4 S : 0.246328 -0.079498
5 H : 0.215005 -0.003936
6 H : 0.215043 -0.003948
7 H : 0.224379 -0.015598
8 H : 0.222578 -0.015627
My charges : 1.0000000
------------------------
.
.
.
I tried the following script to extract the fourth column and convert it into lists, for example:
oX = [-0.157853,-0.156875,-0.143714 ...]
oY = [-0.208137,-0.206691,...]
oZ = [-0.086045,-0.085256,...]
Unfortunately, the third print statement does not work.
with open('FILE.txt', 'rb') as f:
    textfile_temp = f.read()

print textfile_temp.split('first ATOMIC CHARGES')[1].split("My charges : -0.0000000")[0]
print textfile_temp.split('first ATOMIC CHARGES AND SPIN')[1].split("My charges : -1.0000000")[0]
print textfile_temp.split('first ATOMIC CHARGES AND SPIN')[1].split("My charges : 1.0000000")[0]
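Is this possible?
Try making one small change to the last line, as shown below. You were close! Splitting on 'first ATOMIC CHARGES AND SPIN' yields one piece per occurrence of that header, so the text after the second occurrence is at index 2, not 1.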
with open('FILE.txt', 'rb') as f:
    textfile_temp = f.read()

print textfile_temp.split('first ATOMIC CHARGES')[1].split("My charges : -0.0000000")[0]
print textfile_temp.split('first ATOMIC CHARGES AND SPIN')[1].split("My charges : -1.0000000")[0]
print textfile_temp.split('first ATOMIC CHARGES AND SPIN')[2].split("My charges : 1.0000000")[0]
#                                                          ^ change this index from 1 to 2
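To see why the index matters, here is a minimal sketch on a shortened stand-in for the file. The `sample` string and the variable names are illustrative assumptions; only the numeric values are taken from the question.

```python
# Shortened stand-in for FILE.txt (layout assumed for illustration only).
sample = (
    "first ATOMIC CHARGES\n"
    "0 C : -0.157853\n"
    "My charges : -0.0000000\n"
    "first ATOMIC CHARGES AND SPIN\n"
    "0 C : -0.208137 0.054313\n"
    "My charges : -1.0000000\n"
    "first ATOMIC CHARGES AND SPIN\n"
    "0 C : -0.086045 0.075562\n"
    "My charges : 1.0000000\n"
)

pieces = sample.split("first ATOMIC CHARGES AND SPIN")
# pieces[0] is everything before the first "AND SPIN" header,
# pieces[1] follows the first such header, pieces[2] follows the second.
second_block = pieces[1].split("My charges : -1.0000000")[0]
third_block = pieces[2].split("My charges : 1.0000000")[0]

print(len(pieces))
print(second_block.strip())
print(third_block.strip())
```

With index 1 in the last line, the text "My charges : 1.0000000" never occurs in that piece (it contains "-1.0000000" instead), so the second split leaves the piece whole and the second block is printed again.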
You can use a regular expression to extract the values you need:
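import re

data = []    # one list of charges per block
block = []   # charges of the block currently being read

with open('input.txt') as f_input:
    for row in f_input:
        # Data rows are assumed to start with whitespace and an atom index,
        # e.g. "   0 C :   -0.157853"; capture the first decimal number.
        values = re.findall(r'\s+\d+.*?(-?\d+\.\d+)', row)
        if values:
            block.append(float(values[0]))
        elif row.startswith('first ATOMIC') and block:
            # A new header closes the previous block.
            data.append(block)
            block = []

# Store the last block, which has no header after it.
if block:
    data.append(block)

oX, oY, oZ = data
print(oX)
print(oY)
print(oZ)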
This prints:
[-0.157853, -0.156875, -0.143714, -0.140489, 0.058926, 0.128758, 0.128814, 0.14242, 0.140013]
[-0.208137, -0.206691, -0.266791, -0.262729, -0.18473, 0.023341, 0.023405, 0.042728, 0.039605]
[-0.086045, -0.085256, 0.022683, 0.025286, 0.246328, 0.215005, 0.215043, 0.224379, 0.222578]
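As an alternative, the same result can be sketched without a regular expression by tracking whether the current line is inside a block. This is only a minimal Python 3 sketch: the function name `extract_charge_blocks`, its `header` and `footer` parameters, and the shortened sample are hypothetical, and it assumes every block starts at a line beginning with "first ATOMIC" and ends at the "My charges" line.

```python
def extract_charge_blocks(lines, header="first ATOMIC", footer="My charges"):
    """Return one list of floats per block: the first value after the colon."""
    blocks = []
    current = None  # None means we are outside a block
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(header):
            current = []                 # a header opens a new block
        elif stripped.startswith(footer):
            if current is not None:
                blocks.append(current)   # the footer closes the block
            current = None
        elif current is not None and ":" in stripped:
            # Data rows are assumed to look like "0 C : -0.157853 [spin]".
            fields = stripped.split(":", 1)[1].split()
            if fields:
                current.append(float(fields[0]))
    return blocks


# Shortened sample for illustration; a real run would pass the open file,
# e.g. `with open("FILE.txt") as f: oX, oY, oZ = extract_charge_blocks(f)`.
sample_lines = [
    "-----------------------",
    "first ATOMIC CHARGES",
    "-----------------------",
    "   0 C :   -0.157853",
    "   1 C :   -0.156875",
    "My charges :   -0.0000000",
    "first ATOMIC CHARGES AND SPIN",
    "-----------------------",
    "   0 C :   -0.208137    0.054313",
    "My charges :   -1.0000000",
]
blocks = extract_charge_blocks(sample_lines)
print(blocks)
```

Because the function iterates over lines, it can be given the open file object directly and does not need the whole file in memory.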