Extracting specific lines from multiple files in the same directory using Python

Question

I have multiple text files named "ParticleCoordW_10000.dat", "ParticleCoordW_20000.dat", and so on. All of the files look like this:

ITEM: TIMESTEP
10000
ITEM: NUMBER OF ATOMS
1000
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
ITEM: ATOMS id x y z 
673 1.03559 0.495714 0.575399 
346 2.74458 1.30048 0.0566235 
991 0.570383 0.589025 1.44128 
793 0.654365 1.33452 1.91347 
969 0.217201 0.6852 0.287291
.
. 
. 
.

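I want to use Python to extract the coordinates of a single particle, say ATOM ID 673. The problem is that the line position of ATOM ID 673 changes from one text file to the next. So I would like Python to locate atom #673 in every text file in the directory and save the associated x, y, z coordinates.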

Previously I used something like this to get all the coordinates:

import glob
import numpy as np

filenames = glob.glob('*.dat')
for f in filenames:
    # skip the 9 header lines; columns 1-3 hold x, y, z
    x_data = np.loadtxt(f,usecols=[1],skiprows = 9)
    y_data = np.loadtxt(f,usecols=[2],skiprows = 9)
    z_data = np.loadtxt(f,usecols=[3],skiprows = 9)
    coord  = np.vstack((x_data,y_data,z_data)).T

Is there a way to modify this script to perform the task described above?

EDIT: Based on the various comments, I wrote the following:

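import glob
import natsort
import numpy as np

coord = []
filenames = natsort.natsorted(glob.glob('*.dat'))
for f in filenames:
    with open(f, 'r') as fh:
        for row in fh:
            # compare the whole first field so that e.g. '6730' does not match
            if row.split()[:1] == ['673']:
                coord.append(row.strip())
# the rows are strings, so write them with a string format
np.savetxt("xyz.txt", coord, fmt='%s')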

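This lets me collect all the coordinates of a single particle from every text file in the directory. However, I would like to do this for all particle IDs (1000 particles). What is the most efficient way to do that?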

Answer 1

Without more context, I can't think of a way to find the right line without reading the lines to see where the atom ID is.

You could do something like this:

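FILE, ID = 'ParticleCoordW_10000.dat', '673'  # example file name and atom ID (as a string)
with open(FILE) as f:
    for line_number, line in enumerate(f):
        fields = line.split()
        # an atom row has exactly four fields: id x y z
        if len(fields) == 4 and fields[0] == ID:
            x, y, z = map(float, fields[1:])  # or record line_number instead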

Otherwise, you could save and read back a "map" of ID <-> line number for each file.
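A minimal sketch of such an ID-to-line-number map is shown here. It assumes the file layout from the question (atom rows come after the "ITEM: ATOMS" header and have four fields). The function name `build_id_to_line` and the three-atom in-memory sample are made up for illustration; a real run would read the lines from each .dat file instead.

```python
import io


def build_id_to_line(lines):
    """Map each atom ID to the 0-based line number of its row.

    Assumes atom rows follow the 'ITEM: ATOMS' header line and
    consist of exactly four fields: id x y z.
    """
    mapping = {}
    in_atoms = False
    for line_number, line in enumerate(lines):
        if line.startswith("ITEM: ATOMS"):
            in_atoms = True
            continue
        if in_atoms:
            fields = line.split()
            if len(fields) == 4:
                mapping[int(fields[0])] = line_number
    return mapping


# Small in-memory stand-in for one .dat file (rows taken from the question).
sample = io.StringIO(
    "ITEM: TIMESTEP\n10000\nITEM: NUMBER OF ATOMS\n3\n"
    "ITEM: BOX BOUNDS pp pp pp\n0.0 9.4\n0.0 9.4\n0.0 9.4\n"
    "ITEM: ATOMS id x y z\n"
    "673 1.03559 0.495714 0.575399\n"
    "346 2.74458 1.30048 0.0566235\n"
    "991 0.570383 0.589025 1.44128\n"
)
lines = sample.readlines()
id_to_line = build_id_to_line(lines)

# Look up atom 673 directly through the map instead of scanning again.
x, y, z = map(float, lines[id_to_line[673]].split()[1:])
```

Building the map still costs one pass over each file, so it only pays off if the same file is queried for many atom IDs.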

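But I think you should consider a way to save the positions in an ordered fashion. Maybe you could also add to your question what prevents you from saving the positions sorted by atom ID.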
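If the files cannot be written in sorted order, the rows can be sorted by atom ID after reading, so that the same index refers to the same atom in every file. The following is only a sketch under the question's layout (9 header lines, then "id x y z" rows). The helper name `read_sorted_positions` is hypothetical, and the second frame's coordinates are made-up values.

```python
def read_sorted_positions(lines, header_rows=9):
    """Return [(atom_id, x, y, z), ...] sorted by atom_id."""
    rows = []
    for line in list(lines)[header_rows:]:
        fields = line.split()
        if len(fields) == 4:  # id x y z
            rows.append((int(fields[0]),
                         float(fields[1]), float(fields[2]), float(fields[3])))
    rows.sort(key=lambda row: row[0])
    return rows


# Nine header lines, as in the question's sample file.
header = [
    "ITEM: TIMESTEP\n", "10000\n", "ITEM: NUMBER OF ATOMS\n", "3\n",
    "ITEM: BOX BOUNDS pp pp pp\n", "0.0 9.4\n", "0.0 9.4\n", "0.0 9.4\n",
    "ITEM: ATOMS id x y z\n",
]
# First frame uses rows from the question; second frame is invented.
frame_a = header + ["673 1.03559 0.495714 0.575399\n",
                    "346 2.74458 1.30048 0.0566235\n",
                    "991 0.570383 0.589025 1.44128\n"]
frame_b = header + ["991 0.5 0.5 1.5\n",
                    "673 1.0 0.5 0.75\n",
                    "346 2.75 1.25 0.25\n"]

sorted_a = read_sorted_positions(frame_a)
sorted_b = read_sorted_positions(frame_b)
# After sorting, index 1 is atom 673 in both frames.
```

With every frame sorted the same way, collecting one atom across files reduces to picking the same index from each sorted list.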

I can suggest the hdf5 library for storing large datasets together with metadata.

Answer 2

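You can use a regular expression to pull the data out of all the files and then process it however you need. Something like this might work.

I am assuming that nothing follows the coordinate values in the file. You will have to run this script from the directory that contains all the files.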

import os, re

regex = r"^ITEM: ATOMS \d+ x y z.*" # basing on this line being "ITEM: ATOMS 675 x y z"

output = {} # dictionary to store all coordinates

for file in os.listdir():
    if os.path.isfile(file):
        with open(file,'r') as f:
            data = f.readlines()
            matches = re.findall(regex,''.join(data),re.MULTILINE | re.DOTALL)
            temp = matches[0].split('\n')
            output[temp[0].split()[2]] = temp[1:]

This gives you a dictionary with the ATOM ID as the key and the list of all coordinates as the value. Sample output:

output

{'675': ['673 1.03559 0.495714 0.575399 ',
  '346 2.74458 1.30048 0.0566235 ',
  '991 0.570383 0.589025 1.44128 ',
  '793 0.654365 1.33452 1.91347 ',
  '969 0.217201 0.6852 0.287291',
  '']}

After reviewing the question, I think I misunderstood the input. The line "ITEM: ATOMS id x y z" is static across all files, so I changed the code a little.

import os, re

regex = r"^ITEM: ATOMS id x y z.*" # basing on this line being exactly "ITEM: ATOMS id x y z"

output = {} # dictionary to store all coordinates

for file in os.listdir():
    if os.path.isfile(file):
        with open(file,'r') as f:
            data = f.read()
        matches = re.findall(regex, data, re.MULTILINE | re.DOTALL)
        if not matches: # skip files without an atoms section, e.g. this script itself
            continue
        temp = matches[0].split('\n')
        output[file] = temp[1:] # storing against filename as key
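To cover all 1000 particle IDs, as the question asks, the per-file `output` dictionary can be regrouped by atom ID. This is a sketch, not part of the original answer: the function name `to_trajectories` is hypothetical, the `output` literal below is a small stand-in for the real dictionary, and the second file's values are made up.

```python
from collections import defaultdict


def to_trajectories(output, filenames):
    """Group coordinates by atom ID: one (x, y, z) tuple per file, in file order."""
    trajectories = defaultdict(list)
    for name in filenames:
        for row in output[name]:
            fields = row.split()
            if len(fields) != 4:  # skips the empty trailing string
                continue
            trajectories[fields[0]].append(tuple(map(float, fields[1:])))
    return dict(trajectories)


# Stand-in for the `output` dictionary (filename -> list of row strings).
output = {
    "ParticleCoordW_10000.dat": ["673 1.03559 0.495714 0.575399 ",
                                 "346 2.74458 1.30048 0.0566235 ",
                                 ""],
    "ParticleCoordW_20000.dat": ["346 2.75 1.25 0.5",
                                 "673 1.5 0.25 0.75",
                                 ""],
}
# A plain sort is enough for these two names; natsort (used in the question)
# would be the safer choice when the step numbers differ in digit count.
filenames = sorted(output)
trajectories = to_trajectories(output, filenames)
```

Each file is read only once, and `trajectories["673"]` then holds that atom's position in every file, ordered by file name.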

Answer 1
0 2019-07-15 22:49:05

Answer 2
0 2019-07-15 23:47:01
