Use python to extract a specific line from multiple files in the same directory

Question

I have multiple text files named ParticleCoordW_10000.dat, ParticleCoordW_20000.dat, etc. The files all look like this:

ITEM: TIMESTEP
10000
ITEM: NUMBER OF ATOMS
1000
ITEM: BOX BOUNDS pp pp pp
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
0.0000000000000000e+00 9.4000000000000004e+00
ITEM: ATOMS id x y z 
673 1.03559 0.495714 0.575399 
346 2.74458 1.30048 0.0566235 
991 0.570383 0.589025 1.44128 
793 0.654365 1.33452 1.91347 
969 0.217201 0.6852 0.287291
.
. 
. 
.

I'd like to use Python to extract the coordinates of a single particle, say atom ID 673. The problem is that the line position of atom 673 changes from file to file. So I'd like Python to locate atom 673 in every text file in the directory and save the associated xyz coordinates.

Previously I was using something like this to obtain all the coordinates:

import glob
import numpy as np

filenames = glob.glob('*.dat')
for f in filenames:
    # skip the 9 header lines and read the x, y, z columns in one pass
    coord = np.loadtxt(f, usecols=[1, 2, 3], skiprows=9)

Is there a way to modify this script in order to perform the task previously described?

EDIT: Based on the various comments, I wrote the following:

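import glob
import natsort
import numpy as np

coord = []
filenames = natsort.natsorted(glob.glob('*.dat'))
for f in filenames:
    with open(f, 'r') as fh:
        for row in fh:
            # compare the whole first field so that '673' does not match e.g. '6730'
            if row.split()[:1] == ['673']:
                coord.append(row.strip())
# coord holds text rows, so write them out as strings
np.savetxt("xyz.txt", coord, fmt='%s')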

This lets me collect the coordinates of a single particle across all the text files in the directory. However, I'd like to do this for every particle ID (1000 particles). What would be the most efficient way to do that?

Answer 1

Without more background, I can't think of a way to find the correct line other than reading each file up to the line where your atom ID is located.

You could do something like:

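atom_id = '673'  # ID to look for, compared as a string
with open('ParticleCoordW_10000.dat') as f:
    for row_number, line in enumerate(f):
        fields = line.split()
        if len(fields) == 4 and fields[0] == atom_id:
            x, y, z = map(float, fields[1:])  # or store row_number instead
            break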

Alternatively, you could build the mapping ID <-> row number for each file once, save it, and read it back in later.
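
A minimal sketch of that mapping idea, assuming the dump layout shown in the question (nine header lines, then one "id x y z" row per atom). The function name build_row_index and the header_lines parameter are illustrative, not part of any library, and the sample data is made up.

```python
def build_row_index(lines, header_lines=9):
    """Map each atom ID to the index of its row within `lines`.

    `lines` is the file content as a list of strings; the first
    `header_lines` entries are assumed to be the dump header.
    """
    index = {}
    for row_number, line in enumerate(lines):
        if row_number < header_lines:
            continue
        fields = line.split()
        if len(fields) == 4:  # expect: id x y z
            index[int(fields[0])] = row_number
    return index


# Small made-up dump in the same layout as the question's files.
sample = """ITEM: TIMESTEP
10000
ITEM: NUMBER OF ATOMS
3
ITEM: BOX BOUNDS pp pp pp
0.0 9.4
0.0 9.4
0.0 9.4
ITEM: ATOMS id x y z
673 1.03559 0.495714 0.575399
346 2.74458 1.30048 0.0566235
991 0.570383 0.589025 1.44128
""".splitlines()

row_index = build_row_index(sample)
# Look up atom 346 directly through the index.
x, y, z = map(float, sample[row_index[346]].split()[1:])
```

Building the index still means reading the file once, so this only pays off if the same file is queried for many different IDs.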

However, I think you should look for a way to save the positions in an ordered way. Maybe you can also explain in your question what prevents you from saving the positions ordered by atom ID.
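
If the files cannot be written in order in the first place, they can be reordered after the fact. The sketch below is one possible way, assuming nine header lines and that atom IDs run contiguously from 1 to N; sort_atom_rows is a hypothetical helper name and the data is made up.

```python
def sort_atom_rows(lines, header_lines=9):
    """Return the dump lines with the atom rows sorted by integer atom ID."""
    header = lines[:header_lines]
    rows = [ln for ln in lines[header_lines:] if ln.strip()]
    rows.sort(key=lambda ln: int(ln.split()[0]))
    return header + rows


# Made-up dump: 9 header lines followed by three unordered atom rows.
dump = [
    "ITEM: TIMESTEP", "10000",
    "ITEM: NUMBER OF ATOMS", "3",
    "ITEM: BOX BOUNDS pp pp pp", "0.0 9.4", "0.0 9.4", "0.0 9.4",
    "ITEM: ATOMS id x y z",
    "3 0.3 0.3 0.3",
    "1 0.1 0.1 0.1",
    "2 0.2 0.2 0.2",
]

sorted_dump = sort_atom_rows(dump)
# With contiguous IDs starting at 1, atom k sits at row header_lines + k - 1.
atom_2_row = sorted_dump[9 + 2 - 1]
```

Once every file is ordered this way, the row of a given atom is the same in all files, so no searching is needed.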

I can recommend using the HDF5 library for storing large datasets with metadata.

Answer 2

You can use a regular expression to pull the data out of all the files and then process it as you wish. Something like this may work.

I've assumed that there's nothing after the coordinate values in the file. You will have to run this script from the directory all the files are in.

import os, re

regex = r"^ITEM: ATOMS \d+ x y z.*" # basing on this line being "ITEM: ATOMS 675 x y z"

output = {} # dictionary to store all coordinates

for file in os.listdir():
    if os.path.isfile(file):
        with open(file,'r') as f:
            data = f.readlines()
            matches = re.findall(regex,''.join(data),re.MULTILINE | re.DOTALL)
            temp = matches[0].split('\n')
            output[temp[0].split()[2]] = temp[1:]

This will give you a dictionary with the atom ID as key and a list of all coordinates as value. Sample output:

output

{'675': ['673 1.03559 0.495714 0.575399 ',
  '346 2.74458 1.30048 0.0566235 ',
  '991 0.570383 0.589025 1.44128 ',
  '793 0.654365 1.33452 1.91347 ',
  '969 0.217201 0.6852 0.287291',
  '']}

Upon reviewing the question, I think I misinterpreted the input: the line ITEM: ATOMS id x y z is the same in every file. So I've changed the code a bit.

import os, re

regex = r"^ITEM: ATOMS id x y z.*" # the header line is exactly "ITEM: ATOMS id x y z"

output = {} # dictionary to store all coordinates

for file in os.listdir():
    if os.path.isfile(file) and file.endswith('.dat'): # ignore the script itself and other files
        with open(file,'r') as f:
            data = f.read()
        matches = re.findall(regex, data, re.MULTILINE | re.DOTALL)
        if matches: # skip files without an ATOMS section
            temp = matches[0].split('\n')
            output[file] = temp[1:] # storing against filename as key
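
For the follow-up in the question's EDIT (collecting every particle, not only ID 673), the per-file lists can be regrouped by atom ID in one further pass. The sketch below assumes output has the shape produced above (file name mapped to a list of row strings, possibly with a trailing empty string). group_by_atom and timestep_of are hypothetical helper names, the file names are assumed to end in _<timestep>.dat, and the example data is made up.

```python
import re
from collections import defaultdict


def timestep_of(filename):
    """Extract the trailing integer from names like 'ParticleCoordW_10000.dat'."""
    match = re.search(r"_(\d+)\.dat$", filename)
    return int(match.group(1)) if match else -1


def group_by_atom(output):
    """Turn {filename: [row, ...]} into {atom_id: [(x, y, z), ...]}.

    Files are visited in order of increasing timestep, so each atom's
    list is its trajectory in time order.
    """
    trajectories = defaultdict(list)
    for filename in sorted(output, key=timestep_of):
        for row in output[filename]:
            fields = row.split()
            if len(fields) != 4:  # skip blank or malformed rows
                continue
            atom_id = int(fields[0])
            trajectories[atom_id].append(tuple(float(v) for v in fields[1:]))
    return dict(trajectories)


# Tiny made-up example standing in for the real `output` dictionary.
example_output = {
    "ParticleCoordW_20000.dat": ["346 2.0 2.0 2.0", "673 1.5 0.5 0.5", ""],
    "ParticleCoordW_10000.dat": ["673 1.0 0.5 0.5 ", "346 2.5 1.0 0.0 ", ""],
}
trajectories = group_by_atom(example_output)
```

Each file is read and parsed only once, however many particles there are, which avoids rescanning the whole directory per atom ID. Sorting on the numeric timestep also keeps ParticleCoordW_100000.dat after ParticleCoordW_20000.dat, which a plain string sort would not.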
