Extra characters ('.') appended to data from multiple files while writing to .csv file with python2 script

Question

I am trying to make a relatively simple script in python2 which crawls through multiple .out files in a directory and extracts some data. The data is then written to a .csv file along with an identifier.

My issue is that a seemingly random '.' or '..' is appended to the end of the data string.

Here is my code (I know this is horrible to look at, sorry in advance):

import os
import string
import time
import sys
import csv

input = raw_input
location = input('Set directory path: ')
os.makedirs(location+'/outputs/')
print "Created output directory."
print "Waiting for archiving to finish..."
forCall = "cd "+location+" && mv *.out outputs/"
os.system(forCall)
time.sleep(1)
print "Archived output files."

newLocation = location+"/outputs/"


def checker(filein, bondlength):
    o = open("results.csv", "a")
    with open(filein) as curFile:
        for line in curFile:
            if "SCF Done:" in line:
                var = line
                var = filter(lambda x: x.isdigit() or x == '-' or x == '.', var)
                var = var[1:-2] # slices the first '-' and two trailing '.'

                bondlength = ''.join(bondlength.split())
                bondlength = bondlength[:-4] # slices .out from 'bondlength.out'
                o.write(var+';'+bondlength+'\n')
    o.close()

for filename in os.listdir(newLocation):
    fileLocation = newLocation+filename
    checker(fileLocation, filename)

datacsv = csv.reader(open('results.csv'), delimiter=";")
sortedData = sorted(datacsv, key=lambda row: row[1], reverse=False)

with open('sortedData.csv', 'wb') as csv_file:
    wr = csv.writer(csv_file, delimiter=";")
    wr.writerows(sortedData)

The line in the .out file that I'm interested in looks like this:

SCF Done: E(RB+HF-LYP) = -107.450926197 AU after 5 cycles

I need to extract the value of E(...) (whatever computational method was used) from each .out file and append it to a .csv file with two columns: one for the energy and one for the bond length, which is taken from the name of the .out file (e.g. 1036.out; the name is the bond length multiplied by 10^3, but that doesn't really matter now).

Any help is greatly appreciated.

Answer 1

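The problem is in your approach to extracting the data. Filtering out every character that is not a digit, dash or dot turns your example line into --107.450926197..5: the first dash comes from the HF-LYP part, the two dots come from A.U. (your real files evidently contain A.U. rather than AU, otherwise there would be no dots to begin with), and the trailing 5 comes from 5 cycles. Slicing with [1:-2] then removes the first dash and the last two characters, which leaves -107.450926197. when the cycle count has one digit and -107.450926197.. when it has two. That is why the number of trailing dots looks random.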
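To make that concrete, here is a small sketch that reproduces the filtering and slicing from the question. The helper name keep_numeric_chars is mine, and the second sample line (15 cycles) is a made-up variation of your example, not taken from a real file. The filter is written as a generator expression so it behaves the same on Python 2 and Python 3.

```python
def keep_numeric_chars(line):
    # Same test as the lambda in the question: keep digits, '-' and '.'.
    return ''.join(ch for ch in line if ch.isdigit() or ch in '-.')

# The first line follows the question's example; the second is a hypothetical
# variation with a two-digit cycle count.
one_digit = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles\n"
two_digits = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 15 cycles\n"

filtered_one = keep_numeric_chars(one_digit)    # '--107.450926197..5'
filtered_two = keep_numeric_chars(two_digits)   # '--107.450926197..15'

# The question's slice drops the first character and the last two.
sliced_one = filtered_one[1:-2]   # '-107.450926197.'
sliced_two = filtered_two[1:-2]   # '-107.450926197..'
```

So the slice only looks right if the cycle count happens to have exactly four characters' worth of junk after the number, which it never does here.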

What I'd suggest instead is to locate the = in the line and take everything after it up to the next whitespace, something like:

var = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"
var = var[var.find("=") + 1:].strip()  # drop everything up to and including the equals sign
var = var[:var.find(" ") + 1].strip()  # drop everything after the first whitespace
# -107.450926197

Or, slightly less safely (it raises an IndexError if the line has no = or nothing after it), by splitting on = and then on whitespace:

var = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"
var = var.split("=", 1)[1].split(None, 1)[0]
# -107.450926197

Or do it with a simple regex:

import re

find_numbers = re.compile(r"-?[0-9]\d*(\.\d+)?")  # find any number

var = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"
var = find_numbers.search(var).group()
# -107.450926197
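One caveat with the "find any number" pattern: search() returns the first number in the line, so a method name that contains a digit would match before the energy does. As a sketch of a stricter variant, the pattern below is anchored on the "SCF Done: E(...) =" prefix and captures only what follows the equals sign. The names scf_energy and extract_energy, and the RB3LYP sample line, are my own illustrations; the pattern also assumes the method name contains no closing parenthesis and that the energy is written in plain decimal notation.

```python
import re

# Assumption: the line has the shape "SCF Done: E(<method>) = <energy> ...".
scf_energy = re.compile(r"SCF Done:\s+E\([^)]*\)\s*=\s*(-?\d+(?:\.\d+)?)")

def extract_energy(line):
    """Return the energy as a string, or None if the line does not match."""
    match = scf_energy.search(line)
    return match.group(1) if match else None

# Hypothetical line whose method name contains a digit.
sample = "SCF Done: E(RB3LYP) = -107.450926197 A.U. after 5 cycles"
energy = extract_energy(sample)   # '-107.450926197'
```

Returning None for non-matching lines also lets you drop the separate "SCF Done:" in line check if you want to.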

You should also consider loading your existing results into a list first, appending to that list as you iterate over your *.out files, then sorting the list and overwriting results.csv once at the end.
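A rough sketch of that load, append, sort, overwrite flow could look like the following. The function names are hypothetical, the energy is extracted with the split approach from above, and sorting assumes every file name is an integer bond length such as 1036.out. The file handling is written for Python 3; on Python 2 you would open the output with mode 'wb' and drop newline=''.

```python
import csv
import os

def load_rows(path):
    """Return the rows already stored in the CSV, or [] if it does not exist."""
    if not os.path.exists(path):
        return []
    with open(path) as csv_file:
        return [row for row in csv.reader(csv_file, delimiter=";") if row]

def collect_rows(directory):
    """Build [energy, bond_length] rows from every .out file in directory."""
    rows = []
    for filename in os.listdir(directory):
        name, ext = os.path.splitext(filename)
        if ext != ".out":
            continue
        with open(os.path.join(directory, filename)) as handle:
            for line in handle:
                if "SCF Done:" in line:
                    energy = line.split("=", 1)[1].split(None, 1)[0]
                    rows.append([energy, name])
    return rows

def save_sorted(rows, path):
    """Sort by bond length (as an integer) and overwrite the CSV in one go."""
    ordered = sorted(rows, key=lambda row: int(row[1]))
    with open(path, "w", newline="") as csv_file:  # Python 2: open(path, "wb")
        csv.writer(csv_file, delimiter=";").writerows(ordered)
    return ordered
```

You would then call something like save_sorted(load_rows('results.csv') + collect_rows(newLocation), 'results.csv'). Sorting on int(row[1]) rather than on the raw string keeps 950 ahead of 1036, and writing once at the end removes the need for the intermediate file and the second sortedData.csv.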
