Extra characters ('.') appended to data from multiple files while writing to .csv file with python2 script

I am trying to make a relatively simple script in python2 which crawls through multiple .out files in a directory and extracts some data. The data is then written to a .csv file along with an identifier.

My issue is that a seemingly random '.' or '..' is appended to the end of the data string.

Here is my code (I know this is horrible to look at, sorry in advance):

import os
import string
import time
import sys
import csv

input = raw_input
location = input('Set directory path: ')
os.makedirs(location+'/outputs/')
print "Created output directory."
print "Waiting for archiving to finish..."
forCall = "cd "+location+" && mv *.out outputs/"
os.system(forCall)
time.sleep(1)
print "Archived output files."

newLocation = location+"/outputs/"


def checker(filein, bondlength):
    o = open("results.csv", "a")
    with open(filein) as curFile:
        for line in curFile:
            if "SCF Done:" in line:
                var = line
                var = filter(lambda x: x.isdigit() or x == '-' or x == '.', var)
                var = var[1:-2] # slices the first '-' and two trailing '.'

                bondlength = ''.join(bondlength.split())
                bondlength = bondlength[:-4] # slices .out from 'bondlength.out'
                o.write(var+';'+bondlength+'\n')
    o.close()

for filename in os.listdir(newLocation):
    fileLocation = newLocation+filename
    checker(fileLocation, filename)

datacsv = csv.reader(open('results.csv'), delimiter=";")
sortedData = sorted(datacsv, key=lambda row: row[1], reverse=False)

with open('sortedData.csv', 'wb') as csv_file:
    wr = csv.writer(csv_file, delimiter=";")
    wr.writerows(sortedData)

The line in the .out file that I'm interested in looks like this:

SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles

Now I need to get the value of E (for whatever computational method was used) from each .out file and append it to a .csv file with 2 columns: one for the energy and one for the bond length (multiplied by 10^3, but that doesn't really matter now), which is the name of the .out file (e.g. 1036.out).

Any help is greatly appreciated.

The problem is in how you extract the data. Keeping only digits, dashes and dots from your example line gives --107.450926197..5: the first dash comes from the HF-LYP part, the two dots come from A.U., and the trailing 5 comes from 5 cycles. Slicing with [1:-2] then drops the first character and the last two, which leaves -107.450926197. with one stray dot. When the cycle count has two digits, the slice removes both digits and neither dot, so you get two stray dots; that is why the suffix looks random.
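Running the question's filter and slice on the sample line shows exactly which characters survive. This is a minimal sketch: the helper name filter_and_slice and the 12-cycle line are made up for illustration, and the filter is written as a join so it behaves the same on Python 2 and 3. With a one-digit cycle count the slice leaves one trailing dot, with a two-digit count it leaves two.

```python
# Reproduces the filtering and slicing from the question on two sample lines.
# The second line (12 cycles) is a hypothetical variation, not from the post.
def filter_and_slice(line):
    # Same rule as the question's filter(...): keep digits, '-' and '.'
    kept = ''.join(c for c in line if c.isdigit() or c in '-.')
    # Same slice as the question: drop the first character and the last two
    return kept, kept[1:-2]

line_5 = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"
line_12 = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 12 cycles"

print(filter_and_slice(line_5))   # ('--107.450926197..5', '-107.450926197.')
print(filter_and_slice(line_12))  # ('--107.450926197..12', '-107.450926197..')
```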

What I'd suggest instead is to locate the = in the line and keep only the text between it and the next whitespace, something like:

var = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"
var = var[var.find("=") + 1:].strip()  # clean out everything before the equal sign
var = var[:var.find(" ") + 1].strip()  # clean out everything after the first space (assumes a space follows the number)
# -107.450926197

Or, slightly less safely (it raises an IndexError if the line contains no =), split on = and then on whitespace:

var = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"
var = var.split("=", 1)[1].split(None, 1)[0]
# -107.450926197

Or do it with a simple regex:

import re

find_numbers = re.compile(r"-?[0-9]\d*(\.\d+)?")  # find any number

var = "SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"
var = find_numbers.search(var).group()
# -107.450926197
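One caveat with a generic number pattern is that search() returns the first number in the line, so a method label containing a digit (B3LYP, for example) would match before the energy. Here is a minimal sketch that anchors on the SCF Done: prefix and the = sign instead. The function name extract_scf_energy and the B3LYP sample line are illustrative assumptions, not part of the original post, and the pattern does not handle exponent notation.

```python
import re

# Anchor on "SCF Done:", skip the method label up to "=", then capture a signed decimal.
SCF_ENERGY = re.compile(r"SCF Done:.*?=\s*(-?\d+(?:\.\d+)?)")

def extract_scf_energy(line):
    """Return the energy as a string, or None if the line has no SCF energy."""
    match = SCF_ENERGY.search(line)
    return match.group(1) if match else None

# Sample line from the question
print(extract_scf_energy("SCF Done: E(RB+HF-LYP) = -107.450926197 A.U. after 5 cycles"))
# Hypothetical line whose method label contains a digit
print(extract_scf_energy("SCF Done: E(RB3LYP) = -107.450926197 A.U. after 12 cycles"))
# A line without an SCF energy gives None instead of raising
print(extract_scf_energy("some unrelated line"))
```

Returning None for non-matching lines also removes the need for the separate "SCF Done:" in line check before extracting.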

You should also consider loading the existing results first, appending to that same list as you iterate over your *.out files, then sorting the list and overwriting results.csv once at the end.
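The load, collect, sort and overwrite flow could look roughly like the sketch below. All function names and the regex are assumptions made for illustration, and the numeric sort assumes every file name is a number such as 1036.out, as in the question. Rows are written by hand with a ; separator, as the question's code does, so the same code runs on Python 2 and 3.

```python
import os
import re

# Hypothetical pattern: the number that follows "=" on an "SCF Done:" line
SCF_ENERGY = re.compile(r"SCF Done:.*?=\s*(-?\d+(?:\.\d+)?)")

def load_rows(csv_path):
    """Read existing 'energy;bondlength' rows, or return [] if the file is missing."""
    if not os.path.exists(csv_path):
        return []
    with open(csv_path) as handle:
        return [line.strip().split(';') for line in handle if line.strip()]

def collect_rows(directory):
    """Return one [energy, bondlength] row per SCF Done line in each .out file."""
    rows = []
    for filename in os.listdir(directory):
        if not filename.endswith('.out'):
            continue
        bondlength = filename[:-len('.out')]  # '1036.out' -> '1036'
        with open(os.path.join(directory, filename)) as handle:
            for line in handle:
                match = SCF_ENERGY.search(line)
                if match:
                    rows.append([match.group(1), bondlength])
    return rows

def update_results(directory, csv_path):
    """Merge old and new rows, sort by bond length, and overwrite the CSV once."""
    rows = load_rows(csv_path) + collect_rows(directory)
    # Assumption: file names are numeric, so sort them as numbers, not strings
    rows.sort(key=lambda row: float(row[1]))
    with open(csv_path, 'w') as handle:
        for energy, bondlength in rows:
            handle.write(energy + ';' + bondlength + '\n')
    return rows
```

Calling update_results(newLocation, 'results.csv') would replace both the append-mode writes and the separate sorting pass. Rerunning it over the same .out files adds duplicate rows, so deduplicate if that matters for your workflow.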
