Parsing a text file in python and outputting to a CSV

Preface - I'm pretty new to Python, having had more experience in another language.

I have a text file with single column list of strings in the generic (but slightly varying) format "./abc123a1/type/1ab2_x_data_type.file.type"

I need to extract the abc123a1 and the 1ab2 portions from each of the several hundred rows and put them under two columns (column a and column b) in a CSV. Sometimes there may be both a "1ab2_a" and a "1ab2_b", but I only want one 1ab2 entry, so I'd grab "1ab2_a" and ignore the others.

I have the regex which I THINK will work:

import re

# extract_id is an illustrative name; the original snippet omitted its enclosing function
def extract_id(x):
    # Try a 4-character ID at the start of the string first,
    # then fall back to one enclosed in underscores.
    tmp = re.findall(r'^([a-zA-Z0-9]{4})_', x)
    if not tmp:
        tmp = re.findall(r'_([a-zA-Z0-9]{4})_', x)
    if len(tmp) == 0:
        return None
    elif len(tmp) > 1:
        print("ERROR found multiple matches")
        return "ERROR"
    else:
        return tmp[0].upper()

I am trying to build this script step by step, testing each piece to make sure it works, but it doesn't.

import sys
import csv

listOfData = []

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])
print listOfData

with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)

print listOfData

Still failing to get anything in the csv other than column headers, much less a parsed version!

Does anyone have any better ideas or formats I could do this in? A friend mentioned looking into glob.glob, but I haven't had luck getting that to work either.

IMHO, you were not far from making it work. The problem is that you read the whole file once just to print the lines, and then, already at the end of the file, you try to put them into a list and get an empty list!

You should read the file only once:

import sys
import csv

listOfData = []

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])
print listOfData

with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)

print listOfData

Once it works, you still have to use the regex to extract the relevant data to put into the CSV file.
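As a rough sketch of that remaining step, the snippet below pulls both columns out of each line with a single regular expression. It is written for Python 3, and the pattern and the name extract_columns are assumptions based only on the one sample line given in the question, so they may need adjusting for the real data.

```python
import re

# Hypothetical pattern: "./<column a>/<any folders>/<4 alphanumerics>_<rest>"
LINE_PATTERN = re.compile(r'^\./([^/]+)/(?:[^/]+/)*([A-Za-z0-9]{4})_')

def extract_columns(line):
    """Return [col_a, col_b] for one path, or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return [match.group(1), match.group(2)]

# Stand-in for the lines read from the input file
sample_lines = [
    "./abc123a1/type/1ab2_x_data_type.file.type\n",
    "not a path\n",
]

listOfData = []
for line in sample_lines:
    row = extract_columns(line)
    if row is not None:  # skip lines that do not look like the expected path
        listOfData.append(row)

print(listOfData)
```

Each entry of listOfData is then a two-element row that csv.writer's writerows can emit directly under the two column headers.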

I am not sure about your regex (it will most probably not work), but the reason your current simple, non-regex code does not work is this:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])

As you can see, you first iterate over each line in the file and print it, which is fine. But after that loop ends, the file pointer is at the end of the file, so iterating over it again produces nothing. You should iterate over it only once and do both the printing and the appending in the same loop. Example:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])

I think at least part of the problem is the two for loops in the following:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])

The first one prints all the lines of f, so there's nothing left for the second one to iterate over unless you first call f.seek(0) to rewind the file.
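The exhaustion and the rewind can be seen in a small Python 3 sketch. Here io.StringIO stands in for the opened file so the example is self-contained; a real file object opened with open() behaves the same way for this purpose.

```python
import io

# io.StringIO stands in for the opened text file (an assumption for the demo)
f = io.StringIO("first line\nsecond line\n")

first_pass = [line for line in f]   # consumes the whole stream
second_pass = [line for line in f]  # nothing left to read
f.seek(0)                           # rewind to the beginning
third_pass = [line for line in f]   # reads everything again

print(len(first_pass), len(second_pass), len(third_pass))
```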

An alternative would be to simply do this:

with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])

It's hard to tell if your regexes are OK without more than one line of sample input data.
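With only the single sample line from the question to go on, one thing can still be checked: what the two patterns return depends on whether x is the bare file name or the whole path. The sketch below is a Python 3 restatement of the question's logic under the illustrative name extract_id; using os.path.basename to isolate the file name is an assumption about how x was meant to be produced.

```python
import os.path
import re

def extract_id(x):
    """Python 3 restatement of the question's two-pattern lookup."""
    tmp = re.findall(r'^([a-zA-Z0-9]{4})_', x)
    if not tmp:
        tmp = re.findall(r'_([a-zA-Z0-9]{4})_', x)
    if len(tmp) == 0:
        return None
    if len(tmp) > 1:
        return "ERROR"
    return tmp[0].upper()

path = "./abc123a1/type/1ab2_x_data_type.file.type"

# On the file name alone, the anchored pattern finds the leading ID.
from_basename = extract_id(os.path.basename(path))

# On the full path, '^' cannot match, so the fallback picks up '_data_' instead.
from_full_path = extract_id(path)

print(from_basename)
print(from_full_path)
```

So on this sample the regex gives the wanted ID only when it is applied to the file name rather than the whole line.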

Are you sure you need all of the regular expressions? You seem to be parsing a list of paths and filenames. The path could be split up using a split command, for example:

print "./abc123a1/type/1ab2_a_data_type.file.type".split("/")

Would give:

['.', 'abc123a1', 'type', '1ab2_a_data_type.file.type']

You could then create a tuple consisting of the second entry and the part of the fourth entry up to the first '_', e.g.

('abc123a1', '1ab2')

Keeping these pairs in a set lets you write only the first occurrence of each:

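import csv
import sys

pairs = set()

# 'wb' suits the csv module on Python 2; on Python 3 use open('extracted.csv', 'w', newline='')
with open(sys.argv[1], 'r') as in_file, open('extracted.csv', 'wb') as out_file:
    writer = csv.writer(out_file)

    for row in in_file:
        folders = row.split("/")
        col_a = folders[1]                 # e.g. 'abc123a1'
        col_b = folders[3].split("_")[0]   # e.g. '1ab2'

        # Write each (col_a, col_b) pair only the first time it appears
        if (col_a, col_b) not in pairs:
            pairs.add((col_a, col_b))
            writer.writerow([col_a, col_b])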

So for an input looking like this:

./abc123a1/type/1ab2_a_data_type.file.type
./abc123a1/type/1ab2_b_data_type.file.type
./abc123a2/type/1ab2_a_data_type.file.type
./abc123a3/type/1ab2_a_data_type.file.type

You would get a CSV file looking like:

abc123a1,1ab2
abc123a2,1ab2
abc123a3,1ab2
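For anyone on Python 3, the same split-and-deduplicate idea might look like the sketch below. The function name write_unique_pairs, the header row, and the blank-line check are additions for illustration, and an in-memory buffer replaces the output file so the result can be inspected directly.

```python
import csv
import io

def write_unique_pairs(lines, out_file):
    """Write one (column a, column b) row per distinct pair, in input order."""
    writer = csv.writer(out_file)
    writer.writerow(["column a", "column b"])
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        folders = line.split("/")
        pair = (folders[1], folders[3].split("_")[0])
        if pair not in seen:
            seen.add(pair)
            writer.writerow(pair)

sample = """./abc123a1/type/1ab2_a_data_type.file.type
./abc123a1/type/1ab2_b_data_type.file.type
./abc123a2/type/1ab2_a_data_type.file.type
./abc123a3/type/1ab2_a_data_type.file.type
"""

# newline="" keeps the csv module's own line endings untouched
buffer = io.StringIO(newline="")
write_unique_pairs(sample.splitlines(), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, pass an object opened with open('extracted.csv', 'w', newline='') as out_file, and iterate over the input file opened from sys.argv[1].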
