
Process data from several text files

Any recommendations on how I can grab data from several text files and process them (compute totals, for example)? I have been trying to do it in Python but keep running into dead ends.

A machine generates a summary file in text format each time you run an operation; in this example, screening good apples from batches. First you load the apples, then the good ones are separated from the bad, and then you can reload the bad apples to retest them, and some are recovered. So at least two summary files are generated per batch, depending on how many times you reload the apples to recover good ones.

This is an example of the text file:

file1:

general Info:
    Batch No.        : A2J3
    Operation        : Test
    Fruit            : Apple
    Operation Number : A5500
    Quantity In      : 10
yield info:
    S1   S2     Total    Bin Name
     5    2       7      good
     1    2       3      bad

file2:

general Info:
    Batch No.        : A2J3
    Operation        : Test
    Fruit            : Apple
    Operation Number : A5500
    Quantity In      : 3
yield info:
    S1   S2     Total    Bin Name
     1    1       2      good
     0    0       1      bad

I want to take the data from a folder full of these txt files and merge the testing results using the following criteria:

  1. group files belonging to the same batch by identifying which txt files come from the same Batch No. and the same operation (based on the txt file's content, not the filename)

  2. merge the data from the two (or more) summary files into the following CSV format:

     Lot:
     Operation:
     Bin    First Pass    Second Pass    Final Yield    %Yield
     Good   7             2              9              90%
     Bad    3             1              1              10%

The number of sites (S1, S2, ...) is variable; it can go from 1 to 14 but is never less than 1. The bin types can also differ between text files (not limited to good and bad), but there will always be exactly one good bin.

Bins:
 Good
 Semi-bad
 Bad
 Worst
 ...
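Because the number of site columns varies, one way to split a single yield-info line (a minimal sketch; `parse_yield_line` is a made-up helper name) is to take the last two whitespace-separated tokens as the total and the bin name, and everything before them as per-site counts:

```python
def parse_yield_line(line):
    # Columns are S1 .. Sn, Total, Bin Name, where n can be 1 to 14.
    # Extended unpacking peels the last two tokens off the end.
    *sites, total, bin_name = line.split()
    return bin_name, [int(s) for s in sites], int(total)

print(parse_yield_line('  5    2       7      good'))
# ('good', [5, 2], 7)
```

The same call works unchanged for any site count, e.g. a five-site `Semi-bad` row.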

I'm new to Python and have only used this scripting language at school, so I know just the basics. The task I want to do is a bit overwhelming to me, so I started by processing a single text file and extracting the data I wanted, e.g. the Batch Number:

with open('R0.txt') as fh_d10SumFile:
    fh_d10SumFile_perline = fh_d10SumFile.read().splitlines()
    #print(fh_d10SumFile_perline)

# CONST holds the line index of each field in the summary file
TestProgramName_str = fh_d10SumFile_perline[CONST.TestProgram_field].split(':')[1]
LotNumber_str       = fh_d10SumFile_perline[CONST.LotNumber_field].split(':')[1]
QtyIn_int           = int( fh_d10SumFile_perline[CONST.UnitsIn_field].split(':')[1] )
TestIteration_str   = fh_d10SumFile_perline[CONST.TestIteration_field].split(':')[1]
TestType_str        = fh_d10SumFile_perline[CONST.TestType_field].split(':')[1]

Then grab all the bins in that summary file:

SoftBins_str = [line for line in fh_d10SumFile_perline if re.search(r'bin', line)]
bin2bin = {}
for bin_line in SoftBins_str:
    # split on whitespace, then reverse so the bin name (last column)
    # is at index 0 and the third-from-last column is at index 2
    SoftBins_data_str = [l.strip() for l in bin_line.split(' ') if l.strip()]
    SoftBins_data_str.reverse()
    bin2bin[SoftBins_data_str[0]] = SoftBins_data_str[2]

Then I got stuck, because I'm not sure how to do this reading and parsing across n text files containing n sites (S1, S2, ...). How do I grab this information from n text files, process it in memory (is this even possible with Python), and then write the output, with the computations, to the CSV output file?

The following should help get you started. As your text files have a fixed format, it is relatively simple to read and parse them. This script searches for all text files in the current folder, reads each file in, and stores the batches in a dictionary keyed by batch name, so that all summaries for the same batch are grouped together.

After all files are processed, it creates summaries for each batch and writes them to a single csv output file.

from collections import defaultdict
import glob
import csv

batches = defaultdict(list)

for text_file in glob.glob('*.txt'):
    with open(text_file) as f_input:
        rows = [row.strip() for row in f_input]

    header = [rows[x].split(':')[1].strip() for x in range(1, 6)]
    bins = {}

    for yield_info in rows[8:]:
        s1, s2, total, bin_name = yield_info.split()
        bins[bin_name] = [int(s1), int(s2), int(total)]

    batches[header[0]].append(header + [bins])


with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter='\t')

    for batch, passes in batches.items():
        bins_output = defaultdict(lambda: [[], 0])
        total_yield = 0

        for lot, operation, fruit, op_num, quantity, bins in passes:
            for bin_name, (s1, s2, total) in bins.items():
                bins_output[bin_name][0].append(total)
                bins_output[bin_name][1] += total
                total_yield += total

        csv_output.writerows([['Lot:', lot], ['Operation:', operation]])
        csv_header = ["Bin"] + ['Pass {}'.format(x) for x in range(1, 1 + len(passes))] + ["Final Yield", "%Yield"]
        csv_output.writerow(csv_header)

        for bin_name in sorted(bins_output.keys()):
            entries, total = bins_output[bin_name]
            percentage_yield = '{:.1f}%'.format((100.0 * total) / total_yield)
            csv_output.writerow([bin_name] + entries + [total, percentage_yield])

        csv_output.writerow([])     # empty row to separate batches

Giving you a tab delimited csv file as follows:

Lot:    A2J3
Operation:  Test
Bin Pass 1  Pass 2  Final Yield %Yield
Bad 3   1   4   30.8%
Good    7   2   9   69.2%

Note: the script has been updated to deal with any number of bin types.
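The parsing step above still assumes exactly two site columns (`s1, s2, total, bin_name = yield_info.split()`). If your files can carry S1 through S14, one way it could be generalized, keeping only the per-bin total that the summary actually aggregates, is (a sketch, not tested against your real files; `parse_bins` is a name I made up):

```python
def parse_bins(yield_rows):
    """Parse yield-info rows of the form 'S1 .. Sn Total BinName'.

    Only the per-bin total is kept, since the batch summary only
    aggregates totals; n may vary from file to file.
    """
    bins = {}
    for row in yield_rows:
        # ignore the per-site counts, keep the last two tokens
        *_sites, total, bin_name = row.split()
        bins[bin_name] = int(total)
    return bins

print(parse_bins(['5 2 7 good', '1 2 3 bad']))  # {'good': 7, 'bad': 3}
```

With this shape, the summarizing loop would read `for bin_name, total in bins.items():` instead of unpacking a fixed `(s1, s2, total)` triple.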
