
Extracting oddly arranged data from csv and converting to another csv file using python

I have an odd CSV file in which header names and their corresponding values are laid out as below:

,,,Completed Milling Job,,,,,, # row 1

,,,,Extended Report,,,,,

,,Job Spec numerical control,,,,,,,

Job Number,3456,,,,,, Operator Id,clipper,

Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,

Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,

I need to extract the data from this structure and create another CSV file with the structure below:

Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time,,, # header
Completed Milling Job,3456,Caterpillar,Stepper motor,clipper,3/12/2013 6:22,3/12/2013 9:16,,, # data row

Note that the output has a new header column called 'Status', whose value comes from the first row of the original CSV file. The rest of the column names in the output file are extracted from the original file.

Any thoughts will be greatly appreciated - thanks

Assuming the files are all exactly like that (at least in terms of caps) this should work, though I can only guarantee it on the exact data you have supplied:

#!/usr/bin/python
import glob
from sys import argv

g=open(argv[2],'w')
g.write("Status,Job Number,Coder Machine Name,Machine type, Operator Id,Job Start time,Job end time\n")
for fname in glob.glob(argv[1]):
    with open(fname) as f:
        status=f.readline().strip().strip(',')
        f.readline()#extended report not needed
        f.readline()#job spec numerical control not needed
        s=f.readline()
        job_no=s.split('Job Number,')[1].split(',')[0]
        op_id=s.split('Operator Id,')[1].strip().strip(',')
        s=f.readline()
        machine_name=s.split('Coder Machine Name,')[1].split(',')[0]
        start_t=s.split('Job Start time,')[1].strip().strip(',')
        s=f.readline()
        machine_type=s.split('Machine type,')[1].split(',')[0]
        end_t=s.split('Job end time,')[1].strip().strip(',')
    g.write(",".join([status,job_no,machine_name,machine_type,op_id,start_t,end_t])+"\n")
g.close()

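It takes a glob pattern (like Job*.data) and an output filename, and should construct what you need. Save it as 'so.py' or something similar and run it as python so.py '<data_files_wildcarded>' output.csv. Quote the pattern on shells that expand wildcards themselves (such as bash): if the shell expands it to several file names, argv[2] becomes the second data file instead of output.csv, and that data file gets overwritten.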
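If a value could ever contain a quoted comma, splitting on ',' by hand would misparse it. Below is a minimal sketch of the same extraction using the standard csv module. The helper name extract_record and the FIELDS list are illustrative and not part of the answer above, and the sketch assumes the same layout as the sample: three title rows followed by rows of key/value cells.

```python
import csv
import io

# Column names in the desired output order (taken from the question).
FIELDS = ["Job Number", "Coder Machine Name", "Machine type",
          "Operator Id", "Job Start time", "Job end time"]

def extract_record(lines):
    """Return one output row for a report (hypothetical helper)."""
    rows = [[cell.strip() for cell in row] for row in csv.reader(lines)]
    rows = [row for row in rows if any(row)]   # drop blank lines
    # The first non-blank row holds only the status text.
    status = next(cell for cell in rows[0] if cell)
    found = {}
    for row in rows[3:]:                       # skip the three title rows
        for i, cell in enumerate(row[:-1]):
            if cell in FIELDS:                 # a known key: its value is the next cell
                found[cell] = row[i + 1]
    return [status] + [found.get(name, "") for name in FIELDS]

sample = """\
,,,Completed Milling Job,,,,,,
,,,,Extended Report,,,,,
,,Job Spec numerical control,,,,,,,
Job Number,3456,,,,,, Operator Id,clipper,
Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,
Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,
"""

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(["Status"] + FIELDS)
writer.writerow(extract_record(sample.splitlines()))
print(out.getvalue())
```

To process real files, pass an open file object to extract_record in place of sample.splitlines(), and give csv.writer a file opened with newline="".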

Here is a solution that should work on any CSV files that follow the same pattern as what you showed. That is a seriously nasty format.

I got interested in the problem and worked on it during my lunch break. Here's the code:

COMMA = ','
NEWLINE = '\n'

def _kvpairs_from_line(line):
    line = line.strip()
    values = [item.strip() for item in line.split(COMMA)]

    i = 0
    while i < len(values):
        if not values[i]:
            i += 1  # advance past empty value
        else:
            # yield pair of values
            yield (values[i], values[i+1])
            i += 2  # advance past pair

def kvpairs_by_column_then_row(lines):
    """
    Given a series of lines, where each line is comma-separated values
    organized as key/value pairs like so:
        key_1,value_1,key_n+1,value_n+1,...
        key_2,value_2,key_n+2,value_n+2,...
        ...
        key_n,value_n,key_n+n,value_n+n,...

    Yield up key/value pairs taken from the first column, then from the second column
    and so on.
    """
    pairs = [_kvpairs_from_line(line) for line in lines]
    done = [False for _ in pairs]
    while not all(done):
        for i in range(len(pairs)):
            if not done[i]:
                try:
                    key_value_tuple = next(pairs[i])
                    yield key_value_tuple
                except StopIteration:
                    done[i] = True

STATUS = "Status"
columns = [STATUS]

d = {}

with open("data.csv", "rt") as f:
    # get an iterator that lets us pull lines conveniently from file
    itr = iter(f)

    # pull first line and collect status
    line = next(itr)
    lst = line.split(COMMA)
    d[STATUS] = lst[3]

    # pull next lines and make sure the file is what we expected
    line = next(itr)
    assert "Extended Report" in line
    line = next(itr)
    assert "Job Spec numerical control" in line

    # pull all remaining lines and save in a list
    lines = [line.strip() for line in f]

for key, value in kvpairs_by_column_then_row(lines):
    columns.append(key)
    d[key] = value

with open("output.csv", "wt") as f:
    # write column headers line
    line = COMMA.join(columns)
    f.write(line + NEWLINE)
    # write data row
    line = COMMA.join(d[key] for key in columns)
    f.write(line + NEWLINE)
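The column-then-row ordering can also be written as a transpose: collect the pairs of each line into a list, then walk the lists in parallel with itertools.zip_longest. This is only a sketch of that alternative, not the code above. The names pairs_in_line and pairs_column_major are made up for illustration, and unlike the generator above it silently skips a key that sits in the very last cell with no value after it.

```python
from itertools import zip_longest

def pairs_in_line(line):
    """Return the (key, value) pairs found in one comma-separated line."""
    cells = [c.strip() for c in line.strip().split(",")]
    pairs = []
    i = 0
    while i < len(cells) - 1:      # a key needs a cell after it
        if cells[i]:
            pairs.append((cells[i], cells[i + 1]))
            i += 2                 # advance past the pair
        else:
            i += 1                 # advance past an empty cell
    return pairs

def pairs_column_major(lines):
    """Yield the first pair of every line, then the second pair of every line, and so on."""
    per_line = [pairs_in_line(line) for line in lines]
    for column in zip_longest(*per_line):
        for pair in column:
            if pair is not None:   # this line had fewer pairs than the longest one
                yield pair

lines = [
    "Job Number,3456,,,,,, Operator Id,clipper,",
    "Coder Machine Name,Caterpillar,,,,,,Job Start time,3/12/2013 6:22,",
    "Machine type,Stepper motor,,,,,,Job end time,3/12/2013 9:16,",
]
keys = [key for key, _ in pairs_column_major(lines)]
print(keys)
```

On the three sample lines this yields the keys in the order the question asks for: the left-hand column first (Job Number, Coder Machine Name, Machine type), then the right-hand column (Operator Id, Job Start time, Job end time).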
