简体   繁体   中英

parsing text file with JSON-like object into CSV

I have a text file containing key-value pairs, with the last two key-value pairs containing JSON-like objects that I would like to split out into columns and write with the other values, using the keys as column headings. The first three rows of the data file input.txt look like this:

InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::44.6743867864386,Length3dCenterToCenter::44.6768028159989,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.53962362760298}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::57.8689351603823,Length3dCenterToCenter::57.8700464193429,Tag::<NULL>,{StartPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.43363070193163}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::68.7161350545728,Length3dCenterToCenter::68.7172034962765,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.45819643838485}

and we eventually came up with something that worked, but there must be a much better way:

import csv
with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for i, line in enumerate(reader):
        mysplit = [item.split('::') for item in line if item.strip()]
        if not mysplit: # blank line
            continue
        keys, vals = zip(*mysplit)
        start_vals = [item.split('[%2C]') for item in mysplit[-2]]
        end_vals = [item.split('[%2C]') for item in mysplit[-1]]
        a=list(keys[0:-2])
        a.extend(['start1','start2','start3','end1','end2','end3'])
        b=list(vals[0:-2])
        b.append(start_vals[1][0])
        b.append(start_vals[1][1])
        b.append(start_vals[1][2][:-1])
        b.append(end_vals[1][0])
        b.append(end_vals[1][1])
        b.append(end_vals[1][2][:-1])
        if i == 0:
            # if first line: write header
            writer.writerow(a)
        writer.writerow(b)

which produces the output file output.csv that looks like this

InnerDiameterOrWidth,InnerHeight,Length2dCenterToCenter,Length3dCenterToCenter,Tag,start1,start2,start3,end1,end2,end3
0.1,0.1,44.6743867864386,44.6768028159989,<NULL>,7858.35924983374,1703.69341358077,-3.075,7822.85045874375,1730.80294308742,-3.53962362760298
0.1,0.1,57.8689351603823,57.8700464193429,<NULL>,7793.52927597915,1680.91224357457,-3.075,7822.85045874375,1730.80294308742,-3.43363070193163
0.1,0.1,68.7161350545728,68.7172034962765,<NULL>,7858.35924983374,1703.69341358077,-3.075,7793.52927597915,1680.91224357457,-3.45819643838485

We don't want to write code like this in the future.

What is the best way to read data like this?

I'd use:

from itertools import chain
import csv

_header_translate = {
    'StartPoint': ('start1', 'start2', 'start3'),
    'EndPoint': ('end1', 'end2', 'end3')
}

def header(col):
    header = col.strip('{}').split('::', 1)[0]
    return _header_translate.get(header, (header,))

def cleancolumn(col):
    col = col.strip('{}').split('::', 1)[1]
    return col.split('[%2C]')

def chainedmap(func, row):
    return list(chain.from_iterable(map(func, row)))

with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for i, row in enumerate(reader):
        if not i:  # first row, write header first
            writer.writerow(chainedmap(header, row))
        writer.writerow(chainedmap(cleancolumn, row))

The cleancolumn method takes any of your columns and returns a tuple (possibly with only one value) after removing the braces, removing everything before the first :: and splitting on the embedded 'comma'. By using itertools.chain.from_iterable() we turn the series of tuples generated from the columns into one list again for the csv writer.

When handling the first line we generate one header row from the same columns, replacing the StartPoint and EndPoint headers with the 6 expanded headers.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM