简体   繁体   English

将带有JSON类对象的文本文件解析为CSV

[英]parsing text file with JSON-like object into CSV

I have a text file containing key-value pairs, with the last two key-value pairs containing JSON-like objects that I would like to split out into columns and write with the other values, using the keys as column headings. 我有一个包含键值对的文本文件,最后两个键值对包含类似JSON的对象,我想将这些对象拆分为列,并使用键作为列标题与其他值一起写入。 The first three rows of the data file input.txt look like this: 数据文件input.txt前三行如下所示:

InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::44.6743867864386,Length3dCenterToCenter::44.6768028159989,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.53962362760298}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::57.8689351603823,Length3dCenterToCenter::57.8700464193429,Tag::<NULL>,{StartPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.43363070193163}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::68.7161350545728,Length3dCenterToCenter::68.7172034962765,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.45819643838485}

and we eventually came up with something that worked, but there must be a much better way: 我们最终想出了一些可行的方法,但是必须有一种更好的方法:

import csv
with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for i, line in enumerate(reader):
        mysplit = [item.split('::') for item in line if item.strip()]
        if not mysplit: # blank line
            continue
        keys, vals = zip(*mysplit)
        start_vals = [item.split('[%2C]') for item in mysplit[-2]]
        end_vals = [item.split('[%2C]') for item in mysplit[-1]]
        a=list(keys[0:-2])
        a.extend(['start1','start2','start3','end1','end2','end3'])
        b=list(vals[0:-2])
        b.append(start_vals[1][0])
        b.append(start_vals[1][1])
        b.append(start_vals[1][2][:-1])
        b.append(end_vals[1][0])
        b.append(end_vals[1][1])
        b.append(end_vals[1][2][:-1])
        if i == 0:
            # if first line: write header
            writer.writerow(a)
        writer.writerow(b)

which produces the output file output.csv that looks like this 生成如下输出文件output.csv

InnerDiameterOrWidth,InnerHeight,Length2dCenterToCenter,Length3dCenterToCenter,Tag,start1,start2,start3,end1,end2,end3
0.1,0.1,44.6743867864386,44.6768028159989,<NULL>,7858.35924983374,1703.69341358077,-3.075,7822.85045874375,1730.80294308742,-3.53962362760298
0.1,0.1,57.8689351603823,57.8700464193429,<NULL>,7793.52927597915,1680.91224357457,-3.075,7822.85045874375,1730.80294308742,-3.43363070193163
0.1,0.1,68.7161350545728,68.7172034962765,<NULL>,7858.35924983374,1703.69341358077,-3.075,7793.52927597915,1680.91224357457,-3.45819643838485

We don't want to write code like this in the future. 我们不想将来再写这样的代码。

What is the best way to read data like this? 读取这样的数据的最佳方法是什么?

I'd use: 我会用:

from itertools import chain
import csv

_header_translate = {
    'StartPoint': ('start1', 'start2', 'start3'),
    'EndPoint': ('end1', 'end2', 'end3')
}

def header(col):
    header = col.strip('{}').split('::', 1)[0]
    return _header_translate.get(header, (header,))

def cleancolumn(col):
    col = col.strip('{}').split('::', 1)[1]
    return col.split('[%2C]')

def chainedmap(func, row):
    return list(chain.from_iterable(map(func, row)))

with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for i, row in enumerate(reader):
        if not i:  # first row, write header first
            writer.writerow(chainedmap(header, row))
        writer.writerow(chainedmap(cleancolumn, row))

The cleancolumn method takes any of your columns and returns a tuple (possibly with only one value) after removing the braces, removing everything before the first :: and splitting on the embedded 'comma'. cleancolumn方法使用您的任何列,并在删除花括号,删除第一个::之前的所有内容并在嵌入式“逗号”上分割后返回一个元组(可能只有一个值)。 By using itertools.chain.from_iterable() we turn the series of tuples generated from the columns into one list again for the csv writer. 通过使用itertools.chain.from_iterable()我们将从列生成的元组系列再次转换为csv编写器的一个列表。

When handling the first line we generate one header row from the same columns, replacing the StartPoint and EndPoint headers with the 6 expanded headers. 处理第一行时,我们从相同的列中生成一个标题行,用6个扩展标题替换StartPointEndPoint标题。

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM