[英]parsing text file with JSON-like object into CSV
I have a text file containing key-value pairs, with the last two key-value pairs containing JSON-like objects that I would like to split out into columns and write with the other values, using the keys as column headings. 我有一个包含键值对的文本文件,最后两个键值对包含类似JSON的对象,我想将这些对象拆分为列,并使用键作为列标题与其他值一起写入。 The first three rows of the data file input.txt
look like this: 数据文件input.txt
前三行如下所示:
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::44.6743867864386,Length3dCenterToCenter::44.6768028159989,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.53962362760298}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::57.8689351603823,Length3dCenterToCenter::57.8700464193429,Tag::<NULL>,{StartPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.075},{EndPoint::7822.85045874375[%2C]1730.80294308742[%2C]-3.43363070193163}
InnerDiameterOrWidth::0.1,InnerHeight::0.1,Length2dCenterToCenter::68.7161350545728,Length3dCenterToCenter::68.7172034962765,Tag::<NULL>,{StartPoint::7858.35924983374[%2C]1703.69341358077[%2C]-3.075},{EndPoint::7793.52927597915[%2C]1680.91224357457[%2C]-3.45819643838485}
and we eventually came up with something that worked, but there must be a much better way: 我们最终想出了一些可行的方法,但是必须有一种更好的方法:
import csv
with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
reader = csv.reader(fin)
writer = csv.writer(fout)
for i, line in enumerate(reader):
mysplit = [item.split('::') for item in line if item.strip()]
if not mysplit: # blank line
continue
keys, vals = zip(*mysplit)
start_vals = [item.split('[%2C]') for item in mysplit[-2]]
end_vals = [item.split('[%2C]') for item in mysplit[-1]]
a=list(keys[0:-2])
a.extend(['start1','start2','start3','end1','end2','end3'])
b=list(vals[0:-2])
b.append(start_vals[1][0])
b.append(start_vals[1][1])
b.append(start_vals[1][2][:-1])
b.append(end_vals[1][0])
b.append(end_vals[1][1])
b.append(end_vals[1][2][:-1])
if i == 0:
# if first line: write header
writer.writerow(a)
writer.writerow(b)
which produces the output file output.csv
that looks like this 生成如下输出文件output.csv
InnerDiameterOrWidth,InnerHeight,Length2dCenterToCenter,Length3dCenterToCenter,Tag,start1,start2,start3,end1,end2,end3
0.1,0.1,44.6743867864386,44.6768028159989,<NULL>,7858.35924983374,1703.69341358077,-3.075,7822.85045874375,1730.80294308742,-3.53962362760298
0.1,0.1,57.8689351603823,57.8700464193429,<NULL>,7793.52927597915,1680.91224357457,-3.075,7822.85045874375,1730.80294308742,-3.43363070193163
0.1,0.1,68.7161350545728,68.7172034962765,<NULL>,7858.35924983374,1703.69341358077,-3.075,7793.52927597915,1680.91224357457,-3.45819643838485
We don't want to write code like this in the future. 我们不想将来再写这样的代码。
What is the best way to read data like this? 读取这样的数据的最佳方法是什么?
I'd use: 我会用:
from itertools import chain
import csv
_header_translate = {
'StartPoint': ('start1', 'start2', 'start3'),
'EndPoint': ('end1', 'end2', 'end3')
}
def header(col):
header = col.strip('{}').split('::', 1)[0]
return _header_translate.get(header, (header,))
def cleancolumn(col):
col = col.strip('{}').split('::', 1)[1]
return col.split('[%2C]')
def chainedmap(func, row):
return list(chain.from_iterable(map(func, row)))
with open('input.txt', 'rb') as fin, open('output.csv', 'wb') as fout:
reader = csv.reader(fin)
writer = csv.writer(fout)
for i, row in enumerate(reader):
if not i: # first row, write header first
writer.writerow(chainedmap(header, row))
writer.writerow(chainedmap(cleancolumn, row))
The cleancolumn
method takes any of your columns and returns a tuple (possibly with only one value) after removing the braces, removing everything before the first ::
and splitting on the embedded 'comma'. cleancolumn
方法使用您的任何列,并在删除花括号,删除第一个::
之前的所有内容并在嵌入式“逗号”上分割后返回一个元组(可能只有一个值)。 By using itertools.chain.from_iterable()
we turn the series of tuples generated from the columns into one list again for the csv writer. 通过使用itertools.chain.from_iterable()
我们将从列生成的元组系列再次转换为csv编写器的一个列表。
When handling the first line we generate one header row from the same columns, replacing the StartPoint
and EndPoint
headers with the 6 expanded headers. 处理第一行时,我们从相同的列中生成一个标题行,用6个扩展标题替换StartPoint
和EndPoint
标题。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.