How to parse a text file in Python and convert it to JSON
I have a large file formatted like the following:
"string in quotes"
string
string
string
number
|-
...this repeats for a while. I'm trying to convert it to JSON, so that each of the chunks looks like this:
"name": "string in quotes"
"description": "string"
"info": "string"
"author": "string"
"year": number
This is what I have so far:
import shutil
import os
import urllib

myFile = open('unformatted.txt','r')
newFile = open("formatted.json", "w")
newFile.write('{'+'\n'+'list: {'+'\n')
for line in myFile:
    newFile.write()  # this is where I'm not sure what to write
newFile.write('}'+'\n'+'}')
myFile.close()
newFile.close()
I think I could do something with the line number modulo the block size, but I'm not sure if that's the right way to go about it.
You can use itertools.groupby to group all the sections, then json.dump the dicts to your JSON file:
from itertools import groupby
import json

names = ["name", "description", "info", "author", "year"]

with open("test.csv") as f, open("out.json", "w") as out:
    # Group consecutive lines by whether they are a "|-" separator
    grouped = groupby(map(str.rstrip, f), key=lambda x: x.startswith("|-"))
    for k, v in grouped:
        if not k:  # skip the separator groups
            json.dump(dict(zip(names, v)), out)
            out.write("\n")
Input:
"string in quotes"
string
string
string
number
|-
"other string in quotes"
string2
string2
string2
number2
Output:
{"author": "string", "name": "\"string in quotes\"", "description": "string", "info": "string", "year": "number"}
{"author": "string2", "name": "\"other string in quotes\"", "description": "string2", "info": "string2", "year": "number2"}
To read the records back, iterate over the file and call json.loads on each line:
In [6]: with open("out.json") as out:
   ...:     for line in out:
   ...:         print(json.loads(line))
   ...:
{'name': '"string in quotes"', 'info': 'string', 'author': 'string', 'year': 'number', 'description': 'string'}
{'name': '"other string in quotes"', 'info': 'string2', 'author': 'string2', 'year': 'number2', 'description': 'string2'}
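Note that the output above still carries the literal quotes inside name and keeps year as a string. If that is not what you want, the same groupby idea can be wrapped in a small generator that cleans up each record. This is only a sketch: parse_records is a name made up here, and it assumes the name is wrapped in double quotes and the year is normally an integer.

```python
from itertools import groupby
import json

NAMES = ["name", "description", "info", "author", "year"]  # same keys as above


def parse_records(lines):
    """Yield one dict per block of lines delimited by "|-" separators."""
    stripped = (line.rstrip() for line in lines)
    for is_sep, block in groupby(stripped, key=lambda s: s.startswith("|-")):
        if is_sep:
            continue  # separator lines carry no data
        record = dict(zip(NAMES, block))
        # Assumption: the name is wrapped in double quotes that should be dropped.
        record["name"] = record["name"].strip('"')
        # Assumption: the year is an integer; leave it as text if it is not.
        if "year" in record:
            try:
                record["year"] = int(record["year"])
            except ValueError:
                pass
        yield record


# Made-up sample lines standing in for the real file
sample = ['"Some title"\n', "a description\n", "some info\n", "an author\n", "1999\n", "|-\n"]
records = list(parse_records(sample))
print(json.dumps(records[0]))
```

Because parse_records accepts any iterable of lines, an open file object can be passed in directly instead of the sample list.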
I think this would do the trick.
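import itertools
import json

with open('unformatted.txt', 'r') as f_in, open('formatted.json', 'w') as f_out:
    # Read the file six lines at a time: five fields plus the "|-" separator.
    # (itertools.izip_longest is Python 2 only; Python 3 calls it zip_longest.)
    for name, desc, info, author, yr, ignore in itertools.zip_longest(*[f_in]*6):
        record = {
            # The name already has quotes in the file; json.dumps adds its own
            "name": name.strip().strip('"'),
            "description": desc.strip(),
            "info": info.strip(),
            "author": author.strip(),
            "year": int(yr.strip()),
        }
        # One JSON object per line
        f_out.write(json.dumps(record) + '\n')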
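The `*[f_in]*6` part is the bit that usually needs explaining, so here is a small sketch of the same idiom on a plain list iterator (the sample values are made up). It uses the Python 3 name zip_longest.

```python
from itertools import zip_longest

lines = iter(["a", "b", "c", "d", "e", "|-", "f", "g", "h", "i", "j"])

# [lines] * 6 is a list holding the SAME iterator six times, so every tuple
# that zip_longest builds pulls six consecutive items from it.
chunks = list(zip_longest(*[lines] * 6))

print(chunks[0])  # first block, including its "|-" separator
print(chunks[1])  # last block, padded with None because it has no separator
```

The padding is why the last slot of each tuple can simply be ignored in the loop: it is either the separator or None.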
This is a rough example which does the basic job.
It uses one generator to split the input into batches of 6 lines first, and another one to pair the keys with the values.
import json

def read():
    # Load every line of the input, stripped of surrounding whitespace
    with open('input.txt', 'r') as f:
        return [l.strip() for l in f.readlines()]

def batch(content, n=1):
    # Yield consecutive slices of at most n items
    length = len(content)
    for num_idx in range(0, length, n):
        yield content[num_idx:min(num_idx+n, length)]

def emit(batched):
    # Pair each key with the value at the same position in the batch
    for n, name in enumerate([
        'name', 'description', 'info', 'author', 'year'
    ]):
        yield name, batched[n]

content = read()
batched = batch(content, 6)
res = [dict(emit(b)) for b in batched]
print(res)
with open('output.json', 'w') as f:
    f.write(json.dumps(res, indent=4))
Update
Using this approach you can easily hook in formatting functions so the year and name values will be correct.
Extend the emit function like this:
def emit(batched):
    def _quotes(q):
        # Drop the literal double quotes around the name
        return q.replace('"', '')
    def _pass(p):
        return p
    def _num(n):
        # Convert to int when possible, otherwise keep the text
        try:
            return int(n)
        except ValueError:
            return n
    for n, (name, func) in enumerate([
        ('name', _quotes),
        ('description', _pass),
        ('info', _pass),
        ('author', _pass),
        ('year', _num)
    ]):
        yield name, func(batched[n])
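To see what the extended emit produces without needing input.txt, here is a self-contained sketch that runs the same batching and conversion steps on an in-memory list. The sample values, the CONVERTERS table and the to_int helper are made up for illustration; the table is just a flattened version of the nested functions above.

```python
import json


def batch(content, n=1):
    """Yield consecutive slices of at most n items."""
    for start in range(0, len(content), n):
        yield content[start:start + n]


def to_int(value):
    """Return value as an int when possible, otherwise unchanged."""
    try:
        return int(value)
    except ValueError:
        return value


# Hypothetical (field, converter) table mirroring the extended emit above
CONVERTERS = [
    ("name", lambda s: s.replace('"', "")),
    ("description", str),
    ("info", str),
    ("author", str),
    ("year", to_int),
]


def emit(batched):
    # zip stops after the five fields, so the "|-" separator is ignored
    for (field, convert), value in zip(CONVERTERS, batched):
        yield field, convert(value)


# Made-up sample standing in for the stripped lines of the input file
content = ['"First title"', "desc one", "info one", "author one", "1965", "|-",
           '"Second title"', "desc two", "info two", "author two", "unknown"]
res = [dict(emit(b)) for b in batch(content, 6)]
print(json.dumps(res, indent=4))
```

The second record shows the fallback: a year that is not a number is kept as a string instead of raising an error.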