How to parse a text file in Python and convert it to JSON

I have a large file formatted like the following:

"string in quotes"
string
string
string
number
|-

...this repeats for a while. I'm trying to convert it to JSON, so that each chunk looks like this:

"name": "string in quotes"
"description": "string"
"info": "string"
"author": "string"
"year": number

This is what I have so far:

import shutil
import os
import urllib

myFile = open('unformatted.txt','r')
newFile = open("formatted.json", "w")

newFile.write('{'+'\n'+'list: {'+'\n')

for line in myFile:
    newFile.write()  # this is where I'm not sure what to write

newFile.write('}'+'\n'+'}')

myFile.close()
newFile.close()

I think I could do something with the line number modulo some number, but I'm not sure if that's the right way to go about it.
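For reference, the modulo idea can be made to work. A minimal sketch, assuming every record is exactly five field lines followed by one `|-` separator line (six lines per record); the helper name `parse_by_modulo` and the sample year are made up for illustration:

```python
import json

FIELDS = ["name", "description", "info", "author", "year"]


def parse_by_modulo(lines, period=6):
    """Group lines into records, using line_number % period as the field index."""
    records = []
    current = {}
    for line_number, line in enumerate(lines):
        position = line_number % period
        if position < len(FIELDS):
            current[FIELDS[position]] = line.strip()
        if position == len(FIELDS) - 1:
            # Fifth field reached: the record is complete.
            records.append(current)
            current = {}
        # position == 5 is the "|-" separator line and is skipped.
    return records


# Made-up sample data in the format described above.
sample = ['"string in quotes"', "string", "string", "string", "1999", "|-"]
print(json.dumps(parse_by_modulo(sample)))
```

This breaks as soon as a record has a missing or extra line, which is why grouping on the separator itself tends to be more robust.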

You can use itertools.groupby to group the lines of each section, then json.dump each resulting dict to your JSON file, one object per line:

from itertools import groupby
import json
names = ["name", "description","info","author", "year"]

with open("test.csv") as f, open("out.json","w") as out:
    grouped = groupby(map(str.rstrip,f), key=lambda x: x.startswith("|-"))
    for k,v in grouped:
        if not k:
            json.dump(dict(zip(names,v)),out)
            out.write("\n")

Input:

"string in quotes"
string
string
string
number
|-
"other string in quotes"
string2
string2
string2
number2

Output:

{"author": "string", "name": "\"string in quotes\"", "description": "string", "info": "string", "year": "number"}
{"author": "string2", "name": "\"other string in quotes\"", "description": "string2", "info": "string2", "year": "number2"}

To read the records back, iterate over the file and call json.loads on each line:

In [6]: with open("out.json") as out:
            for line in out:
                 print(json.loads(line))
   ...:         
{'name': '"string in quotes"', 'info': 'string', 'author': 'string', 'year': 'number', 'description': 'string'}
{'name': '"other string in quotes"', 'info': 'string2', 'author': 'string2', 'year': 'number2', 'description': 'string2'}
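If you would rather collect the records in a list than print them, the same loop can do that. A small sketch; `load_json_lines` is a hypothetical helper name, and it assumes the file holds one JSON object per line, as written by the code above:

```python
import json


def load_json_lines(path):
    """Read a file containing one JSON object per line into a list of dicts."""
    records = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

Calling `load_json_lines("out.json")` would then return a list with one dict per record.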

I think this would do the trick.

import itertools
import json

with open('unformatted.txt', 'r') as f_in, open('formatted.json', 'w') as f_out:
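    # Six references to the same file iterator make zip_longest read six lines per record.
    # zip_longest is the Python 3 name; in Python 2 it is itertools.izip_longest.
    for name, desc, info, author, yr, ignore in itertools.zip_longest(*[f_in]*6):
        record = {
            # Wraps the name in literal quotes; drop them if the input line already has quotes.
            "name": '"' + name.strip() + '"',
            "description": desc.strip(),
            "info": info.strip(),
            "author": author.strip(),
            "year": int(yr.strip()),
        }
        # Write one JSON object per line.
        f_out.write(json.dumps(record) + '\n')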
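The `*[f_in]*6` part passes the same iterator six times, so each tuple consumes six consecutive lines. Here is a self-contained sketch of that idiom on an in-memory list, using Python 3's `itertools.zip_longest`; the sample values are made up:

```python
import itertools

lines = ["a", "b", "c", "d", "e", "|-", "f", "g", "h", "i", "j"]
shared = iter(lines)  # one iterator, referenced six times below

# Each output tuple pulls six consecutive items from the shared iterator;
# a short final group is padded with None.
groups = list(itertools.zip_longest(*[shared] * 6))
print(groups)
```

The padding matters when the file does not end with a `|-` line: the last record still comes through, with `None` in the separator slot.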

This is a rough example which does the basic job.

It uses one generator to split the input into batches of six lines, and another to pair each value with its key.

import json


def read():
    with open('input.txt', 'r') as f:
        return [l.strip() for l in f.readlines()]


def batch(content, n=1):
    length = len(content)
    for num_idx in range(0, length, n):
        yield content[num_idx:min(num_idx+n, length)]


def emit(batched):
    for n, name in enumerate([
        'name', 'description', 'info', 'author', 'year'
    ]):
        yield name, batched[n]

content = read()
batched = batch(content, 6)
res = [dict(emit(b)) for b in batched]

print(res)

with open('output.json', 'w') as f:
    f.write(json.dumps(res, indent=4))
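To see what the batching step produces on its own, here is a sketch on an in-memory list. It restates `batch` without the `min` call, since slicing past the end of a list is safe in Python; the sample values are made up:

```python
def batch(content, n=1):
    """Yield consecutive slices of at most n items."""
    for start in range(0, len(content), n):
        yield content[start:start + n]


lines = ["a", "b", "c", "d", "e", "|-", "f", "g", "h", "i", "j"]
batches = list(batch(lines, 6))
print(batches)
```

The last batch has only five items when the file does not end with a separator, which is still enough for the five keys.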

Update

Using this approach you can easily hook in formatting functions, so that the year is stored as a number and the name loses its extra quotes.

Extend the emit function like this:

def emit(batched):
    def _quotes(q):
        return q.replace('"', '')

    def _pass(p):
        return p

    def _num(n):
        try:
            return int(n)
        except ValueError:
            return n

    for n, (name, func) in enumerate([
        ('name', _quotes),
        ('description', _pass),
        ('info', _pass),
        ('author', _pass),
        ('year', _num)
    ]):
        yield name, func(batched[n])
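A quick way to check the converters is to run them on a single batch. The sketch below restates the same idea with a module-level list of (key, converter) pairs; the names `strip_quotes`, `to_int` and `CONVERTERS`, and the sample batch, are made up for illustration:

```python
def strip_quotes(text):
    return text.replace('"', '')


def to_int(text):
    try:
        return int(text)
    except ValueError:
        return text  # leave non-numeric values unchanged


CONVERTERS = [
    ("name", strip_quotes),
    ("description", str),
    ("info", str),
    ("author", str),
    ("year", to_int),
]


def emit(batched):
    # zip stops after the five keys, so a trailing "|-" separator is ignored.
    for (key, convert), value in zip(CONVERTERS, batched):
        yield key, convert(value)


sample_batch = ['"string in quotes"', "string", "string", "string", "1999", "|-"]
record = dict(emit(sample_batch))
print(record)
```

With this input the name comes back without its surrounding quotes and the year as an integer.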
