Elegantly and extensibly convert file of individual lines of valid JSON to CSV with Python

So initially my approach was this:

output_file = csv.writer(open('transactions000000000029.csv', 'wb+'))
for line in inpt:
    resource = json.loads(line)

    output_file.writerow(['blockNumber','blockHash','hash','from','to','gas','gasUsed','gasPrice','input','logs','nonce','value','timestamp'])
    output_file.writerow(
               [resource['blockNumber'],
                resource['blockHash'],
                resource['hash'],
                resource['from'],
                resource['to'],
                resource['gas'],
                resource['gasUsed'],
                resource['gasPrice'],
                resource['input'],
                resource['logs'],
                resource['nonce'],
                resource['value'],
                resource['timestamp']]
    )

Here's an example of the data I'm using:

{"blockNumber":"1941895","blockHash":"0x53464299a83cecc3e4d930b617c9518b8f74139265423d8110a919f5180bec79","hash":"0x0abe75e40a954d4d355e25e4498f3580e7d029769897d4187c323080a0be0fdd","from":"0x4586ffaf28e08b1613dd96ced9b57d52e8ad9d72","to":"0x91337a300e0361bddb2e377dd4e88ccb7796663d","gas":"21000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"1","value":"0x22c06103f88111000","timestamp":"2016-07-24 20:47:25 UTC"}
{"blockNumber":"1941645","blockHash":"0x78804d09bb4e7126f53133e33e3548e0f04a691c01661ab9b719c3811e54355e","hash":"0x22c2b6490900b21d67ca56066e127fa57c0af973b5d166ca1a4bf52fcb6cf81c","from":"0x81bbf9f19ffe8368efe7611ccf5dcbdb4618b645","to":"0xb01a7866a244dbb600a7bbd170d43d4221838868","gas":"90000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"0","value":"0x4563918244f40000","timestamp":"2016-07-24 19:57:50 UTC"}
{"blockNumber":"1941910","blockHash":"0xc7ba89fc0110a033c4bd03be4505014761141b956c228bc51ec49c15a4508ce4","hash":"0x8570106b0385caf729a17593326db1afe0d75e3f8c6daef25cd4a0499a873a6f","from":"0x91337a300e0361bddb2e377dd4e88ccb7796663d","to":"0x9fde2180b544b7690c35bdc66182eb843ac38030","gas":"90000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"6356","value":"0x41e92b66341ef0000","timestamp":"2016-07-24 20:50:12 UTC"}
{"blockNumber":"1941919","blockHash":"0x4785c1b1a678cf7058e1fed3fc1c7d33c4326c2fb309f5fc75688f23d496b61c","hash":"0x8adfe7fc3cf0eb34bb56c59fa3dc4fdd3ec3f3514c0100fef800f065219b7707","from":"0x69ca903e87329fd63a3c7b2d3efde6a9bf3c3d45","to":"0xbfc39b6f805a9e40e77291aff27aee3c96915bdd","gas":"40000","gasUsed":"29130","gasPrice":"30000000000","input":"","logs":[{"address":"0xbfc39b6f805a9e40e77291aff27aee3c96915bdd","topics":["0x23919512b2162ddc59b67a65e3b03c419d4105366f7d4a632f5d3c3bee9b1cff"],"data":"AAAAAAAAAAAAAAAAwNMyg48U70L83hzyUYxCfdtnZyk="}],"nonce":"20","value":"0x1d2eb2accbaf90800","timestamp":"2016-07-24 20:52:08 UTC"}
{"blockNumber":"1941922","blockHash":"0xd46dbf526f6d7c9197e841c8a4d7b2f4abdac4a62860cffabb943a46d07a86d4","hash":"0x8b0fe2b7727664a14406e7377732caed94315b026b37577e2d9d258253067553","from":"0x0b2c5cba2dc240e867f7721412c20e6016596d26","to":"0x9c83fe12c7575ea7350019e04253d3620957851f","gas":"21000","gasUsed":"21000","gasPrice":"21000000000","input":"","logs":[],"nonce":"2","value":"0x7ce66c50e2840000","timestamp":"2016-07-24 20:52:51 UTC"}
{"blockNumber":"1941688","blockHash":"0x86bb1e90d0fa7be11d3f196057976383bb73cbd1596992e868155a576b5ddfb9","hash":"0x244b29b60c696f4ab07c36342344fe6116890f8056b4abc9f734f7a197c93341","from":"0x006cdc135b4e3a89d3ac1027ec3de609b8fff500","to":"0x58ae42a38d6b33a1e31492b60465fa80da595755","gas":"50000","gasUsed":"50000","gasPrice":"20000000000","input":"","logs":[],"nonce":"47","value":"0xc7140013deaf40","error":"invalid jump destination (PUSH1) 2","timestamp":"2016-07-24 20:06:38 UTC"}
{"blockNumber":"1941794","blockHash":"0x41ee74e34cbf9ef4116febea958dbc260e2da3a6bf6f601bfaeb2cd9ab944a29","hash":"0xf2b5b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63","from":"0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9","to":"0x2ba24c66cbff0bda0e3053ea07325479b3ed1393","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"14","value":"0x24406420d09ce7440000","timestamp":"2016-07-24 20:28:11 UTC"}
{"blockNumber":"1941716","blockHash":"0x75e1602cad967a781f4a2ea9e19c97405fe1acaa8b9ad333fb7288d98f7b49e3","hash":"0xf8f2a397b0f7bb1ff212b6bcc57e4a56ce3e27eb9f5839fef3e193c0252fab26","from":"0xa0480c6f402b036e33e46f993d9c7b93913e7461","to":"0xb2ea1f1f997365d1036dd6f00c51b361e9a3f351","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"1","value":"0xde0b6b3a7640000","timestamp":"2016-07-24 20:12:17 UTC"}
{"blockNumber":"1941794","blockHash":"0x41ee74e34cbf9ef4116febea958dbc260e2da3a6bf6f601bfaeb2cd9ab944a29","hash":"0xf275b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63","from":"0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9","to":"0x2ba24c66cbff0bda0e3053ea07325479b3ed1393","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"14","value":"0x24406420d09ce7440000","timestamp":"2016-07-24 20:28:11 UTC"}
{"blockNumber":"1941794","blockHash":"0x41ee74e34cbf9ef4116febea958dbc260e2da3a6bf6f601bfaeb2cd9ab944a29","hash":"0xf285b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63","from":"0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9","to":"0x2ba24c66cbff0bda0e3053ea07325479b3ed1393","gas":"121000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"14","value":"0x24406420d09ce7440000","timestamp":"2016-07-24 20:28:11 UTC"}
{"blockNumber":"1941895","blockHash":"0x53464299a83cecc3e4d930b617c9518b8f74139265423d8110a919f5180bec79","hash":"0x0abg75e40a954d4d355e25e4498f3580e7d029769897d4187c323080a0be0fdd","from":"0x4586ffaf28e08b1613dd96ced9b57d52e8ad9d72","to":"0x91337a300e0361bddb2e377dd4e88ccb7796663d","gas":"21000","gasUsed":"21000","gasPrice":"20000000000","input":"","logs":[],"nonce":"1","value":"0x22c06103f88111000","timestamp":"2016-07-24 20:47:25 UTC"}

But when I execute it on my real dataset, it breaks with the following error:

Traceback (most recent call last):
  File "csv-ifier.py", line 19, in <module>
    resource['to'],
KeyError: 'to'

My question is: is there a way to do this more flexibly/elegantly/dynamically? Perhaps without having to specify all of the fields in advance?

Do I need to use try/except so that it doesn't break?

I would store the list of keys and, for every row written, look each key up in the dictionary with dict.get, supplying an empty string as the default value:

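import csv
import json

keys = ['blockNumber','blockHash','hash','from','to','gas','gasUsed','gasPrice','input','logs','nonce','value','timestamp']

# newline='' keeps the csv module from writing blank rows on Windows
with open('transactions000000000029.csv', 'w', newline='') as f:
    output_file = csv.writer(f)
    output_file.writerow(keys)
    for line in inpt:  # inpt is the open input file from the question
        resource = json.loads(line)
        # dict.get returns "" for a missing key instead of raising KeyError
        output_file.writerow([resource.get(k,"") for k in keys])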
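To see the effect of dict.get in isolation, here is a small self-contained sketch. The two sample records and the shortened key list are made up for illustration, and io.StringIO stands in for the real output file:

```python
import csv
import io
import json

# Hypothetical two-line sample; the second record has no 'to' key.
sample_lines = [
    '{"hash": "0xaa", "from": "0x01", "to": "0x02"}',
    '{"hash": "0xbb", "from": "0x03"}',
]
keys = ['hash', 'from', 'to']

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(keys)
for line in sample_lines:
    resource = json.loads(line)
    # dict.get returns "" instead of raising KeyError for a missing key
    writer.writerow([resource.get(k, "") for k in keys])

csv_text = buffer.getvalue()
print(csv_text)
```

The second data row ends with an empty cell where 'to' would have been, so the loop no longer stops on incomplete records.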

There is an even better way: use csv.DictWriter with restval set to an empty string, so that a missing key is written as an empty cell instead of causing an error. Passing extrasaction='ignore' makes the writer skip keys that are in the dictionary but not in the key list; without it, DictWriter raises ValueError for such keys (one record in your sample has an extra "error" key, for example):

import csv
import json

keys = ['blockNumber','blockHash','hash','from','to','gas','gasUsed','gasPrice','input','logs','nonce','value','timestamp']

# newline='' keeps the csv module from writing blank rows on Windows
with open('transactions000000000029.csv', 'w', newline='') as f:
    output_file = csv.DictWriter(f,fieldnames=keys,extrasaction='ignore',restval="")
    output_file.writeheader()
    for line in inpt:  # inpt is the open input file from the question
        resource = json.loads(line)
        output_file.writerow(resource)
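If you really don't want to list the fields in advance, one option is a two-pass approach: first collect every key that appears in the input, then write the rows with DictWriter. This is only a sketch, not something from the answer above. The helper names collect_fieldnames and jsonl_to_csv are hypothetical, and it assumes the input fits in memory (for a large file you could read the file twice instead):

```python
import csv
import io
import json

def collect_fieldnames(lines):
    """First pass: gather every key seen, in order of first appearance."""
    fieldnames = []
    seen = set()
    for line in lines:
        for key in json.loads(line):
            if key not in seen:
                seen.add(key)
                fieldnames.append(key)
    return fieldnames

def jsonl_to_csv(lines, out):
    """Second pass: write the rows, leaving absent keys as empty cells."""
    lines = list(lines)  # assumption: input is small enough to hold in memory
    writer = csv.DictWriter(out, fieldnames=collect_fieldnames(lines),
                            restval="", lineterminator="\n")
    writer.writeheader()
    for line in lines:
        writer.writerow(json.loads(line))

# Made-up sample: the records do not share the same set of keys.
sample = [
    '{"hash": "0xaa", "to": "0x02"}',
    '{"hash": "0xbb", "error": "invalid jump destination"}',
]
buffer = io.StringIO()  # stands in for the real output file
jsonl_to_csv(sample, buffer)
result = buffer.getvalue()
print(result)
```

The header here becomes hash,to,error. Column order follows first appearance, so if you need a fixed order, the explicit key list is still the simpler choice.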

If you're not opposed to using pandas, you could potentially do something like this:

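import pandas as pd

# lines=True reads one JSON object per line; the input file name is a placeholder
df = pd.read_json('transactions000000000029.json', lines=True)
# to_csv is a DataFrame method, not a module-level function
df.to_csv('realcsv.csv', index=False)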

This assumes that pandas can parse your JSON input. For a file with one JSON object per line, read_json needs lines=True.

You could define a custom getter that checks whether the key is in the dictionary before trying to get it. Also, you should write the header outside the for loop:

def custom_getter(my_dict, my_key):
    # If the key is in the dictionary, we return its value
    if my_key in my_dict:
        return my_dict[my_key]
    # If the key is NOT in the dictionary, we return an empty string
    return ''

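# Python 3: open in text mode with newline='' (the 'wb+' mode only works with csv on Python 2)
output_file = csv.writer(open('transactions000000000029.csv', 'w', newline=''))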
output_file.writerow(['blockNumber','blockHash','hash','from','to','gas','gasUsed','gasPrice','input','logs','nonce','value','timestamp'])
for line in inpt:
    resource = json.loads(line)
    output_file.writerow(
               [custom_getter(resource,'blockNumber'),
                custom_getter(resource,'blockHash'),
                custom_getter(resource,'hash'),
                custom_getter(resource,'from'),
                custom_getter(resource,'to'),
                custom_getter(resource,'gas'),
                custom_getter(resource,'gasUsed'),
                custom_getter(resource,'gasPrice'),
                custom_getter(resource,'input'),
                custom_getter(resource,'logs'),
                custom_getter(resource,'nonce'),
                custom_getter(resource,'value'),
                custom_getter(resource,'timestamp')]
    )

It would be a good idea to also apply what @Jean-François Fabre suggested: declare the keys up front and then use a list comprehension to build the output row. If you want, I can add it to my answer.

EDIT

I went ahead and edited anyway, because I wanted to leave a better piece of code:

def custom_getter(my_dict, my_key):
    # If the key is in the dictionary, we return its value
    if my_key in my_dict:
        return my_dict[my_key]
    # If the key is NOT in the dictionary, we return an empty string
    return ''

keys = ['blockNumber','blockHash','hash','from','to','gas','gasUsed','gasPrice','input','logs','nonce','value','timestamp']
# Python 3: text mode with newline=''; the with block closes the file when done
with open('transactions000000000029.csv', 'w', newline='') as f:
    output_file = csv.writer(f)
    output_file.writerow(keys)
    for line in inpt:  # inpt is the open input file from the question
        resource = json.loads(line)
        output_file.writerow([custom_getter(resource, k) for k in keys])
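One thing the snippets in this thread leave open is the logs field, which is a list (sometimes of nested objects) in the sample data. csv.writer writes non-string values using str(), so such a cell ends up as a Python repr rather than JSON. If you would rather keep that column as JSON text, a sketch along these lines might help. The helper name flatten_cell and the sample record are made up for illustration:

```python
import csv
import io
import json

def flatten_cell(value):
    """Serialize lists/dicts as compact JSON; leave scalars untouched."""
    if isinstance(value, (list, dict)):
        return json.dumps(value, separators=(",", ":"))
    return value

keys = ['hash', 'logs']
# Made-up record with a nested 'logs' value
line = '{"hash": "0xaa", "logs": [{"address": "0x01", "topics": ["0x02"]}]}'

buffer = io.StringIO()  # stands in for the real output file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(keys)
resource = json.loads(line)
writer.writerow([flatten_cell(resource.get(k, "")) for k in keys])
output = buffer.getvalue()

# Reading the CSV back, the logs cell parses as JSON again
rows = list(csv.reader(io.StringIO(output)))
print(json.loads(rows[1][1]))
```

The csv module takes care of quoting the embedded commas and double quotes, so the cell survives a round trip through csv.reader and json.loads.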
