How to write a JSON null value as an empty line in a new file (converting a JSON-based log into column format, i.e., one file per column)
Sample log file:

```
{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}
```

It will generate 5 files: `timestamp.column`, `Field1.column`, `Field_Doc.f1.column`, `Field_Doc.f2.column`, `Field_Doc.f3.column`.

The column file format is as follows; for example, `timestamp.column` would contain:

```
2022-01-14T00:12:21.000
2022-01-18T00:15:51.000
```

Note: the fields in the log are dynamic, so do not assume these are the only attributes to expect.

Can anyone tell me how to do this? The log files are about 4 GB to 48 GB in size.
If every JSON object is on a single line, then you can `open()` the file and read it line by line with `for line in file:`. Next you can use the `json` module to convert each line into a dictionary and process it.

You can use `for key, value in data.items():` to process each item separately. The `key` can be used to build the filename `f"{key}.column"`, which you open in append mode `"a"` and write `str(value) + "\n"` into.

Because you have nested dictionaries, you need `isinstance(value, dict)` to check whether a value is itself a dictionary like `{"f1": 0, "f2": 1.7, "f3": 2}` and run the same code on it again - which may require recursion.
Minimal working code. I use `io` only to emulate a file in memory, but you should use `open(filename)`:
```python
import json
import io

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

# --- functions ---

def process_dict(data, prefix=""):
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            process_dict(value, key)  # recurse into nested dict
        else:
            with open(key + '.column', "a") as f:
                f.write(str(value) + "\n")

# --- main ---

#file_obj = open("filename")
file_obj = io.StringIO(file_data)  # emulate file in memory

for line in file_obj:
    data = json.loads(line)
    print(data)
    process_dict(data)
    #process_dict(data, "some prefix for all files")
```
EDIT:

A more universal version - it takes a function as a third argument, so it can be used with different functions.
```python
import json
import io

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

# --- functions ---

def process_dict(data, func, prefix=""):
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            process_dict(value, func, key)  # recurse into nested dict
        else:
            func(key, value)  # let the callback decide what to do with a leaf

def write_func(key, value):
    with open(key + '.column', "a") as f:
        f.write(str(value) + "\n")

# --- main ---

#file_obj = open("filename")
file_obj = io.StringIO(file_data)  # emulate file in memory

for line in file_obj:
    data = json.loads(line)
    print(data)
    process_dict(data, write_func)
    #process_dict(data, write_func, "some prefix for all files")
```
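To show what that third argument buys you, any callable with a `(key, value)` signature can be plugged in. As a sketch, here is a hypothetical `collect_func` (my own name, not part of the original answer) that gathers values per column in memory instead of writing files:

```python
import json

def process_dict(data, func, prefix=""):
    # Walk the (possibly nested) dict and call func(key, value) for every leaf.
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            process_dict(value, func, key)
        else:
            func(key, value)

collected = {}

def collect_func(key, value):
    # Instead of appending to key + '.column', gather values per column in memory.
    collected.setdefault(key, []).append(value)

lines = ['{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}',
         '{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}']

for line in lines:
    process_dict(json.loads(line), collect_func)

print(collected["Field_Doc.f1"])  # -> [0, 0]
```

The traversal code is unchanged; only the callback differs, which is the point of passing the function in.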
Another idea to make it more universal is to create a function which flattens the dict into

```
{'timestamp': '2022-01-14T00:12:21.000', 'Field1': 10, 'Field_Doc.f1': 0}
{'timestamp': '2022-01-18T00:15:51.000', 'Field_Doc.f1': 0, 'Field_Doc.f2': 1.7, 'Field_Doc.f3': 2}
```

and later use a loop to write the elements.
```python
import json
import io

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

# --- functions ---

def flatten_dict(data, prefix=""):
    result = {}
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            result.update(flatten_dict(value, key))  # recurse into nested dict
        else:
            result[key] = value
            #result.update( {key: value} )
    return result

# --- main ---

#file_obj = open("filename")
file_obj = io.StringIO(file_data)  # emulate file in memory

for line in file_obj:
    data = json.loads(line)
    print('before:', data)
    data = flatten_dict(data)
    #data = flatten_dict(data, "some prefix for all items")
    print('after :', data)
    print('---')
    for key, value in data.items():
        with open(key + '.column', "a") as f:
            f.write(str(value) + "\n")
```
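One practical note for the 4-48 GB files mentioned in the question: opening and closing every `.column` file once per log line can dominate the runtime. A minimal sketch that keeps each column's file handle open in a dictionary instead - the `write_columns` helper, the handle cache, and the `tempfile` scratch directory are my own additions for illustration, not part of the original answer:

```python
import io
import json
import os
import tempfile

def flatten_dict(data, prefix=""):
    # Flatten nested dicts into dotted keys,
    # e.g. {"Field_Doc": {"f1": 0}} -> {"Field_Doc.f1": 0}
    result = {}
    for key, value in data.items():
        if prefix:
            key = prefix + "." + key
        if isinstance(value, dict):
            result.update(flatten_dict(value, key))
        else:
            result[key] = value
    return result

outdir = tempfile.mkdtemp()  # scratch directory; replace with your output path
handles = {}                 # one open file handle per column

def write_columns(data):
    for key, value in flatten_dict(data).items():
        f = handles.get(key)
        if f is None:  # first time this column appears: open its file once
            f = handles[key] = open(os.path.join(outdir, key + ".column"), "a")
        f.write(str(value) + "\n")

file_data = '''{"timestamp": "2022-01-14T00:12:21.000", "Field1": 10, "Field_Doc": {"f1": 0}}
{"timestamp": "2022-01-18T00:15:51.000", "Field_Doc": {"f1": 0, "f2": 1.7, "f3": 2}}'''

for line in io.StringIO(file_data):  # in real use: for line in open(filename):
    write_columns(json.loads(line))

for f in handles.values():  # flush and close everything at the end
    f.close()
```

Since the fields are dynamic, the cache grows by one handle per distinct column; if the logs can contain very many distinct keys, you may hit the OS limit on open files and need to cap or evict handles.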