Read a large JSON file (>5 GB) line by line, process each line, and create a DataFrame using Pandas
I am reading a file line by line and processing each line, but I am not getting the desired output.
inputfile.txt
{"M":{"1":"data","2":"esf"},"D":{"4":12312,"6":"err"},"R":{"33":"eres","wer":454}}
{"M":{"1":"a","2":"2"},"D":{"4":3456,"6":"esrr"},"R":{"33":"esre","wer":447}}
{"M":{"1":"data3","2":"fer"},"D":{"4":9873,"6":"errs"},"R":{"33":"eret","wer":189,"55":"rt"}}
Code:
import pandas as pd;
import json
with open("inputfile.txt") as f:
    for line in f:
        data=(json.loads(f))
        d=[{k1+k2:v2 for k2,v2 in v1.items()} for k1,v1 in data.items()]
        keys=[k for x in d for k in x.items()]
        keys=list(set(keys))
        df=pd.DataFrame(d,columns=keys)
        print (df)
Desired output:
M1,M2,D4,D6,R33,Rwer,R55
data,esf,12312,err,eres,454,NA
a,2,3456,esrr,esre,447,NA
data3,fer,9873,errs,eret,189,rt
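Try parsing each line rather than the file object (json.loads(f) in the question should be json.loads(line)):
with open("inputfile.txt") as f:
    for line in f:
        # process_lines is a placeholder for your own per-line processing
        process_lines(json.loads(line))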
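The snippet above leaves the per-line processing function undefined. One possible shape for it is sketched below, under the assumption that every top-level value is itself a dictionary, as in the sample data. The name `flatten_record` is hypothetical, and `io.StringIO` stands in for the real file so the example runs on its own.

```python
import io
import json

import pandas as pd

# Sample lines from the question, standing in for the real file.
SAMPLE = """\
{"M":{"1":"data","2":"esf"},"D":{"4":12312,"6":"err"},"R":{"33":"eres","wer":454}}
{"M":{"1":"a","2":"2"},"D":{"4":3456,"6":"esrr"},"R":{"33":"esre","wer":447}}
{"M":{"1":"data3","2":"fer"},"D":{"4":9873,"6":"errs"},"R":{"33":"eret","wer":189,"55":"rt"}}
"""


def flatten_record(record):
    """Hypothetical helper: turn {"M": {"1": x}} into {"M1": x}."""
    return {outer + inner: value
            for outer, sub in record.items()
            for inner, value in sub.items()}


rows = []
with io.StringIO(SAMPLE) as f:  # for a real file: open("inputfile.txt")
    for line in f:
        line = line.strip()
        if line:  # skip blank lines
            rows.append(flatten_record(json.loads(line)))

# One flat dict per input line becomes one DataFrame row;
# keys missing from a row (R55 here) are filled with NaN.
df = pd.DataFrame(rows)
print(df)
```

Collecting plain dictionaries and building the DataFrame once at the end avoids creating a new DataFrame on every loop iteration, which is what the code in the question does.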
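You have to read the file once, load each line as a JSON string, and then apply your processing. With io, json and pandas (as pd) imported, the code could be:
# t holds the input text; for a real file, iterate over the open file object instead of io.StringIO(t)
df = pd.DataFrame([{k1+k2:v2 for k1,v1 in data.items() for k2,v2 in v1.items()}
                   for data in [json.loads(line) for line in io.StringIO(t)]])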
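This uses a list comprehension to build a list containing one dictionary per line, and finally constructs a DataFrame from it.
With your sample data, I get: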
      D4    D6     M1   M2   R33  R55  Rwer
0  12312   err   data  esf  eres  NaN   454
1   3456  esrr      a    2  esre  NaN   447
2   9873  errs  data3  fer  eret   rt   189
If you want to reorder the columns, just use:
df[['M1', 'M2', 'D4', 'D6', 'R33', 'Rwer', 'R55']]
which gives the expected result:
      M1   M2     D4    D6   R33  Rwer  R55
0   data  esf  12312   err  eres   454  NaN
1      a    2   3456  esrr  esre   447  NaN
2  data3  fer   9873  errs  eret   189   rt
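The question asks for comma-separated output with NA for missing values and no row index, which differs from the printed DataFrame above. A minimal sketch of one way to get there, assuming the rows have already been flattened and the desired column order is known in advance (the names `rows`, `wanted` and `csv_text` are illustrative):

```python
import pandas as pd

# Flattened rows for the sample data from the question.
rows = [
    {"M1": "data", "M2": "esf", "D4": 12312, "D6": "err", "R33": "eres", "Rwer": 454},
    {"M1": "a", "M2": "2", "D4": 3456, "D6": "esrr", "R33": "esre", "Rwer": 447},
    {"M1": "data3", "M2": "fer", "D4": 9873, "D6": "errs", "R33": "eret", "Rwer": 189, "R55": "rt"},
]

wanted = ["M1", "M2", "D4", "D6", "R33", "Rwer", "R55"]

# Unlike df[wanted], reindex does not raise if a wanted column is absent;
# it adds the column filled with NaN.
df = pd.DataFrame(rows).reindex(columns=wanted)

# na_rep sets the text written for missing values;
# index=False drops the 0, 1, 2 row labels.
csv_text = df.to_csv(index=False, na_rep="NA")
print(csv_text)
```

Passing a file path as the first argument to `to_csv` writes the same text to disk instead of returning it as a string.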
An extended solution using an intermediate text I/O buffer (which also acts as a context manager):
import pandas as pd
import json
import io

with open('input.json') as f, io.StringIO() as temp_file:
    for line in f:
        json_data = json.loads(line)
        # Flatten {"M": {"1": ...}} into {"M1": ...}
        d = {k + sub_k: val for k, inner_d in json_data.items()
             for sub_k, val in inner_d.items()}
        temp_file.write(json.dumps(d) + '\n')
    temp_file.seek(0)  # rewind the buffer before reading it back
    df = pd.read_json(temp_file, orient='columns', lines=True)
    print(df.to_string())
Sample output:
      D4    D6     M1   M2   R33  R55  Rwer
0  12312   err   data  esf  eres  NaN   454
1   3456  esrr      a    2  esre  NaN   447
2   9873  errs  data3  fer  eret   rt   189
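The approaches above hold every row in memory at once, which may not be practical for the >5 GB file mentioned in the title. The following is a rough sketch of a batched variant, not a tested solution for files of that size. It assumes the full set of output columns is known in advance; `flatten_record`, `iter_frames` and `WANTED` are hypothetical names, and `io.StringIO` objects stand in for the real input and output files.

```python
import io
import itertools
import json

import pandas as pd

SAMPLE = """\
{"M":{"1":"data","2":"esf"},"D":{"4":12312,"6":"err"},"R":{"33":"eres","wer":454}}
{"M":{"1":"a","2":"2"},"D":{"4":3456,"6":"esrr"},"R":{"33":"esre","wer":447}}
{"M":{"1":"data3","2":"fer"},"D":{"4":9873,"6":"errs"},"R":{"33":"eret","wer":189,"55":"rt"}}
"""

# Assumption: the output columns are known before reading the file.
WANTED = ["M1", "M2", "D4", "D6", "R33", "Rwer", "R55"]


def flatten_record(record):
    """Turn {"M": {"1": x}} into {"M1": x}."""
    return {outer + inner: value
            for outer, sub in record.items()
            for inner, value in sub.items()}


def iter_frames(lines, chunk_size=100_000):
    """Yield one DataFrame per batch of at most chunk_size lines."""
    lines = iter(lines)
    while True:
        batch = list(itertools.islice(lines, chunk_size))
        if not batch:
            return
        yield pd.DataFrame([flatten_record(json.loads(line))
                            for line in batch if line.strip()])


out = io.StringIO()  # for a real run: open("output.csv", "w", newline="")
chunk_lengths = []
for i, frame in enumerate(iter_frames(io.StringIO(SAMPLE), chunk_size=2)):
    chunk_lengths.append(len(frame))
    # Fix the column set per chunk so every chunk lines up in the CSV,
    # and write the header only once.
    frame.reindex(columns=WANTED).to_csv(out, index=False, na_rep="NA",
                                         header=(i == 0))

print(out.getvalue())
```

Only one batch of rows is held in memory at a time, at the cost of needing the column list up front; a column that first appears in a later batch would otherwise be missing from the header.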