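Parsing unstructured JSON into CSV
I have a year of application data for several different applications, in JSON format. Each application has 10 separate JSON files, and I am trying to merge them into a single CSV. First, here is the data structure: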
[{"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
When I parse this into a pandas DataFrame, I get the following:
date downloads end data
2017-10-23 15358985 2017-10-23 {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}
2017-10-22 12778233 2017-10-22 {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}
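Note that not every version is downloaded every day. How can I create one column per application version? If a version was not downloaded on a particular date, the cell can be left empty or filled with NaN.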
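I think you need the DataFrame constructor with reindex to add the missing rows (the first date in the sample is changed to 2017-10-25 here so that there is a gap to fill):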
j = [{"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
import pandas as pd

# Index by date, then add a row for every calendar day in the range.
df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)
data downloads \
2017-10-22 {'2.6.4.1-signed': 8, '2.99.0.1857beta': 4, '2... 12778233.0
2017-10-23 NaN NaN
2017-10-24 NaN NaN
2017-10-25 {'2.7.2.4151-beta': 1, '1.0.1': 268, '2.9.0.42... 15358985.0
end
2017-10-22 2017-10-22
2017-10-23 NaN
2017-10-24 NaN
2017-10-25 2017-10-23
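In the output above the data column still holds one dict per row, so the versions are not yet separate columns. A minimal sketch of one way to expand them with the plain DataFrame constructor follows. The names records, top, versions and wide are illustrative, and the two records are shortened copies of the sample data, not the full input.

```python
import pandas as pd

# Shortened records in the same shape as the sample data (illustrative only).
records = [
    {"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23",
     "data": {"1.0.1": 268, "1.0.2": 715}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.4.1": 842, "2.6.10": 11538}},
]

dates = pd.to_datetime([rec["date"] for rec in records])

# Top-level fields only, without the nested dict and the date (used as index).
top = pd.DataFrame(
    [{k: v for k, v in rec.items() if k not in ("data", "date")} for rec in records],
    index=dates,
)

# One column per version; a version missing on a given day becomes NaN.
versions = pd.DataFrame([rec["data"] for rec in records], index=dates)

# Combine both parts, then add a row for every calendar day in the range.
wide = top.join(versions).sort_index()
wide = wide.reindex(pd.date_range(wide.index.min(), wide.index.max()))
print(wide)
```

Building the version columns before calling reindex avoids having to unpack NaN placeholders from the data column afterwards.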
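Alternatively, here is a solution with json_normalize, which expands the nested data dict into one column per version. If the records have different keys, you get a lot of NaN values: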
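# pd.json_normalize needs pandas 1.0+; on older versions use: from pandas.io.json import json_normalize
df = pd.json_normalize(j).set_index('date')
df.index = pd.to_datetime(df.index)
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)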
data.1.0.1 data.1.0.2 data.2.2.3.1-signed data.2.3.1.1-signed \
2017-10-22 NaN NaN NaN 3.0
2017-10-23 NaN NaN NaN NaN
2017-10-24 NaN NaN NaN NaN
2017-10-25 268.0 715.0 9292.0 NaN
data.2.4.1 data.2.6.10 data.2.6.4.1-signed \
2017-10-22 842.0 11538.0 8.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 NaN NaN NaN
data.2.7.2.4151-beta data.2.7.3.4196-beta data.2.7.3.4198-beta \
2017-10-22 NaN 5.0 4.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 1.0 7.0 NaN
data.2.7.3.4215-beta data.2.9.0.4250-beta data.2.99.0.1857beta \
2017-10-22 NaN NaN 4.0
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 2.0 1.0 NaN
data.2.99.0.1872beta downloads end
2017-10-22 12.0 12778233.0 2017-10-22
2017-10-23 NaN NaN NaN
2017-10-24 NaN NaN NaN
2017-10-25 NaN 15358985.0 2017-10-23
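To get from the normalized frame to the single CSV the question asks for, one possible approach is to strip the data. prefix from the column names, tag each frame with its application, and concatenate. This is only a sketch under assumptions: the helper to_wide, the app column, and the tiny app_a / app_b record lists are hypothetical stand-ins for the records loaded from each application's JSON files, and pd.json_normalize assumes pandas 1.0 or newer.

```python
import pandas as pd

def to_wide(records, app):
    """Flatten one app's records: one row per date, one column per version."""
    df = pd.json_normalize(records)
    # json_normalize prefixes nested keys with "data."; drop that prefix.
    df.columns = [c[len("data."):] if c.startswith("data.") else c
                  for c in df.columns]
    df["date"] = pd.to_datetime(df["date"])
    df.insert(0, "app", app)  # hypothetical label so rows stay attributable
    return df

# Hypothetical stand-ins for the parsed contents of two applications' files.
app_a = [{"date": "2017-10-23", "downloads": 10, "end": "2017-10-23",
          "data": {"1.0.1": 7, "1.0.2": 3}}]
app_b = [{"date": "2017-10-23", "downloads": 5, "end": "2017-10-23",
          "data": {"2.4.1": 5}}]

# Versions that an application never shipped are left as NaN.
combined = pd.concat([to_wide(app_a, "app_a"), to_wide(app_b, "app_b")],
                     ignore_index=True)

# to_csv without a path returns the CSV text; pass a file name to write a file.
csv_text = combined.to_csv(index=False)
print(csv_text)
```

NaN cells are written as empty fields by default, which matches the "leave it empty" option mentioned in the question.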