Parsing unstructured JSON into CSV
I have yearly application data for different apps in JSON format. There are 10 different JSON files for each application, and I am trying to merge them into a single CSV. Let me first show you the data structure:
[{"date": "2017-10-23", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
When I parse them into a pandas DataFrame I get something like this:
date        downloads  end         data
2017-10-23  15358985   2017-10-23  {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}
2017-10-22  12778233   2017-10-22  {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}
Please note that not every version is downloaded every day. How could I create a column for each version of the application? If a version was not downloaded on a particular day, the cell could be left blank or filled with NaN.
I think you need the DataFrame constructor with reindex to add the missing rows:
j = [{"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23", "data": {"2.7.3.4196-beta": 7, "1.0.1": 268, "1.0.2": 715, "2.9.0.4250-beta": 1, "2.7.3.4215-beta": 2, "2.7.2.4151-beta": 1, "2.2.3.1-signed": 9292}}, {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22", "data": {"2.7.3.4196-beta": 5, "2.4.1": 842, "2.99.0.1872beta": 12, "2.99.0.1857beta": 4, "2.3.1.1-signed": 3, "2.6.10": 11538, "2.6.4.1-signed": 8, "2.7.3.4198-beta": 4}}]
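import pandas as pd

# Build the frame and turn the date column into a DatetimeIndex
df = pd.DataFrame(j).set_index('date')
df.index = pd.to_datetime(df.index)
# Reindex over the full daily range so missing days appear as NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)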
                                                         data   downloads  \
2017-10-22  {'2.6.4.1-signed': 8, '2.99.0.1857beta': 4, '2...  12778233.0
2017-10-23                                                NaN         NaN
2017-10-24                                                NaN         NaN
2017-10-25  {'2.7.2.4151-beta': 1, '1.0.1': 268, '2.9.0.42...  15358985.0

                   end
2017-10-22  2017-10-22
2017-10-23         NaN
2017-10-24         NaN
2017-10-25  2017-10-23
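With this approach the data column still holds whole dicts. A minimal sketch of one way to spread them into one column per version before reindexing, assuming records shaped like the sample (the list below is a trimmed copy of it, and the names `records`, `versions` and `wide` are only illustrative):

```python
import pandas as pd

# Trimmed copy of the sample records (only a few versions kept)
records = [
    {"date": "2017-10-25", "downloads": 15358985, "end": "2017-10-23",
     "data": {"1.0.1": 268, "1.0.2": 715}},
    {"date": "2017-10-22", "downloads": 12778233, "end": "2017-10-22",
     "data": {"2.4.1": 842, "2.6.10": 11538}},
]

df = pd.DataFrame(records).set_index("date")
df.index = pd.to_datetime(df.index)

# One column per version; a version missing from a day's dict becomes NaN
versions = pd.DataFrame(df["data"].tolist(), index=df.index)

# Swap the dict column for the expanded per-version columns
wide = df.drop(columns="data").join(versions)

# Fill in the days that have no record at all
wide = wide.reindex(pd.date_range(start=wide.index.min(), end=wide.index.max()))
print(wide)
```

Expanding before the reindex avoids having to handle NaN entries in the dict column.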
A solution with json_normalize, but if the JSON objects have different formats you get a lot of NaN values:
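# pd.json_normalize needs pandas >= 1.0; on older versions use
# from pandas.io.json import json_normalize
df = pd.json_normalize(j).set_index('date')
df.index = pd.to_datetime(df.index)
# Reindex over the full daily range so missing days appear as NaN rows
df = df.reindex(pd.date_range(start=df.index.min(), end=df.index.max()))
print(df)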
            data.1.0.1  data.1.0.2  data.2.2.3.1-signed  data.2.3.1.1-signed  \
2017-10-22         NaN         NaN                  NaN                  3.0
2017-10-23         NaN         NaN                  NaN                  NaN
2017-10-24         NaN         NaN                  NaN                  NaN
2017-10-25       268.0       715.0               9292.0                  NaN

            data.2.4.1  data.2.6.10  data.2.6.4.1-signed  \
2017-10-22       842.0      11538.0                  8.0
2017-10-23         NaN          NaN                  NaN
2017-10-24         NaN          NaN                  NaN
2017-10-25         NaN          NaN                  NaN

            data.2.7.2.4151-beta  data.2.7.3.4196-beta  data.2.7.3.4198-beta  \
2017-10-22                   NaN                   5.0                   4.0
2017-10-23                   NaN                   NaN                   NaN
2017-10-24                   NaN                   NaN                   NaN
2017-10-25                   1.0                   7.0                   NaN

            data.2.7.3.4215-beta  data.2.9.0.4250-beta  data.2.99.0.1857beta  \
2017-10-22                   NaN                   NaN                   4.0
2017-10-23                   NaN                   NaN                   NaN
2017-10-24                   NaN                   NaN                   NaN
2017-10-25                   2.0                   1.0                   NaN

            data.2.99.0.1872beta   downloads         end
2017-10-22                  12.0  12778233.0  2017-10-22
2017-10-23                   NaN         NaN         NaN
2017-10-24                   NaN         NaN         NaN
2017-10-25                   NaN  15358985.0  2017-10-23
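The question also asks about merging several JSON files into a single CSV. A rough sketch of how that could look, assuming every file contains a list of records shaped like the sample; the function name `json_files_to_csv` and the `source_file` column are hypothetical additions, not something from the question:

```python
import json
from pathlib import Path

import pandas as pd


def json_files_to_csv(json_paths, csv_path):
    """Flatten several JSON files of daily records into one CSV file."""
    frames = []
    for path in json_paths:
        with open(path, encoding="utf-8") as fh:
            records = json.load(fh)
        # Nested "data" dicts become columns named data.<version>
        frame = pd.json_normalize(records)
        # Hypothetical helper column: remember which file each row came from
        frame["source_file"] = Path(path).name
        frames.append(frame)
    # concat aligns on column names; versions absent from a file become NaN
    merged = pd.concat(frames, ignore_index=True, sort=False)
    # NaN values are written as empty fields by default
    merged.to_csv(csv_path, index=False)
    return merged
```

Calling it with the list of file paths for one application and an output path would write one row per record, with blank cells wherever a version was not downloaded that day.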