
Fast convert JSON column into Pandas dataframe

I'm reading data from a database (50k+ rows) where one column is stored as JSON. I want to extract that into a pandas dataframe. The snippet below works fine but is fairly inefficient and really takes forever when run against the whole db. Note that not all the items have the same attributes and that the JSON has some nested attributes.

How could I make this faster?

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])

df.data.apply(json.loads) \
       .apply(pd.json_normalize) \
       .pipe(lambda x: pd.concat(x.values))
# this returns a DataFrame where each JSON key is a column

json_normalize takes already-parsed JSON (a dict or a list of dicts), or a pandas Series of such objects, so the whole column can be normalized in a single call:

pd.json_normalize(df.data.apply(json.loads))
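A minimal sketch of the same idea on a small made-up sample instead of the pastebin data (the records and key names below are assumptions for illustration only). The strings are parsed once and the whole list is flattened in a single call, rather than building one DataFrame per row and concatenating them afterwards:

```python
import json

import pandas as pd

# Hypothetical sample rows; the real values would come from the database column.
raw = pd.Series(
    [
        '{"id": 1, "user": {"name": "a", "age": 30}}',
        '{"id": 2, "user": {"name": "b"}, "tag": "x"}',
    ],
    name="data",
)

# Parse every JSON string once, then flatten all records in one call.
records = raw.apply(json.loads).tolist()
flat = pd.json_normalize(records)

# Nested keys become dotted column names such as "user.name";
# attributes missing from a record are filled with NaN.
print(sorted(flat.columns))
print(flat)
```

Passing a plain list via `.tolist()` is a conservative choice here; a Series of dicts should also be accepted, but the list form avoids depending on how a given pandas version handles the index.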

Setup

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])

I think you can first convert the string column data to dicts, then build a list from the values, and finally pass that list to DataFrame.from_records:

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2', \
                 header=None, index_col=0, names=['data'])

a = df.data.apply(json.loads).values.tolist()
print(pd.DataFrame.from_records(a))
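One caveat worth checking against your own data: as far as I can tell, DataFrame.from_records does not flatten nested objects, so a nested attribute ends up as a dict inside a single column. A small sketch with made-up records (the keys are assumptions, not taken from the pastebin sample) comparing it with json_normalize:

```python
import json

import pandas as pd

# Hypothetical rows: one with a nested "user" object, one without it.
rows = ['{"id": 1, "user": {"name": "a"}}', '{"id": 2, "tag": "x"}']
parsed = [json.loads(s) for s in rows]

# from_records keeps the nested object as a dict in the "user" column.
wide = pd.DataFrame.from_records(parsed)
print(type(wide.loc[0, "user"]))

# json_normalize expands the nested object into a "user.name" column.
flat = pd.json_normalize(parsed)
print(sorted(flat.columns))
```

So from_records may be enough when the JSON is flat, while json_normalize is the safer choice when nested attributes have to become columns.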

Another idea:

df = pd.json_normalize(df['data'].apply(json.loads).tolist())

data = {"events": [
    {"timemillis": 1563467463580, "date": "18.7.2019", "time": "18:31:03,580", "name": "Player is loading", "data": ""},
    {"timemillis": 1563467463668, "date": "18.7.2019", "time": "18:31:03,668", "name": "Player is loaded", "data": "5"}
]}

import pandas as pd

result = pd.json_normalize(data, 'events')
print(result)
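If every row of the JSON column holds an object of this shape, the same call can be applied to the whole column at once. A rough sketch, assuming a hypothetical Series of JSON strings that each contain an "events" list (the third event name is invented for the example):

```python
import json

import pandas as pd

# Hypothetical column: each row is a JSON string with an "events" list.
col = pd.Series(
    [
        '{"events": [{"name": "Player is loading", "data": ""}]}',
        '{"events": [{"name": "Player is loaded", "data": "5"},'
        ' {"name": "Player is playing", "data": ""}]}',
    ]
)

# Parse all rows first, then let json_normalize walk into "events"
# for every parsed object and stack the records into one DataFrame.
docs = [json.loads(s) for s in col]
events = pd.json_normalize(docs, record_path="events")
print(events)
```

Each event becomes one row, so the result has as many rows as there are events across all parsed objects, not one row per original record.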
