Convert Python nested JSON-like data to a DataFrame
My records look like this, and I need to write them to a CSV file:
my_data={"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}
This looks like JSON, but the next record starts with "data" again and not "data1", which forces me to read each record separately. Then I convert it to a dict using eval(), so I can iterate through the keys and values along a certain path to get to the values I need. Then I generate a list of keys and values based on the keys I need. Finally, pd.DataFrame() converts that list into a DataFrame, which I know how to convert to CSV.
My code that works is below, but I am sure there are better ways to do this. Mine scales poorly. Thanks.
counter = 1
k = []
v = []
res = []
m = 0
for line in f2:  # f2 is the open input file, one record per line
    jline = eval(line)
    counter += 1
    for items in jline:
        # list() so the keys and values can be indexed below
        k.append(list(jline[u'data'][0].keys()))
        v.append(list(jline[u'data'][0].values()))
    print('keys are:', k)
    i = 0
    j = 0
    while i < 3:
        while j < 3:
            if k[i][j] == u'id':
                res.append(v[i][j])
            j += 1
        i += 1
    # res is my result set
    del k[:]
    del v[:]
If you change my_data to be a flat list of records:
my_data = [{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data One
{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data Two
{"id":"xyz3","type":"book","attributes":{"doc_type":"article","action":"cut"}}] # Data Three
you can load it directly into a DataFrame like so:
mydf = pd.DataFrame(my_data)
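To get from one-record-per-line input to that flat list, something like the following might work. It is only a sketch: it assumes each line is valid JSON (if the lines are Python literals such as u'...' strings, ast.literal_eval would be the safer stand-in for eval), and it uses io.StringIO as a stand-in for the real file f2. pd.json_normalize is used instead of pd.DataFrame so that the nested "attributes" dict becomes separate columns.

```python
import io
import json

import pandas as pd

# Stand-in for the input file: one JSON record per line (assumed layout).
f2 = io.StringIO(
    '{"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}\n'
    '{"data":[{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}\n'
)

rows = []
for line in f2:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    record = json.loads(line)    # parses JSON text without executing it
    rows.extend(record["data"])  # each record holds a list under "data"

# json_normalize flattens nested dicts into dotted column names,
# such as attributes.doc_type and attributes.action.
flat_df = pd.json_normalize(rows)

# to_csv() with no path returns the CSV as a string; pass a filename to write a file.
csv_text = flat_df.to_csv(index=False)
print(csv_text)
```

Collecting plain dicts in a list and building the DataFrame once avoids the per-record key/value bookkeeping in the question.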
It's not clear what your data path would be, but if you are looking for specific combinations of id, type, etc., you could search for them explicitly:
def find_my_way(data, pattern):
    # pattern = {'id': 'someid', 'type': 'sometype', ...}
    res = []
    for row in data:
        # keep the row only if every key in pattern matches
        if all(row.get(key) == value for key, value in pattern.items()):
            res.append(row)
    return res

mydf = pd.DataFrame(find_my_way(my_data, pattern))
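The same kind of filter can also be applied after the data is in a DataFrame, which may be more convenient for larger data. The sketch below assumes the three-record my_data list from above and a hypothetical pattern dict; it builds a boolean mask with one comparison per pattern key.

```python
import pandas as pd

my_data = [
    {"id": "xyz", "type": "book", "attributes": {"doc_type": "article", "action": "cut"}},
    {"id": "xyz2", "type": "book", "attributes": {"doc_type": "article", "action": "cut"}},
    {"id": "xyz3", "type": "book", "attributes": {"doc_type": "article", "action": "cut"}},
]

# Hypothetical search pattern: column name -> required value.
pattern = {"id": "xyz2", "type": "book"}

df = pd.DataFrame(my_data)

# Start with "every row matches" and narrow it down one column at a time.
mask = pd.Series(True, index=df.index)
for column, wanted in pattern.items():
    mask &= (df[column] == wanted)

matches = df[mask]
print(matches)
```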
EDIT:
Without going into how the API works, in pseudo-code you'll want to do something like the following:
my_objects = []
calls = 0
while calls < maximum:
    my_data = call_the_api(params)
    calls += 1  # count every call so the loop always terminates
    data = my_data.get('data')
    if not data:
        continue
    # API calls for a single object usually return a dictionary; calls for a group of objects return a list. This handles both cases.
    if isinstance(data, list):
        my_objects = [*data, *my_objects]
    elif isinstance(data, dict):
        my_objects = [{**data}, *my_objects]

# This unpacks the data responses into a list that you can then load into a DataFrame, with the attributes from the API as the columns
df = pd.DataFrame(my_objects)
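A runnable version of that loop might look like the sketch below. The real API is not shown in the question, so FAKE_RESPONSES and this call_the_api stub are made up purely for illustration. Unlike the pseudo-code, this version appends rather than prepends, so objects stay in call order.

```python
import pandas as pd

# Hypothetical canned responses standing in for a real API.
FAKE_RESPONSES = [
    {"data": {"id": "a", "type": "book"}},        # single object: a dict
    {"data": [{"id": "b", "type": "book"},
              {"id": "c", "type": "film"}]},      # group of objects: a list
    {"meta": {}},                                 # response with no "data" key
]


def call_the_api(page):
    """Stub: return the canned response for the given page number."""
    return FAKE_RESPONSES[page]


def collect_objects(max_calls):
    """Call the API max_calls times and gather every object found under "data"."""
    objects = []
    for page in range(max_calls):
        data = call_the_api(page).get("data")
        if not data:
            continue  # nothing usable in this response
        if isinstance(data, list):
            objects.extend(data)
        elif isinstance(data, dict):
            objects.append(data)
    return objects


my_objects = collect_objects(len(FAKE_RESPONSES))
df = pd.DataFrame(my_objects)  # built once, after all calls are done
print(df)
```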
This assumes your data from the API looks like:
"""
{
    "links": {},
    "meta": {},
    "data": {
        "type": "FactivaOrganizationsProfile",
        "id": "Goog",
        "attributes": {
            "key_executives": {
                "source_provider": [
                    {
                        "code": "FACSET",
                        "descriptor": "FactSet Research Systems Inc.",
                        "primary": true
                    }
                ]
            }
        },
        "relationships": {
            "people": {
                "data": {
                    "type": "people",
                    "id": "39961704"
                }
            }
        }
    },
    "included": {}
}
"""
per the documentation, which is why I'm using my_data.get('data').
That should get all of your (unfiltered) data into a DataFrame.
Leaving the creation of the DataFrame until the very end is a bit more memory-friendly.
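Because the "data" object in a response like the one above is nested, pd.DataFrame would leave dicts inside the cells. One possible way to flatten it is pd.json_normalize, sketched below on a trimmed copy of the sample response. The variable names are illustrative only, and the handling of the source_provider list is an assumption about what is wanted.

```python
import json

import pandas as pd

# Trimmed copy of the sample response shown above.
response_text = """
{
  "links": {},
  "meta": {},
  "data": {
    "type": "FactivaOrganizationsProfile",
    "id": "Goog",
    "attributes": {
      "key_executives": {
        "source_provider": [
          {"code": "FACSET", "descriptor": "FactSet Research Systems Inc.", "primary": true}
        ]
      }
    },
    "relationships": {"people": {"data": {"type": "people", "id": "39961704"}}}
  },
  "included": {}
}
"""

my_data = json.loads(response_text)
data = my_data.get("data")

# Nested dicts become dotted column names; lists are left as-is in their cell.
flat = pd.json_normalize([data])

# The list of providers can be turned into its own table,
# tagged with the parent object's id so the two can be joined later.
providers = data["attributes"]["key_executives"]["source_provider"]
providers_df = pd.json_normalize(providers).assign(parent_id=data["id"])

print(flat.columns.tolist())
print(providers_df)
```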