[英]How to flatten nested dict formatted '_source' column of csv, into dataframe
我有500多個行的csv,其中一列“ _source”存儲為JSON。 我想將其提取到pandas數據框中。 我需要每個鍵成為其自己的列。 #我有一個1 mb的在線社交媒體數據Json文件,我需要將字典和鍵值轉換為它們自己的單獨列。 社交媒體數據來自Facebook,Twitter /網絡爬行等。大約528行獨立的帖子/推文/文本行,每行在詞典中都有許多詞典。 我在下面的Jupyter筆記本電腦上附加了幾個步驟,以提供更完整的理解。 需要將詞典中詞典的所有鍵值對都轉換為數據框內的列。非常感謝,這將是巨大的幫助!!! 我試圖通過這樣做將其更改為數據框
source = pd.DataFrame.from_dict(source, orient='columns')
它返回了類似的內容...我以為可以解開字典的包裝,但沒有。
#source.head()
#_source
#0 {'sub_organization_id': 'default', 'uid': 'aba...
#1 {'sub_organization_id': 'default', 'uid': 'ab0...
#2 {'sub_organization_id': 'default', 'uid': 'ac0...
下面是形狀
#source.shape (528, 1)
實際的“ _source”行如下所示。 有許多字典和鍵:值對,其中每個鍵都必須是其自己的列。 謝謝! 出於隱私原因,實際鏈接已更改/加擾。
{'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
發布之前,請確保實際代碼適用於所附加的數據。 謝謝!
我嘗試了下面的代碼,但是它不起作用,存在語法錯誤,我無法弄清楚。
pd.io.json.json_normalize(source_data。[_ source] .apply(json.loads))
pd.io.json.json_normalize(source_data.[_source].apply(json.loads))
^
SyntaxError: invalid syntax
誰能幫助我,這將是聖人!
前段時間我不得不做類似的事情。 基本上,我使用了一個將json完全弄平的函數,以標識將被轉換為列的鍵,然后遍歷json來重建一行並將每行追加到“結果”數據幀中。 因此,使用您提供的數據,它創建了52列行,並對其進行瀏覽,看起來它已將所有鍵包含在其自己的列中。 嵌套的任何內容,例如: 'meta': {'rule_matcher':[{'atribs': {'website': ...]}
然后,列名稱應為meta.rule_matcher.atribs.website
,其中'.'
表示那些嵌套鍵
data_source = {'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {'rule_matcher': [{'atribs': {'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'},
'results': [{'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {'campaign_title': 'AF',
'project_title': 'AF '}}]}],
'render': [{'attribs': {'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'},
'results': [{'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32}]}]},
'norm_attribs': {'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {'uid': '2ab8f2651cb32261b911c990a8b'},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'},
'type': 'crawl',
'norm': {'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'}}
碼:
def flatten_json(y):
out = {}
def flatten(x, name=''):
if type(x) is dict:
for a in x:
flatten(x[a], name + a + '_')
elif type(x) is list:
i = 0
for a in x:
flatten(a, name + str(i) + '_')
i += 1
else:
out[name[:-1]] = x
flatten(y)
return out
flat = flatten_json(data_source)
import pandas as pd
import re
results = pd.DataFrame()
special_cols = []
columns_list = list(flat.keys())
for item in columns_list:
try:
row_idx = re.findall(r'\_(\d+)\_', item )[0]
except:
special_cols.append(item)
continue
column = re.findall(r'\_\d+\_(.*)', item )[0]
column = re.sub(r'\_\d+\_', '.', column)
row_idx = int(row_idx)
value = flat[item]
results.loc[row_idx, column] = value
for item in special_cols:
results[item] = flat[item]
輸出:
print (results.to_string())
atribs_website atribs_source atribs_version atribs_type results.rule_type results.rule_tag results.description results.project_veid results.campaign_id results.value results.organization_id results.sub_organization_id results.appid results.project_id results.rule_id results.node_id results.metadata_campaign_title results.metadata_project_title attribs_website attribs_version attribs_type results.render_status results.path results.image_hash results.url results.load_time sub_organization_id uid project_veid campaign_id organization_id norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
0 github.com/res Explicit 1.1 crawl hashtag Far NaN A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far NaN NaN ray CDE2F42-5B87-C594-C900E578C 1838 NaN AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/render... bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32.0 default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.