JSON to Pandas DF
I have a dataset from Azure Firewall (firewall logs) that I store as JSON in Blob storage. The JSON looks like this.
{ "category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1551130Z", "resourceId": "/SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDERS/MICROSOFT.NETWORK/AZUREFIREWALLS/SEA-DEV", "operationName": "AzureFirewallNetworkRuleLog", "properties": {"msg":"TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow"}}
{ "category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1268490Z", "resourceId": "/SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDERS/MICROSOFT.NETWORK/AZUREFIREWALLS/SEA-DEV", "operationName": "AzureFirewallNetworkRuleLog", "properties": {"msg":"UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow"}}
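There are a few million rows per day to go through, grouping source IPs against the ports that were allowed or denied, so analyzing this data in a Jupyter Notebook (JN) seems viable.
Problem:
I tried the code below, but ran into a problem when trying to flatten "properties" to get the "msg" field I want.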
import json
import pandas as pd
# load data using Python JSON module
with open('FWLog/FWLog2.json','r') as f:
    data = json.loads(f.read())
# Flatten data
df_nested_list = pd.json_normalize(data, record_path =['properties'])
Error:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-61-3500c0d62d55> in <module>
7 # load data using Python JSON module
8 with open('FWLog/FWLog2.json','r') as f:
----> 9 data = json.loads(f.read())
10 # Flatten data
11 df_nested_list = pd.json_normalize(data, record_path =['properties'])
~\anaconda3\lib\json\__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
355 parse_int is None and parse_float is None and
356 parse_constant is None and object_pairs_hook is None and not kw):
--> 357 return _default_decoder.decode(s)
358 if cls is None:
359 cls = JSONDecoder
~\anaconda3\lib\json\decoder.py in decode(self, s, _w)
338 end = _w(s, end).end()
339 if end != len(s):
--> 340 raise JSONDecodeError("Extra data", s, end)
341 return obj
342
JSONDecodeError: Extra data: line 2 column 1 (char 386)
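The "Extra data" message in the traceback above can be reproduced in a few lines. This is a minimal sketch, assuming the file holds one JSON object per line (JSON Lines), which is what the sample records suggest; the shortened records below are stand-ins, not real log entries.

```python
import json

# Two JSON objects separated by a newline, shortened stand-ins for the log records
text = (
    '{"category": "A", "properties": {"msg": "first"}}\n'
    '{"category": "B", "properties": {"msg": "second"}}\n'
)

try:
    # json.loads expects exactly one JSON document in the string
    json.loads(text)
    error_message = None
except json.JSONDecodeError as exc:
    # The decoder stops after the first object and finds more text behind it
    error_message = exc.msg

# Decoding one line at a time succeeds
records = [json.loads(line) for line in text.splitlines() if line.strip()]

print(error_message)
print(len(records))
```

The decoder finishes the first object, finds the second one where it expects the end of input, and raises; decoding each line separately avoids that.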
You can use lines=True in pd.read_json:
df = pd.read_json("your_file.txt", lines=True)
df_final = pd.concat([pd.DataFrame(df.pop("properties").to_list()), df], axis=1)
print(df_final)
Prints:
msg category time resourceId operationName
0 TCP request from 172.16.1.218:54652 to 172.17.... AzureFirewallNetworkRule 2021-01-31T00:00:00.1551130Z /SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDER... AzureFirewallNetworkRuleLog
1 UDP request from 172.16.1.218:53067 to 8.8.8.8... AzureFirewallNetworkRule 2021-01-31T00:00:00.1268490Z /SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDER... AzureFirewallNetworkRuleLog
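As a possible alternative to the df.pop("properties") and pd.concat step, pd.json_normalize can flatten the nested "properties" object once the lines have been parsed into dictionaries. The sketch below uses shortened in-memory records rather than the real file, and the final rename to "msg" is only a convenience, not something the answer above requires.

```python
import json

import pandas as pd

# Shortened stand-ins for two lines of the log file (resourceId etc. omitted)
sample = (
    '{"category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1551130Z", '
    '"properties": {"msg": "TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow"}}\n'
    '{"category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1268490Z", '
    '"properties": {"msg": "UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow"}}\n'
)

# Parse each non-empty line as its own JSON document
records = [json.loads(line) for line in sample.splitlines() if line.strip()]

# Nested keys become dotted column names, e.g. "properties.msg"
flat = pd.json_normalize(records)

# Optional: give the flattened column a shorter name
flat = flat.rename(columns={"properties.msg": "msg"})

print(flat.columns.tolist())
```

Note that record_path, as used in the question, is meant for nested lists of records; "properties" is a plain nested object, so no record_path is needed here.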
Your file contains multiple JSON documents. The error is raised in json.loads.
import json
import pandas as pd
# load data using Python JSON module
with open('test_json.json') as f:
    data = [json.loads(line) for line in f]
# Flatten data
pd.DataFrame([j['properties'] for j in data])
msg
0 TCP request from 172.16.1.218:54652 to 172.17....
1 UDP request from 172.16.1.218:53067 to 8.8.8.8...
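Once "msg" is available as a column, the grouping mentioned in the question (source IP against port, allowed or denied) could be sketched roughly as follows. The regular expression and the column names (protocol, src_ip, src_port, dst_ip, dst_port, action) are assumptions based only on the two sample messages; real firewall logs may contain other message shapes, and rows that do not match come back as NaN and are left out of the grouping.

```python
import pandas as pd

# The two sample messages from the question
df = pd.DataFrame({
    "msg": [
        "TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow",
        "UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow",
    ]
})

# Hypothetical pattern matching only the message shape shown in the samples
pattern = (
    r"(?P<protocol>\w+) request from "
    r"(?P<src_ip>[\d.]+):(?P<src_port>\d+) to "
    r"(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)\. "
    r"Action: (?P<action>\w+)"
)

# Named groups become columns; non-matching rows become NaN
parts = df["msg"].str.extract(pattern)

# Count requests per source IP, destination port and action
summary = (
    parts.groupby(["src_ip", "dst_port", "action"])
    .size()
    .reset_index(name="count")
)

print(summary)
```

The extracted ports are strings; converting them with astype(int) before grouping may be preferable if numeric sorting matters.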