I have a dataset (firewall logs) from Azure Firewall that I store in Blob Storage as JSON. The JSON looks like this:
{ "category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1551130Z", "resourceId": "/SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDERS/MICROSOFT.NETWORK/AZUREFIREWALLS/SEA-DEV", "operationName": "AzureFirewallNetworkRuleLog", "properties": {"msg":"TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow"}}
{ "category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1268490Z", "resourceId": "/SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDERS/MICROSOFT.NETWORK/AZUREFIREWALLS/SEA-DEV", "operationName": "AzureFirewallNetworkRuleLog", "properties": {"msg":"UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow"}}
There are a few million lines per day to go through, and I want to group the source IPs against the ports that are allowed or denied, so analyzing this data in a Jupyter notebook seems feasible.
The issue:
I tried the code below and ran into a problem when trying to flatten "properties", from which I want the "msg" field.
import json
import pandas as pd
# load data using Python JSON module
with open('FWLog/FWLog2.json','r') as f:
    data = json.loads(f.read())
# Flatten data
df_nested_list = pd.json_normalize(data, record_path =['properties'])
The Error:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-61-3500c0d62d55> in <module>
7 # load data using Python JSON module
8 with open('FWLog/FWLog2.json','r') as f:
----> 9 data = json.loads(f.read())
10 # Flatten data
11 df_nested_list = pd.json_normalize(data, record_path =['properties'])
~\anaconda3\lib\json\__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
355 parse_int is None and parse_float is None and
356 parse_constant is None and object_pairs_hook is None and not kw):
--> 357 return _default_decoder.decode(s)
358 if cls is None:
359 cls = JSONDecoder
~\anaconda3\lib\json\decoder.py in decode(self, s, _w)
338 end = _w(s, end).end()
339 if end != len(s):
--> 340 raise JSONDecodeError("Extra data", s, end)
341 return obj
342
JSONDecodeError: Extra data: line 2 column 1 (char 386)
You can use lines=True in pd.read_json:
df = pd.read_json("your_file.txt", lines=True)
df_final = pd.concat([pd.DataFrame(df.pop("properties").to_list()), df], axis=1)
print(df_final)
Prints:
msg category time resourceId operationName
0 TCP request from 172.16.1.218:54652 to 172.17.... AzureFirewallNetworkRule 2021-01-31T00:00:00.1551130Z /SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDER... AzureFirewallNetworkRuleLog
1 UDP request from 172.16.1.218:53067 to 8.8.8.8... AzureFirewallNetworkRule 2021-01-31T00:00:00.1268490Z /SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDER... AzureFirewallNetworkRuleLog
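Once msg is a column, the grouping described in the question (source IP against port and action) could be sketched roughly as follows. This assumes every message follows the layout of the two sample lines; the regular expression and the column names (protocol, src_ip, src_port, dst_ip, dst_port, action) are illustrative choices rather than anything defined by Azure, and messages in a different shape would simply come out as NaN rows.

```python
import pandas as pd

# Two sample messages copied from the log lines in the question.
df_final = pd.DataFrame({
    "msg": [
        "TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow",
        "UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow",
    ]
})

# Assumed layout: "<PROTO> request from <ip>:<port> to <ip>:<port>. Action: <word>"
pattern = (
    r"(?P<protocol>\w+) request from "
    r"(?P<src_ip>[\d.]+):(?P<src_port>\d+) to "
    r"(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)\. "
    r"Action: (?P<action>\w+)"
)

# Named groups become column names; non-matching rows become NaN.
parsed = df_final["msg"].str.extract(pattern)

# Count requests per source IP, destination port and action.
summary = (
    parsed.groupby(["src_ip", "dst_port", "action"])
    .size()
    .reset_index(name="count")
)
print(summary)
```

Note that the extracted ports are strings; convert them with astype(int) if numeric sorting or filtering is needed.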
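Your file contains multiple JSON objects, one per line, rather than a single JSON document. The error is raised by json.loads when it reaches the second object. Parse each line separately instead: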
import json
import pandas as pd
# load data using Python JSON module
with open('test_json.json') as f:
    data = [json.loads(line) for line in f]
# Flatten data
pd.DataFrame([j['properties'] for j in data])
msg
0 TCP request from 172.16.1.218:54652 to 172.17....
1 UDP request from 172.16.1.218:53067 to 8.8.8.8...
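If the other fields should be kept alongside msg, pd.json_normalize can be applied to the list of parsed lines without record_path, since "properties" is a nested dict rather than a list of records. A minimal sketch, using an in-memory stand-in for the file with shortened records (the raw and flat names are just for illustration):

```python
import io
import json
import pandas as pd

# In-memory stand-in for the log file: one JSON object per line (fields shortened).
raw = io.StringIO(
    '{"category": "AzureFirewallNetworkRule", "properties": '
    '{"msg": "TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow"}}\n'
    '{"category": "AzureFirewallNetworkRule", "properties": '
    '{"msg": "UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow"}}\n'
)

# Parse line by line, skipping blank lines.
data = [json.loads(line) for line in raw if line.strip()]

# Nested dict keys are flattened into dotted column names.
flat = pd.json_normalize(data)
print(flat.columns.tolist())
```

The nested key shows up as a column named properties.msg, next to the top-level fields.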