JSON to Pandas DF

I have a dataset of Azure Firewall logs, which I store in Blob Storage as JSON. The records look like this:

{ "category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1551130Z", "resourceId": "/SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDERS/MICROSOFT.NETWORK/AZUREFIREWALLS/SEA-DEV", "operationName": "AzureFirewallNetworkRuleLog", "properties": {"msg":"TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow"}}
{ "category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1268490Z", "resourceId": "/SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDERS/MICROSOFT.NETWORK/AZUREFIREWALLS/SEA-DEV", "operationName": "AzureFirewallNetworkRuleLog", "properties": {"msg":"UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow"}}

There are a few million lines per day to go through. I want to group the source IPs against the ports that are allowed or denied, so analyzing the data in a Jupyter Notebook seems feasible.

The issue:

I tried the code below and ran into a problem when trying to flatten "properties", from which I want the "msg" value.

import json
import pandas as pd

# load data using Python JSON module
with open('FWLog/FWLog2.json','r') as f:
    data = json.loads(f.read())
# Flatten data
df_nested_list = pd.json_normalize(data, record_path =['properties'])

The Error:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-61-3500c0d62d55> in <module>
      7 # load data using Python JSON module
      8 with open('FWLog/FWLog2.json','r') as f:
----> 9     data = json.loads(f.read())
     10 # Flatten data
     11 df_nested_list = pd.json_normalize(data, record_path =['properties'])

~\anaconda3\lib\json\__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    355             parse_int is None and parse_float is None and
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:
    359         cls = JSONDecoder

~\anaconda3\lib\json\decoder.py in decode(self, s, _w)
    338         end = _w(s, end).end()
    339         if end != len(s):
--> 340             raise JSONDecodeError("Extra data", s, end)
    341         return obj
    342 

JSONDecodeError: Extra data: line 2 column 1 (char 386)

You can use lines=True in pd.read_json:

df = pd.read_json("your_file.txt", lines=True)
df_final = pd.concat([pd.DataFrame(df.pop("properties").to_list()), df], axis=1)
print(df_final)

Prints:

                                                 msg                  category                          time                                         resourceId                operationName
0  TCP request from 172.16.1.218:54652 to 172.17....  AzureFirewallNetworkRule  2021-01-31T00:00:00.1551130Z  /SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDER...  AzureFirewallNetworkRuleLog
1  UDP request from 172.16.1.218:53067 to 8.8.8.8...  AzureFirewallNetworkRule  2021-01-31T00:00:00.1268490Z  /SUBSCRIPTIONS/RESOURCEGROUPS/SEA-DEV/PROVIDER...  AzureFirewallNetworkRuleLog
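
To get from the msg column to the grouping the question asks about, the text can be split into fields with a regular expression. What follows is a minimal sketch, assuming every msg has the same shape as the two sample lines ("<protocol> request from <ip>:<port> to <ip>:<port>. Action: <action>"). The pattern, the column names (protocol, src_ip, src_port, dst_ip, dst_port, action, hits) and the in-memory sample are illustrative assumptions, not part of the Azure log schema. Messages that do not match the pattern come out as NaN and are left out of the grouping.

```python
import io

import pandas as pd

# Stand-in for the log file: two records, one JSON object per line,
# trimmed to the fields used here.
sample = io.StringIO(
    '{"category": "AzureFirewallNetworkRule", "properties": {"msg": "TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow"}}\n'
    '{"category": "AzureFirewallNetworkRule", "properties": {"msg": "UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow"}}\n'
)

df = pd.read_json(sample, lines=True)

# Each cell of "properties" is a dict; pull out its "msg" string.
msg = df["properties"].map(lambda p: p["msg"])

# Assumed message layout, based only on the two sample lines.
pattern = (
    r"(?P<protocol>\w+) request from "
    r"(?P<src_ip>[\d.]+):(?P<src_port>\d+) to "
    r"(?P<dst_ip>[\d.]+):(?P<dst_port>\d+)\. "
    r"Action: (?P<action>\w+)"
)

# One column per named group; the extracted values are strings.
fields = msg.str.extract(pattern)

# Count hits per source IP, destination port and action.
counts = (
    fields.groupby(["src_ip", "dst_port", "action"])
    .size()
    .reset_index(name="hits")
)
print(counts)
```

The ports are kept as strings here; if numeric sorting matters, they can be converted with astype(int) after dropping the unmatched rows.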

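Your file contains multiple JSON objects, one per line, so it is not a single valid JSON document. The error is raised by json.loads, before json_normalize is ever reached. Parse each line separately: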

import json
import pandas as pd

# load data using Python JSON module
with open('test_json.json') as f:
    data = [json.loads(line) for line in f]
# Flatten data
pd.DataFrame([j['properties'] for j in data])

Output:

                                                 msg
0  TCP request from 172.16.1.218:54652 to 172.17....
1  UDP request from 172.16.1.218:53067 to 8.8.8.8...
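
Since the question started from pd.json_normalize, that route can be sketched as well. Once each line is parsed on its own, json_normalize flattens the nested "properties" dict without record_path, which is meant for nested lists of records rather than a single nested dict, and it keeps the other columns. The in-memory sample below is an assumption standing in for the log file and is trimmed to three fields; the column name properties.msg comes from json_normalize's default "." separator.

```python
import io
import json

import pandas as pd

# Stand-in for the log file: one JSON object per line.
sample = io.StringIO(
    '{"category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1551130Z", "properties": {"msg": "TCP request from 172.16.1.218:54652 to 172.17.1.219:8080. Action: Allow"}}\n'
    '{"category": "AzureFirewallNetworkRule", "time": "2021-01-31T00:00:00.1268490Z", "properties": {"msg": "UDP request from 172.16.1.218:53067 to 8.8.8.8:53. Action: Allow"}}\n'
)

# Parse line by line, skipping blank lines (for example a trailing newline).
records = [json.loads(line) for line in sample if line.strip()]

# Nested dicts become dotted column names such as "properties.msg".
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```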
