如何從嵌套的 JSON 中提取和計算值？

Question

我正在嘗試遍歷 json 列表並從每個 json 返回的字典中提取一些信息。 大約 99% 的情況下，每個 json 字典的第三層包含 5 個 'name' 值，其中 2 個是 xml 文件名。 但是，文件並不是每次都以相同的順序出現，選擇幾次，只有一個xml文件。

在代碼進入第二個循環之前，我構建了一個循環來使用搜索字符串計算 xml 文件的數量。 這確保了我在每個循環中創建的xml_dict具有正確數量的值 (2)。

“預計數器”有效，但確實減慢了執行速度。 無論如何更好地合並xml計數器以提高性能？ 另外，我不知道我是否需要“其他：繼續”。

示例 json 鏈接： https : //www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json

json_list = [all_forms['Link'][x] for x in all_forms.index if all_forms['Form Type'][x] == '13F-HR']
link_list = []
lcounter = 0
for json in json_list:
    decode = requests.get(json).json()
    xml_dict = {}
    xml_count = 0
    for dic in decode['directory']['item'][0:]:
        for v in dic.values(): 
            if ".xml" in v.lower():
                xml_count += 1
            else:
                continue
    for dic in decode['directory']['item'][0:]:
        if "primary_doc.xml" in dic['name'] and xml_count > 1:
            xml_dict['doc_xml'] = json.replace('index.json', '') + dic['name']
        elif ".xml" in dic['name'].lower() and "primary_doc" not in dic['name']:
            xml_dict['hold_xml'] = json.replace('index.json', '') + dic['name']
        else:
            continue
    if xml_dict:
        link_list.append(xml_dict)
    lcounter += 1
    if lcounter % 100 == 0:
        print("Processed {} forms".format(lcounter))

Answer 1

我認為使用帶有矢量化函數的pandas會更容易、更快
- 這是獲得所有計數的 5 行代碼，而且速度很快。
一旦 xml 文件計數可用以及所有.xml文件的路徑，請考慮查看如何將 XML 文件轉換為漂亮的熊貓數據幀？ 自動處理這些文件。

import pandas as pd

# list to index.json for Archives
paths = ['https://www.sec.gov/Archives/edgar/data/1736260/000119312515118890/index.json',
         'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json',
         'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/index.json']

# download and each json and join it into a single dataframe
# reset the index, so each row has a unique index number
df = pd.concat([pd.read_json(path, orient='index') for path in paths]).reset_index()

# item is a list of dictionaries that can be exploded to separate columns
dfe = df.explode('item').reset_index(drop=True)

# each dictionary now has a separate row
# normalize the dicts, so each key is a column name and each value is in the row
# rename 'name' to 'item_name', this is the column containing file names like .xml
# join this back to the main dataframe and drop the item row
dfj = dfe.join(pd.json_normalize(dfe.item).rename(columns={'name': 'item_name'})).drop(columns=['item'])

# find the rows with .xml in item_name
# groupby name, which is the archive path with CIK and Accession Number
# count the number of xml files
dfg = dfj.item_name[dfj.item_name.str.contains('.xml', case=False)].groupby(dfj.name).count().reset_index().rename(columns={'item_name': 'xml_count'})

# display(dfg)
                                              name  xml_count
0  /Archives/edgar/data/1736260/000173626020000004          2
1    /Archives/edgar/data/51143/000104746917001061          6

打印一個數據框，其中包含所有 xml 文件名，並在數據框中具有相應的索引

print(dfj[['name', 'item_name']][dfj.item_name.str.contains('.xml')].reset_index())

[out]:
   index                                             name                item_name
0     43  /Archives/edgar/data/1736260/000173626020000004  cpia2ndqtr202013fhr.xml
1     44  /Archives/edgar/data/1736260/000173626020000004          primary_doc.xml
2     66    /Archives/edgar/data/51143/000104746917001061        FilingSummary.xml
3     74    /Archives/edgar/data/51143/000104746917001061         ibm-20161231.xml
4     76    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_cal.xml
5     77    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_def.xml
6     78    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_lab.xml
7     79    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_pre.xml

僅使用 xml 文件創建一個數據框，並添加一個包含這些文件的完整路徑的列

xml_files = dfj[dfj.item_name.str.contains('.xml', case=False)].copy()

# add a column that creates a full path to the xml files
xml_files['file_path'] = xml_files[['name', 'item_name']].apply(lambda x: f'https://www.sec.gov{x[0]}/{x[1]}', axis=1)

# disply(xml_files)
        index                                             name                    parent-dir        last-modified                item_name      type      size                                                                                   file_path
43  directory  /Archives/edgar/data/1736260/000173626020000004  /Archives/edgar/data/1736260  2020-07-24 09:38:30  cpia2ndqtr202013fhr.xml  text.gif     72804  https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml
44  directory  /Archives/edgar/data/1736260/000173626020000004  /Archives/edgar/data/1736260  2020-07-24 09:38:30          primary_doc.xml  text.gif      1931          https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml
66  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36        FilingSummary.xml  text.gif     91940          https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml
74  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36         ibm-20161231.xml  text.gif  11684003           https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml
76  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_cal.xml  text.gif    185502       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml
77  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_def.xml  text.gif    801568       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml
78  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_lab.xml  text.gif   1356108       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml
79  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_pre.xml  text.gif   1314064       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml

# create a list of just the file paths
path_to_xml_files = xml_files.file_path.tolist()

print(path_to_xml_files)
[out]: 
['https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml',
 'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml']

如何從嵌套的 JSON 中提取和計算值？

問題描述

1 個解決方案

解決方案1
0 已采納 2020-09-15 22:39:38

如何從嵌套的 JSON 中提取和計算值？

問題描述

1 個解決方案

解決方案1 0 已采納 2020-09-15 22:39:38

解決方案1
0 已采納 2020-09-15 22:39:38