
How to extract and count values from a nested JSON?

I'm trying to loop through a list of JSONs and extract some information from the dictionary of dictionaries that each JSON returns. About 99% of the time, the third layer of each JSON dictionary contains 5 'name' values, 2 of which are XML file names. However, the files do not appear in the same order every time, and in a select few cases there is only one XML file.

Before the code proceeds to a second loop, I built in a loop that counts the number of XML files using a search string. This ensures the xml_dict I'm creating in each iteration has the correct number of values (2).

The "pre-counter" works, but it really slows down the execution. Is there any way to better incorporate the XML counter to speed up performance? Also, I don't know whether I need the 'else: continue' branches.

Example JSON link: https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json

import requests

json_list = [all_forms['Link'][x] for x in all_forms.index if all_forms['Form Type'][x] == '13F-HR']
link_list = []
lcounter = 0
for json_url in json_list:
    decode = requests.get(json_url).json()
    xml_dict = {}
    xml_count = 0
    # first pass: count the xml files in this filing
    for dic in decode['directory']['item']:
        for v in dic.values():
            if ".xml" in v.lower():
                xml_count += 1
            else:
                continue
    # second pass: build the links once the count is known
    for dic in decode['directory']['item']:
        if "primary_doc.xml" in dic['name'] and xml_count > 1:
            xml_dict['doc_xml'] = json_url.replace('index.json', '') + dic['name']
        elif ".xml" in dic['name'].lower() and "primary_doc" not in dic['name']:
            xml_dict['hold_xml'] = json_url.replace('index.json', '') + dic['name']
        else:
            continue
    if xml_dict:
        link_list.append(xml_dict)
    lcounter += 1
    if lcounter % 100 == 0:
        print("Processed {} forms".format(lcounter))
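One way to fold the counter into a single pass over the items is to collect the .xml names first and use len() instead of a separate counting loop. A minimal sketch (the helper name and sample data are illustrative, not from the original post):

```python
def extract_xml_links(decode, base_url):
    """Collect the xml links from one decoded index.json in a single pass."""
    items = decode['directory']['item']
    # gather all xml file names once; len() replaces the counting loop
    xml_names = [d['name'] for d in items if '.xml' in d['name'].lower()]
    xml_dict = {}
    for name in xml_names:
        if 'primary_doc' in name:
            # keep primary_doc.xml only when it is not the sole xml file
            if len(xml_names) > 1:
                xml_dict['doc_xml'] = base_url + name
        else:
            xml_dict['hold_xml'] = base_url + name
    return xml_dict

# toy decoded json mirroring the structure described above
decode = {'directory': {'item': [
    {'name': 'primary_doc.xml'},
    {'name': 'cpia2ndqtr202013fhr.xml'},
    {'name': 'index.html'},
]}}
links = extract_xml_links(decode, 'https://www.sec.gov/Archives/example/')
```

The per-filing body of the outer loop then reduces to one call per downloaded JSON, with base_url built from the index.json link as before.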
  • I think it will be easier and faster to use pandas with vectorized functions.
    • This is 5 lines of code to get all the counts, and it's fast.
  • Once the xml file counts and the paths to all the .xml files are available, consider looking at How to convert an XML file to nice pandas dataframe? to automate processing of those files.
import pandas as pd

# list to index.json for Archives
paths = ['https://www.sec.gov/Archives/edgar/data/1736260/000119312515118890/index.json',
         'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json',
         'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/index.json']

# download each json and join them into a single dataframe
# reset the index, so each row has a unique index number
df = pd.concat([pd.read_json(path, orient='index') for path in paths]).reset_index()

# item is a list of dictionaries that can be exploded to separate columns
dfe = df.explode('item').reset_index(drop=True)

# each dictionary now has a separate row
# normalize the dicts, so each key is a column name and each value is in the row
# rename 'name' to 'item_name', this is the column containing file names like .xml
# join this back to the main dataframe and drop the item row
dfj = dfe.join(pd.json_normalize(dfe.item).rename(columns={'name': 'item_name'})).drop(columns=['item'])

# find the rows with .xml in item_name
# groupby name, which is the archive path with CIK and Accession Number
# count the number of xml files
dfg = dfj.item_name[dfj.item_name.str.contains('.xml', case=False, regex=False)].groupby(dfj.name).count().reset_index().rename(columns={'item_name': 'xml_count'})

# display(dfg)
                                              name  xml_count
0  /Archives/edgar/data/1736260/000173626020000004          2
1    /Archives/edgar/data/51143/000104746917001061          6
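To see what explode plus json_normalize are doing without hitting the network, here is a self-contained toy version of the same pipeline (the archive paths and file names are invented):

```python
import pandas as pd

# toy frame mimicking the index.json structure: 'item' holds a list of dicts
df = pd.DataFrame({
    'name': ['/Archives/a', '/Archives/b'],
    'item': [
        [{'name': 'primary_doc.xml'}, {'name': 'holdings.xml'}, {'name': 'index.html'}],
        [{'name': 'form.txt'}],
    ],
})

# one row per dict, then flatten each dict into columns
dfe = df.explode('item').reset_index(drop=True)
dfj = dfe.join(pd.json_normalize(dfe.item).rename(columns={'name': 'item_name'})).drop(columns=['item'])

# regex=False treats '.xml' literally instead of as a regex pattern
xml_counts = (dfj.item_name[dfj.item_name.str.contains('.xml', case=False, regex=False)]
              .groupby(dfj.name).count())
```

Here xml_counts comes back as a Series keyed by the archive path, with '/Archives/b' absent because it has no xml files.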
  • print a dataframe with all the xml file names with the corresponding index in the dataframe打印一个数据框,其中包含所有 xml 文件名,并在数据框中具有相应的索引
print(dfj[['name', 'item_name']][dfj.item_name.str.contains('.xml', regex=False)].reset_index())

[out]:
   index                                             name                item_name
0     43  /Archives/edgar/data/1736260/000173626020000004  cpia2ndqtr202013fhr.xml
1     44  /Archives/edgar/data/1736260/000173626020000004          primary_doc.xml
2     66    /Archives/edgar/data/51143/000104746917001061        FilingSummary.xml
3     74    /Archives/edgar/data/51143/000104746917001061         ibm-20161231.xml
4     76    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_cal.xml
5     77    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_def.xml
6     78    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_lab.xml
7     79    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_pre.xml
  • create a dataframe with just the xml files and add a column with a full path to those files仅使用 xml 文件创建一个数据框,并添加一个包含这些文件的完整路径的列
xml_files = dfj[dfj.item_name.str.contains('.xml', case=False, regex=False)].copy()

# add a column that creates a full path to the xml files
xml_files['file_path'] = xml_files[['name', 'item_name']].apply(lambda x: f'https://www.sec.gov{x[0]}/{x[1]}', axis=1)

# display(xml_files)
        index                                             name                    parent-dir        last-modified                item_name      type      size                                                                                   file_path
43  directory  /Archives/edgar/data/1736260/000173626020000004  /Archives/edgar/data/1736260  2020-07-24 09:38:30  cpia2ndqtr202013fhr.xml  text.gif     72804  https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml
44  directory  /Archives/edgar/data/1736260/000173626020000004  /Archives/edgar/data/1736260  2020-07-24 09:38:30          primary_doc.xml  text.gif      1931          https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml
66  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36        FilingSummary.xml  text.gif     91940          https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml
74  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36         ibm-20161231.xml  text.gif  11684003           https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml
76  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_cal.xml  text.gif    185502       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml
77  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_def.xml  text.gif    801568       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml
78  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_lab.xml  text.gif   1356108       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml
79  directory    /Archives/edgar/data/51143/000104746917001061    /Archives/edgar/data/51143  2017-02-28 16:23:36     ibm-20161231_pre.xml  text.gif   1314064       https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml

# create a list of just the file paths
path_to_xml_files = xml_files.file_path.tolist()

print(path_to_xml_files)
[out]: 
['https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml',
 'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml',
 'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml']
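Following up on the linked question about converting XML to a dataframe: pandas (1.3+) can parse flat XML directly with read_xml. A minimal sketch on an invented snippet (real SEC 13F XML uses namespaces and needs extra handling):

```python
import io
import pandas as pd

# invented stand-in for a downloaded holdings document
xml = """<holdings>
  <holding><issuer>ACME</issuer><value>100</value></holding>
  <holding><issuer>BETA</issuer><value>250</value></holding>
</holdings>"""

# each child of the root element becomes one dataframe row
df = pd.read_xml(io.StringIO(xml))
```

The same call works with one of the URLs in path_to_xml_files in place of the StringIO buffer.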
