How to extract and count values from a nested JSON?
I'm trying to loop through a list of JSON links and extract some information from the dictionary of dictionaries that each one returns. About 99% of the time, the third layer of each JSON dictionary contains 5 'name' values, 2 of which are xml file names. However, the files do not appear in the same order every time and, in a few cases, there is only one xml file.
I built in a loop to count the number of xml files using a search string before the code proceeds to a second loop. This ensures the xml_dict I'm creating in each loop has the correct number of values (2).
The "pre-counter" works, but really slows down the execution. Is there any way to better incorporate the xml counter to speed up performance? Also, I don't know if I need the 'else: continue's.
Example json link: https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json
json_list = [all_forms['Link'][x] for x in all_forms.index if all_forms['Form Type'][x] == '13F-HR']
link_list = []
lcounter = 0
for json in json_list:
    decode = requests.get(json).json()
    xml_dict = {}
    xml_count = 0
    for dic in decode['directory']['item'][0:]:
        for v in dic.values():
            if ".xml" in v.lower():
                xml_count += 1
            else:
                continue
    for dic in decode['directory']['item'][0:]:
        if "primary_doc.xml" in dic['name'] and xml_count > 1:
            xml_dict['doc_xml'] = json.replace('index.json', '') + dic['name']
        elif ".xml" in dic['name'].lower() and "primary_doc" not in dic['name']:
            xml_dict['hold_xml'] = json.replace('index.json', '') + dic['name']
        else:
            continue
    if xml_dict:
        link_list.append(xml_dict)
    lcounter += 1
    if lcounter % 100 == 0:
        print("Processed {} forms".format(lcounter))
I think it will be easier, and faster, to use pandas with vectorized functions.
Once the xml file count is available, along with the paths to all the .xml files, consider looking at How to convert an XML file to nice pandas dataframe? to automate processing of those files.
import pandas as pd
# list of index.json URLs for the Archives
paths = ['https://www.sec.gov/Archives/edgar/data/1736260/000119312515118890/index.json',
'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/index.json']
# download each json and join them into a single dataframe
# reset the index, so each row has a unique index number
df = pd.concat([pd.read_json(path, orient='index') for path in paths]).reset_index()
# item is a list of dictionaries that can be exploded to separate columns
dfe = df.explode('item').reset_index(drop=True)
# each dictionary now has a separate row
# normalize the dicts, so each key is a column name and each value is in the row
# rename 'name' to 'item_name', this is the column containing file names like .xml
# join this back to the main dataframe and drop the item column
dfj = dfe.join(pd.json_normalize(dfe.item).rename(columns={'name': 'item_name'})).drop(columns=['item'])
# find the rows with .xml in item_name
# groupby name, which is the archive path with CIK and Accession Number
# count the number of xml files
# the pattern is a regex, so the dot is escaped to match a literal '.'
dfg = dfj.item_name[dfj.item_name.str.contains(r'\.xml', case=False)].groupby(dfj.name).count().reset_index().rename(columns={'item_name': 'xml_count'})
# display(dfg)
                                              name  xml_count
0  /Archives/edgar/data/1736260/000173626020000004          2
1    /Archives/edgar/data/51143/000104746917001061          6
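To try the explode / normalize / count steps without downloading anything, here is a minimal sketch on hand-written data. The two sample payloads, their archive paths, and the file names in them are hypothetical stand-ins shaped like an index.json, not real filings.

```python
import pandas as pd

# Hypothetical stand-ins for two decoded index.json payloads (not real filings).
samples = [
    {"directory": {"name": "/Archives/edgar/data/111/0001",
                   "item": [{"name": "primary_doc.xml", "type": "text.gif"},
                            {"name": "holdings.XML", "type": "text.gif"},
                            {"name": "0001-index.html", "type": "text.gif"}]}},
    {"directory": {"name": "/Archives/edgar/data/222/0002",
                   "item": [{"name": "primary_doc.xml", "type": "text.gif"},
                            {"name": "notes.txt", "type": "text.gif"}]}},
]

# One row per filing, then explode so each file gets its own row.
filings = pd.DataFrame([s["directory"] for s in samples])
files = filings.explode("item").reset_index(drop=True)

# Pull the file name out of each item dict into its own column.
files["item_name"] = pd.json_normalize(files["item"].tolist())["name"]

# Keep the rows whose file name contains a literal ".xml" (case-insensitive),
# then count them per archive path.
is_xml = files["item_name"].str.contains(r"\.xml", case=False)
xml_count = files[is_xml].groupby("name")["item_name"].count()
print(xml_count)
```

Filings with no .xml file at all drop out of the grouped result, which is why the output above lists only two of the three paths.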
print(dfj[['name', 'item_name']][dfj.item_name.str.contains(r'\.xml', case=False)].reset_index())
[out]:
   index                                             name                item_name
0     43  /Archives/edgar/data/1736260/000173626020000004  cpia2ndqtr202013fhr.xml
1     44  /Archives/edgar/data/1736260/000173626020000004          primary_doc.xml
2     66    /Archives/edgar/data/51143/000104746917001061        FilingSummary.xml
3     74    /Archives/edgar/data/51143/000104746917001061         ibm-20161231.xml
4     76    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_cal.xml
5     77    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_def.xml
6     78    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_lab.xml
7     79    /Archives/edgar/data/51143/000104746917001061     ibm-20161231_pre.xml
xml_files = dfj[dfj.item_name.str.contains(r'\.xml', case=False)].copy()
# add a column that creates a full path to the xml files
xml_files['file_path'] = 'https://www.sec.gov' + xml_files['name'] + '/' + xml_files['item_name']
# display(xml_files)
index name parent-dir last-modified item_name type size file_path
43 directory /Archives/edgar/data/1736260/000173626020000004 /Archives/edgar/data/1736260 2020-07-24 09:38:30 cpia2ndqtr202013fhr.xml text.gif 72804 https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml
44 directory /Archives/edgar/data/1736260/000173626020000004 /Archives/edgar/data/1736260 2020-07-24 09:38:30 primary_doc.xml text.gif 1931 https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml
66 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 FilingSummary.xml text.gif 91940 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml
74 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231.xml text.gif 11684003 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml
76 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_cal.xml text.gif 185502 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml
77 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_def.xml text.gif 801568 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml
78 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_lab.xml text.gif 1356108 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml
79 directory /Archives/edgar/data/51143/000104746917001061 /Archives/edgar/data/51143 2017-02-28 16:23:36 ibm-20161231_pre.xml text.gif 1314064 https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml
# create a list of just the file paths
path_to_xml_files = xml_files.file_path.tolist()
print(path_to_xml_files)
[out]:
['https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/cpia2ndqtr202013fhr.xml',
'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/primary_doc.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/FilingSummary.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_cal.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_def.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_lab.xml',
'https://www.sec.gov/Archives/edgar/data/51143/000104746917001061/ibm-20161231_pre.xml']
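If you would rather keep the original loop, the counting pass and the collecting pass can probably be merged: gather the .xml names once, then branch on how many were found. Below is a minimal sketch in plain Python, assuming each index.json has the directory/item layout of the example link. The helper name collect_xml_links and the hand-written sample dictionary are hypothetical, for illustration only. The 'else: continue' branches are not needed, because a for loop moves on to the next item on its own. The download in requests.get likely dominates the runtime, so the gain from merging the loops may be small.

```python
def collect_xml_links(index_url, decoded):
    """Build the xml_dict for one filing with a single pass over the items."""
    base = index_url.replace('index.json', '')
    # Collect every file name containing ".xml" once; len(names) is the xml count.
    names = [item['name'] for item in decoded['directory']['item']
             if '.xml' in item['name'].lower()]
    xml_dict = {}
    for name in names:
        if 'primary_doc.xml' in name:
            # Same rule as the original: keep primary_doc only if another xml exists.
            if len(names) > 1:
                xml_dict['doc_xml'] = base + name
        elif 'primary_doc' not in name:
            xml_dict['hold_xml'] = base + name
    return xml_dict


# Hand-written sample shaped like the example filing (the .html name is made up).
url = 'https://www.sec.gov/Archives/edgar/data/1736260/000173626020000004/index.json'
sample = {'directory': {'item': [
    {'name': 'some-index.html'},
    {'name': 'cpia2ndqtr202013fhr.xml'},
    {'name': 'primary_doc.xml'},
]}}
links = collect_xml_links(url, sample)
print(links)
```

In the original loop this would replace both inner for loops: call it with the link and the decoded JSON, then append the result to link_list when it is not empty.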