Construct a pandas DataFrame from items in a nested dictionary with lists as inner values
I have a nested dictionary annot_dict whose values are lists of dictionaries, each mapping a string key to a list of attributes. An example of the entire structure is:
annot_dict['ID_string'] = [
{'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
{'string2' : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
{'string3' : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
]
The ID_string is the same as the first sub-dictionary key. This is the output of a gff3 file parser function I wrote, and the real dictionary holds the genes (ID_string) and transcripts (string2, string3, ...) from human chromosome 9, if anyone is familiar with the structure of that file type. The attribute lists describe biotype, start index, end index, strand, and description.
I want to put this information into a pandas DataFrame now. I want to loop through the outermost keys (the ID_strings) in the dict to make one big DataFrame containing a row for each ID_string and, underneath it, a row for each of its subcategories (string2, string3).
I want it to look like this:
| subunit_ID | gene_ID | start_index | end_index | strand |biotype | desc |
|------------|-----------|-------------|-----------|--------|--------|--------|
|'ID_string' |'ID_string'| 'attr1a' | 'attr1b' |'attr1c'|'attr1d'|'attr1e'|
| 'string2' |'ID_string'| 'attr2a' | 'attr2b' |'attr2c'|'attr2d'|'attr2e'|
| 'string3' |'ID_string'| 'attr3a' | 'attr3b' |'attr3c'|'attr3d'|'attr3e'|
I did look at other answers, but none had quite the same dict structure as mine. This is my first question on SO, so please feel free to improve the understandability of my question. Thanks in advance.
You could use a list comprehension to flatten the dicts into lists that include the dict keys as items, then load the result into pandas:
import pandas as pd
annot_dict = {}
annot_dict['ID_string'] = [
{'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
{'string2' : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
{'string3' : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
]
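# The gene ID is the key of the first record in the list.
gene_id = list(annot_dict['ID_string'][0].keys())
# Each row: subunit key, gene key, then the five attributes.
df = pd.DataFrame(
    [[k] + gene_id + v for i in annot_dict['ID_string'] for k, v in i.items()],
    columns=['subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'],
)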
output:
| | subunit_ID | gene_ID | start_index | end_index | strand | biotype | desc |
|---|---|---|---|---|---|---|---|
| 0 | ID_string | ID_string | attr1a | attr1b | attr1c | attr1d | attr1e |
| 1 | string2 | ID_string | attr2a | attr2b | attr2c | attr2d | attr2e |
| 2 | string3 | ID_string | attr3a | attr3b | attr3c | attr3d | attr3e |
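The comprehension above reads only annot_dict['ID_string']. Since the question asks about looping through the outermost keys, here is a minimal sketch of the same flattening written as an explicit loop over every outer key. The helper name build_rows, the COLUMNS constant, and the length check are illustrative assumptions, not part of the original code; the check assumes every attribute list has exactly five entries.

```python
import pandas as pd

COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']


def build_rows(annot_dict):
    """Flatten {gene_ID: [{subunit_ID: [attrs]}, ...]} into a list of rows.

    Hypothetical helper, written for illustration only.
    """
    rows = []
    for gene_id, records in annot_dict.items():
        for record in records:
            for subunit_id, attrs in record.items():
                # Assumption: each attribute list has one value per
                # remaining column (five in this example).
                if len(attrs) != len(COLUMNS) - 2:
                    raise ValueError(
                        f"{subunit_id}: expected {len(COLUMNS) - 2} "
                        f"attributes, got {len(attrs)}"
                    )
                rows.append([subunit_id, gene_id, *attrs])
    return rows


annot_dict = {
    'ID_string': [
        {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
        {'string2': ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
        {'string3': ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
    ],
}

df = pd.DataFrame(build_rows(annot_dict), columns=COLUMNS)
print(df)
```

The explicit loop is longer than the one-liner, but it takes the gene ID from the outer key rather than from the first record, and a malformed record raises an error that names the offending subunit.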
You could do:
import pandas as pd

# One row per (gene, subunit) pair: subunit key, gene key, then the attribute list.
df = pd.DataFrame(
(
[subkey, key] + value
for key, records in annot_dict.items()
for record in records
for subkey, value in record.items()
),
columns=[
'subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand','biotype', 'desc'
]
)
Result for
annot_dict = {
'ID_string1': [
{'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
{'string12' : ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
{'string13' : ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
],
'ID_string2': [
{'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
{'string22' : ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
{'string23' : ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
]
}
is
subunit_ID gene_ID start_index end_index strand biotype desc
0 ID_string1 ID_string1 attr11a attr11b attr11c attr11d attr11e
1 string12 ID_string1 attr12a attr12b attr12c attr12d attr12e
2 string13 ID_string1 attr13a attr13b attr13c attr13d attr13e
3 ID_string2 ID_string2 attr21a attr21b attr21c attr21d attr21e
4 string22 ID_string2 attr22a attr22b attr22c attr22d attr22e
5 string23 ID_string2 attr23a attr23b attr23c attr23d attr23e
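The question asks for each gene's subcategories to appear underneath it. One possible way to make that grouping explicit, sketched here as an assumption rather than something the original poster asked for, is to index the flat frame by gene_ID and then subunit_ID. The small frame below is a stand-in with the same placeholder values as the result above, trimmed to two rows per gene; the variable name nested is illustrative.

```python
import pandas as pd

columns = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']

# Stand-in for the flat DataFrame built above (placeholder values).
df = pd.DataFrame(
    [
        ['ID_string1', 'ID_string1', 'attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e'],
        ['string12', 'ID_string1', 'attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e'],
        ['ID_string2', 'ID_string2', 'attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e'],
        ['string22', 'ID_string2', 'attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e'],
    ],
    columns=columns,
)

# Index by gene first, then subunit, so each gene's rows are grouped together.
nested = df.set_index(['gene_ID', 'subunit_ID'])

# Select the gene row and its transcript rows for a single gene.
print(nested.loc['ID_string1'])
```

Selecting with nested.loc on a gene ID returns that gene's rows indexed by subunit_ID, which can be convenient when working through one gene at a time.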