Construct a pandas DataFrame from items in a nested dictionary with lists as inner values

Question

I have a nested dictionary, annot_dict, with this structure:

key = long unique string
value = list of dictionaries

Each dictionary in that list has this structure:

key = long unique string (a subcategory of the upper dictionary's key)
value = list of five string items

An example of the entire structure is:

annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

The ID_string is the same as the first sub-dictionary key. This is the output of a gff3 file parser function I wrote. The real dictionary holds the genes (ID_string) and transcripts (string2, string3, ...) from human chromosome 9, if anyone is familiar with the structure of that file type. The attribute lists describe biotype, start index, end index, strand, and description.

I now want to put this information into a pandas DataFrame. I want to loop through the outermost keys (the ID_strings) in the dict to build one big DataFrame that contains a row for each ID_string, followed by a row for each of its subcategories (string2, string3).

I want it to look like this:

| subunit_ID |  gene_ID  | start_index | end_index | strand |biotype | desc   |
|------------|-----------|-------------|-----------|--------|--------|--------|
|'ID_string' |'ID_string'|  'attr1a'   | 'attr1b'  |'attr1c'|'attr1d'|'attr1e'|
| 'string2'  |'ID_string'|  'attr2a'   | 'attr2b'  |'attr2c'|'attr2d'|'attr2e'|
| 'string3'  |'ID_string'|  'attr3a'   | 'attr3b'  |'attr3c'|'attr3d'|'attr3e'|

I did look at other answers, but none had quite the same dict structure as mine. This is my first question on SO, so please feel free to edit it to make it clearer. Thanks in advance.

Answer 1

You could use a list comprehension to flatten the dicts into lists that include the dict keys as items, then load the result into pandas:

import pandas as pd

annot_dict = {}
annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

# The first sub-dictionary's key is the gene ID shared by every row.
gene_id = list(annot_dict['ID_string'][0].keys())
df = pd.DataFrame(
    [[k] + gene_id + v for i in annot_dict['ID_string'] for k, v in i.items()],
    columns=['subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'],
)

Output:

  subunit_ID    gene_ID start_index end_index  strand biotype    desc
0  ID_string  ID_string      attr1a    attr1b  attr1c  attr1d  attr1e
1    string2  ID_string      attr2a    attr2b  attr2c  attr2d  attr2e
2    string3  ID_string      attr3a    attr3b  attr3c  attr3d  attr3e
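If the one-liner is hard to read, the same flattening can be written as an explicit loop. This is only a sketch: the helper name `flatten_gene` and the `COLUMNS` constant are made up for illustration, and it assumes every inner dict maps one subunit ID to a list of five attributes, as in the question.

```python
import pandas as pd

# Column order taken from the table in the question.
COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']


def flatten_gene(gene_id, records):
    """Turn one gene's list of single-key dicts into flat row lists.

    Hypothetical helper, not part of pandas.
    """
    rows = []
    for record in records:
        for subunit_id, attrs in record.items():
            # Prepend the subunit ID and its parent gene ID to the attributes.
            rows.append([subunit_id, gene_id] + list(attrs))
    return rows


annot_dict = {
    'ID_string': [
        {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
        {'string2': ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
        {'string3': ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
    ]
}

rows = flatten_gene('ID_string', annot_dict['ID_string'])
df = pd.DataFrame(rows, columns=COLUMNS)
print(df)
```

Here the gene ID comes from the outer key instead of the first sub-dictionary, which gives the same value as long as the two match, as the question says they do.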

Answer 2

You could do:

import pandas as pd

# One row per sub-dictionary entry: [subunit ID, gene ID, five attributes].
df = pd.DataFrame(
    (
        [subkey, key] + value
        for key, records in annot_dict.items()
        for record in records
        for subkey, value in record.items()
    ),
    columns=[
        'subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand','biotype', 'desc'
    ]
)

The result for

annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12'  : ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
        {'string13'  : ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22'  : ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
        {'string23'  : ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
    ]
}

is

   subunit_ID     gene_ID start_index end_index   strand  biotype     desc
0  ID_string1  ID_string1     attr11a   attr11b  attr11c  attr11d  attr11e
1    string12  ID_string1     attr12a   attr12b  attr12c  attr12d  attr12e
2    string13  ID_string1     attr13a   attr13b  attr13c  attr13d  attr13e
3  ID_string2  ID_string2     attr21a   attr21b  attr21c  attr21d  attr21e
4    string22  ID_string2     attr22a   attr22b  attr22c  attr22d  attr22e
5    string23  ID_string2     attr23a   attr23b  attr23c  attr23d  attr23e
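Since the question asks for each gene's transcripts to sit "underneath" it, one possible follow-up is to index the flat frame by gene and subunit. This is a sketch of that idea, not part of either answer; it rebuilds the same two-gene frame with a list comprehension and uses only `DataFrame.set_index` and `DataFrame.groupby`. The names `nested` and `counts` are illustrative.

```python
import pandas as pd

annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12': ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
        {'string13': ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22': ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
        {'string23': ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
    ],
}

columns = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']

# Same flattening as in the answer, materialised as a list of rows.
df = pd.DataFrame(
    [
        [subkey, key] + value
        for key, records in annot_dict.items()
        for record in records
        for subkey, value in record.items()
    ],
    columns=columns,
)

# Two-level index: gene first, then the subunits that belong to it.
nested = df.set_index(['gene_ID', 'subunit_ID'])

# Number of rows per gene (the gene's own row is included in the count).
counts = df.groupby('gene_ID').size()

print(nested.loc['ID_string2'])
print(counts)
```

With the two-level index, `nested.loc['ID_string2']` selects the gene row and its transcripts in one step, which may be handier than filtering on the `gene_ID` column.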
