
Construct a pandas DataFrame from items in a nested dictionary with lists as inner values

I have a nested dictionary annot_dict with structure:

  • key = long unique string
  • value = list of dictionaries

Each dictionary in those lists has the structure:

  • key = long unique string (a subcategory of the upper dictionary's key)
  • value = list of five string items

An example of the entire structure is:

annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

The ID_string is the same as the first sub-dictionary key. This is the output of a gff3 file parser function I wrote, and the real dictionary information is the genes (ID_string) and transcripts (string2, string3, ...) from the genome of human chromosome 9, if anyone is familiar with the structure of that file type. The attribute lists describe biotype, start index, end index, strand, and description.

I want to put this information into a pandas DataFrame now. I want to loop through the outermost keys (the ID_string values) in the dict to make one big DataFrame containing a row for each ID_string and, underneath it, a row for each of its subcategories (string2, string3).

I want it to look like this:

| subunit_ID |  gene_ID  | start_index | end_index | strand |biotype | desc   |
|------------|-----------|-------------|-----------|--------|--------|--------|
|'ID_string' |'ID_string'|  'attr1a'   | 'attr1b'  |'attr1c'|'attr1d'|'attr1e'|
| 'string2'  |'ID_string'|  'attr2a'   | 'attr2b'  |'attr2c'|'attr2d'|'attr2e'|
| 'string3'  |'ID_string'|  'attr3a'   | 'attr3b'  |'attr3c'|'attr3d'|'attr3e'|

I did look at other answers but none had quite the same dict structure as mine. This is my first question on SO so please feel free to improve the understandability of my question. Thanks in advance.

You could use a list comprehension to flatten the dicts into lists that include the dict keys as items, then load the result into pandas:

import pandas as pd

annot_dict = {}
annot_dict['ID_string'] = [
     {'ID_string': ['attr1a', 'attr1b', 'attr1c', 'attr1d', 'attr1e']},
     {'string2'  : ['attr2a', 'attr2b', 'attr2c', 'attr2d', 'attr2e']},
     {'string3'  : ['attr3a', 'attr3b', 'attr3c', 'attr3d', 'attr3e']},
             ]

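gene_records = annot_dict['ID_string']
# The key of the first sub-dictionary is the gene ID (kept as a one-item list)
gene_id = list(gene_records[0].keys())

# One row per sub-dictionary: [subunit key, gene ID, five attributes]
df = pd.DataFrame(
    [[k] + gene_id + v for i in gene_records for k, v in i.items()],
    columns=['subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand', 'biotype', 'desc'],
)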

Output:

  subunit_ID    gene_ID start_index end_index  strand biotype    desc
0  ID_string  ID_string      attr1a    attr1b  attr1c  attr1d  attr1e
1    string2  ID_string      attr2a    attr2b  attr2c  attr2d  attr2e
2    string3  ID_string      attr3a    attr3b  attr3c  attr3d  attr3e
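
The comprehension above hardcodes the 'ID_string' key, so it covers a single gene. Below is a minimal sketch of one way to apply the same flattening to every outer key, assuming the dictionary layout from the question. The helper name gene_to_frame, the COLUMNS constant, and the sample IDs and values are made up for illustration.

```python
import pandas as pd

COLUMNS = ['subunit_ID', 'gene_ID', 'start_index', 'end_index',
           'strand', 'biotype', 'desc']


def gene_to_frame(gene_id, records):
    """Flatten one gene's list of single-key dicts into a DataFrame (hypothetical helper)."""
    rows = [
        [subunit_id, gene_id] + attrs
        for record in records
        for subunit_id, attrs in record.items()
    ]
    return pd.DataFrame(rows, columns=COLUMNS)


# Made-up sample data with the same shape as annot_dict in the question
annot_dict = {
    'geneA': [
        {'geneA': ['a1', 'a2', 'a3', 'a4', 'a5']},
        {'txA1': ['b1', 'b2', 'b3', 'b4', 'b5']},
    ],
    'geneB': [
        {'geneB': ['c1', 'c2', 'c3', 'c4', 'c5']},
    ],
}

# Build one frame per gene, then stack them with a fresh 0..n-1 index
frames = [gene_to_frame(gene_id, records) for gene_id, records in annot_dict.items()]
df = pd.concat(frames, ignore_index=True)
print(df)
```

Building one small frame per gene is convenient if you want to inspect genes individually, but for a whole chromosome a single flat list of rows passed to pd.DataFrame once is likely to be faster.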

You could do:

df =  pd.DataFrame(
    (
        [subkey, key] + value
        for key, records in annot_dict.items()
        for record in records
        for subkey, value in record.items()
    ),
    columns=[
        'subunit_ID', 'gene_ID', 'start_index', 'end_index', 'strand','biotype', 'desc'
    ]
)

Result for

annot_dict = {
    'ID_string1': [
        {'ID_string1': ['attr11a', 'attr11b', 'attr11c', 'attr11d', 'attr11e']},
        {'string12'  : ['attr12a', 'attr12b', 'attr12c', 'attr12d', 'attr12e']},
        {'string13'  : ['attr13a', 'attr13b', 'attr13c', 'attr13d', 'attr13e']},
    ],
    'ID_string2': [
        {'ID_string2': ['attr21a', 'attr21b', 'attr21c', 'attr21d', 'attr21e']},
        {'string22'  : ['attr22a', 'attr22b', 'attr22c', 'attr22d', 'attr22e']},
        {'string23'  : ['attr23a', 'attr23b', 'attr23c', 'attr23d', 'attr23e']},
    ]
}

is

   subunit_ID     gene_ID start_index end_index   strand  biotype     desc
0  ID_string1  ID_string1     attr11a   attr11b  attr11c  attr11d  attr11e
1    string12  ID_string1     attr12a   attr12b  attr12c  attr12d  attr12e
2    string13  ID_string1     attr13a   attr13b  attr13c  attr13d  attr13e
3  ID_string2  ID_string2     attr21a   attr21b  attr21c  attr21d  attr21e
4    string22  ID_string2     attr22a   attr22b  attr22c  attr22d  attr22e
5    string23  ID_string2     attr23a   attr23b  attr23c  attr23d  attr23e
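
One thing worth checking against the real data: the question's prose lists the attributes as biotype, start index, end index, strand, description, while the desired table puts start_index first and biotype fourth. Column names are assigned by position, so if the lists really follow the prose order, the labels above would be shifted. The following is a sketch under that assumption, with made-up values: it names the columns in list order, reorders them to match the desired table, and converts the coordinates to integers.

```python
import pandas as pd

# Hypothetical data; attribute order is ASSUMED to be
# biotype, start, end, strand, description (as in the question's prose).
annot_dict = {
    'gene1': [
        {'gene1': ['protein_coding', '100', '900', '+', 'example gene']},
        {'tx1': ['mRNA', '100', '500', '+', 'example transcript']},
    ],
}

# Column names in the order the values appear inside each list
list_order = ['biotype', 'start_index', 'end_index', 'strand', 'desc']

rows = [
    [subkey, key] + value
    for key, records in annot_dict.items()
    for record in records
    for subkey, value in record.items()
]
df = pd.DataFrame(rows, columns=['subunit_ID', 'gene_ID'] + list_order)

# Coordinates arrive as strings; convert them so they sort and compare numerically
coord_cols = ['start_index', 'end_index']
df[coord_cols] = df[coord_cols].apply(pd.to_numeric)

# Reorder the columns to match the layout requested in the question
df = df[['subunit_ID', 'gene_ID', 'start_index', 'end_index',
         'strand', 'biotype', 'desc']]
print(df)
```

If the lists are actually stored in the table's order, the column list from the answer above is already correct and only the numeric conversion applies.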
