
How to flatten a pandas column of nested dicts, into separate columns for each key

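I have a CSV with over 500 rows in which one column, "_source", is stored as JSON. I want to extract that into a pandas dataframe. I need each key to be its own column.

I have a 1 MB JSON file of online social media data, and I need to turn the dictionaries and their key/value pairs into separate columns. The social media data comes from Facebook, Twitter, web scraping, etc.

There are approximately 528 separate rows of posts/tweets/text, and each row has many dictionaries nested inside dictionaries.

I've attached a few steps from my Jupyter notebook below to give a more complete picture. I need all the key/value pairs from all the nested dictionaries converted into columns of a dataframe.

I tried to turn it into a dataframe by doing this: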

source = pd.DataFrame.from_dict(source, orient='columns')

It returned something like this... I thought it would unpack the dictionaries, but it did not.

source.head()

_source
0   {'sub_organization_id': 'default', 'uid': 'aba...
1   {'sub_organization_id': 'default', 'uid': 'ab0...
2   {'sub_organization_id': 'default', 'uid': 'ac0...

Below is the shape:

source.shape
(528, 1)

Below is a sample row of "_source". There are many dicts and key:value pairs, and each key needs to be its own column.

{
    'sub_organization_id': 'default',
    'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
    'project_veid': 'default',
    'campaign_id': 'default',
    'organization_id': 'default',
    'meta': {
        'rule_matcher': [{
                'atribs': {
                    'website': 'github.com/res',
                    'source': 'Explicit',
                    'version': '1.1',
                    'type': 'crawl'
                },
                'results': [{
                        'rule_type': 'hashtag',
                        'rule_tag': 'Far',
                        'description': None,
                        'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
                        'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
                        'value': '#Far',
                        'organization_id': None,
                        'sub_organization_id': None,
                        'appid': 'ray',
                        'project_id': 'CDE2F42-5B87-C594-C900E578C',
                        'rule_id': '1838',
                        'node_id': None,
                        'metadata': {
                            'campaign_title': 'AF',
                            'project_title': 'AF '
                        }
                    }
                ]
            }
        ],
        'render': [{
                'attribs': {
                    'website': 'github.com/res',
                    'version': '1.0',
                    'type': 'Page Render'
                },
                'results': [{
                        'render_status': 'success',
                        'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
                        'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
                        'url': 'https://discooprdapp.com/',
                        'load_time': 32
                    }
                ]
            }
        ]
    },
    'norm_attribs': {
        'website': 'github.com/res',
        'version': '1.1',
        'type': 'crawl'
    },
    'project_id': 'default',
    'system_timestamp': '2019-02-22T19:04:53.569623',
    'doc': {
        'appid': 'subtter',
        'links': [],
        'response_url': 'https://discooprdapp.com',
        'url': 'https://discooprdapp.com/',
        'status_code': 200,
        'status_msg': 'OK',
        'encoding': 'utf-8',
        'attrs': {
            'uid': '2ab8f2651cb32261b911c990a8b'
        },
        'timestamp': '2019-02-22T19:04:53.963',
        'crawlid': '7fd95-785-4dd259-fcc-8752f'
    },
    'type': 'crawl',
    'norm': {
        'body': '\n',
        'domain': 'discordapp.com',
        'author': 'crawl',
        'url': 'https://discooprdapp.com',
        'timestamp': '2019-02-22T19:04:53.961283+00:00',
        'id': '7fc5-685-4dd9-cc-8762f'
    }
}

_source to list

  • Given the sample data in the question
    • Create a list of all rows in _source

_source_list = df._source.tolist()

Use recursion to flatten the nested dicts

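def flatten_json(nested_json: dict, exclude: list = None) -> dict:
    """
    Flatten one nested dict into a single-level dict.
    Nested keys and list indices are joined with '_'; keys in `exclude` are skipped.
    """
    # avoid a mutable default argument; the default still skips the empty key ''
    exclude = [''] if exclude is None else exclude
    out = dict()

    def flatten(x, name: str = ''):
        if isinstance(x, dict):
            for a in x:
                if a not in exclude:
                    flatten(x[a], f'{name}{a}_')
        elif isinstance(x, list):
            for i, a in enumerate(x):
                flatten(a, f'{name}{i}_')
        else:
            # leaf value: drop the trailing '_' from the accumulated name
            out[name[:-1]] = x

    flatten(nested_json)
    return out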
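To see how the column names are built and what exclude does, the function can be tried on a small sample first. This is only a sketch: the two rows below are hypothetical and far smaller than the real data, and the function is a lightly tidied copy of the one above, repeated so that the snippet runs on its own.

```python
import pandas as pd


def flatten_json(nested_json, exclude=None):
    """Flatten one nested dict; keys listed in `exclude` are skipped at every level."""
    exclude = [''] if exclude is None else exclude
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for key, value in x.items():
                if key not in exclude:
                    flatten(value, f'{name}{key}_')
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, f'{name}{i}_')
        else:
            out[name[:-1]] = x  # leaf value: strip the trailing '_'

    flatten(nested_json)
    return out


# Hypothetical two-row sample with the same kind of nesting as the question.
rows = [
    {"uid": "a1", "doc": {"status_code": 200, "links": []},
     "meta": {"render": [{"results": [{"load_time": 32}]}]}},
    {"uid": "b2", "doc": {"status_code": 404, "links": ["x"]},
     "meta": {"render": []}},
]

# One flat dict per row; pandas takes the union of the keys as columns.
df_flat = pd.DataFrame([flatten_json(r) for r in rows])
print(df_flat.columns.tolist())

# Skipping the whole 'meta' subtree.
df_no_meta = pd.DataFrame([flatten_json(r, exclude=["meta"]) for r in rows])
print(df_no_meta.columns.tolist())
```

Two things are worth noticing. An empty list or dict yields no column at all, and a key that is present in only some rows is filled with NaN in the others. Because list positions become part of the column name, rows with long lists can produce a large number of sparse columns.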

Using flatten_json

df_source = pd.DataFrame([flatten_json(x) for x in _source_list])
  • In this case, the final result will be a dataframe with 52 columns
sub_organization_id                                             uid project_veid campaign_id organization_id meta_rule_matcher_0_atribs_website meta_rule_matcher_0_atribs_source meta_rule_matcher_0_atribs_version meta_rule_matcher_0_atribs_type meta_rule_matcher_0_results_0_rule_type meta_rule_matcher_0_results_0_rule_tag meta_rule_matcher_0_results_0_description meta_rule_matcher_0_results_0_project_veid meta_rule_matcher_0_results_0_campaign_id meta_rule_matcher_0_results_0_value meta_rule_matcher_0_results_0_organization_id meta_rule_matcher_0_results_0_sub_organization_id meta_rule_matcher_0_results_0_appid meta_rule_matcher_0_results_0_project_id meta_rule_matcher_0_results_0_rule_id meta_rule_matcher_0_results_0_node_id meta_rule_matcher_0_results_0_metadata_campaign_title meta_rule_matcher_0_results_0_metadata_project_title meta_render_0_attribs_website meta_render_0_attribs_version meta_render_0_attribs_type meta_render_0_results_0_render_status                                                             meta_render_0_results_0_path meta_render_0_results_0_image_hash meta_render_0_results_0_url  meta_render_0_results_0_load_time norm_attribs_website norm_attribs_version norm_attribs_type project_id            system_timestamp doc_appid          doc_response_url                    doc_url  doc_status_code doc_status_msg doc_encoding                doc_attrs_uid            doc_timestamp                 doc_crawlid   type norm_body     norm_domain norm_author                  norm_url                    norm_timestamp                 norm_id
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
            default  ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b      default     default         default                     github.com/res                          Explicit                                1.1                           crawl                                 hashtag                                    Far                                      None               A7180EA-7078-0C7F-ED5D-86AD7              2A6DA0C-365BB-67DD-B05830920                                #Far                                          None                                              None                                 ray              CDE2F42-5B87-C594-C900E578C                                  1838                                  None                                                    AF                                                  AF                 github.com/res                           1.0                Page Render                               success  https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg    bb7674b8ea3fc05bfd027a19815f82c   https://discooprdapp.com/                                 32       github.com/res                  1.1             crawl    default  2019-02-22T19:04:53.569623   subtter  https://discooprdapp.com  https://discooprdapp.com/              200             OK        utf-8  2ab8f2651cb32261b911c990a8b  2019-02-22T19:04:53.963  7fd95-785-4dd259-fcc-8752f  crawl        \n  discordapp.com       crawl  https://discooprdapp.com  2019-02-22T19:04:53.961283+00:00  7fc5-685-4dd9-cc-8762f
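# pd.json_normalize is the top-level name since pandas 1.0; pd.io.json.json_normalize is deprecated
pd.json_normalize(source.columnName.apply(json.loads).tolist())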
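For comparison, here is a small sketch of what json_normalize does with data nested like the sample row. The record below is hypothetical and already parsed into a dict, and sep="_" is chosen only so that the names resemble the flatten_json output. Nested dicts are expanded into columns, but a list of dicts is left as a single object column rather than being expanded by index.

```python
import pandas as pd

# Hypothetical record, already parsed into a dict.
records = [
    {"uid": "a1",
     "norm": {"domain": "example.com", "author": "crawl"},
     "meta": {"render": [{"results": [{"load_time": 32}]}]}},
]

# Nested dicts become 'parent_child' columns because of sep="_".
df_norm = pd.json_normalize(records, sep="_")
print(df_norm.columns.tolist())

# The list under meta -> render is kept as one cell holding the original list.
print(type(df_norm.loc[0, "meta_render"]))
```

So if every key inside the lists also needs its own column, as in the question, json_normalize alone may not be enough and a recursive approach such as flatten_json is likely the simpler route.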
