Convert nested dictionary of lists into pandas dataframe efficiently

I have a JSON object like this:

{
   "hits": {
      "hits": [
         {
            "_source": {
               "TYPES": [
                  {
                     "_ID": 130,
                     "_NM": "ARB-130"
                  },
                  {
                     "_ID": 131,
                     "_NM": "ARB-131"
                  },
                  {
                     "_ID": 132,
                     "_NM": "ARB-132"
                  }
               ]
            }
         },
         {
            "_source": {
               "TYPES": [
                  {
                     "_ID": 902,
                     "_NM": "ARB-902"
                  },
                  {
                     "_ID": 903,
                     "_NM": "ARB-903"
                  },
                  {
                     "_ID": 904,
                     "_NM": "ARB-904"
                  }
               ]
            }
         }
      ]
   }
}

I need to unpack it into a pandas DataFrame so that I get all the unique _ID and _NM pairs under the TYPES lists:

           _ID          _NM
0          130          ARB-130
1          131          ARB-131
2          132          ARB-132
3          902          ARB-902
4          903          ARB-903
5          904          ARB-904

I am looking for the fastest possible solution, since the number of types and the number of pairs within each type can be in the hundreds of thousands. My current unpacking with pd.Series and apply is slow, and I would like to avoid it if possible. Any ideas would be appreciated. I am also interested in ways to explode dictionaries or lists in a column into separate columns without using pd.Series, as I run into this use case regularly.

One way is to restructure your dictionary and flatten it using itertools.chain.

For performance, you should benchmark with your data.

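import pandas as pd
from itertools import chain

# d is the parsed JSON object (a dict) shown in the question.
# Chain every TYPES list into one flat list of {_ID, _NM} records.
res = list(chain.from_iterable(i['_source']['TYPES'] for i in d['hits']['hits']))

df = pd.DataFrame(res)

print(df)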

   _ID      _NM
0  130  ARB-130
1  131  ARB-131
2  132  ARB-132
3  902  ARB-902
4  903  ARB-903
5  904  ARB-904
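
The question asks for unique pairs and also mentions expanding a column of dictionaries without pd.Series. Neither is covered by the snippet above, so here is a minimal sketch of both, not a benchmarked solution. The sample data (including the repeated 131 pair) and the names `records`, `unique_df`, `nested` and `expanded` are assumptions made up for illustration.

```python
from itertools import chain

import pandas as pd

# Hypothetical data shaped like the question's JSON, with one repeated pair.
d = {
    "hits": {
        "hits": [
            {"_source": {"TYPES": [{"_ID": 130, "_NM": "ARB-130"},
                                   {"_ID": 131, "_NM": "ARB-131"}]}},
            {"_source": {"TYPES": [{"_ID": 131, "_NM": "ARB-131"},
                                   {"_ID": 902, "_NM": "ARB-902"}]}},
        ]
    }
}

# Flatten every TYPES list into one list of records, as in the answer.
records = list(chain.from_iterable(h["_source"]["TYPES"] for h in d["hits"]["hits"]))

# Keep the first occurrence of each (_ID, _NM) pair and renumber the rows.
unique_df = pd.DataFrame(records).drop_duplicates().reset_index(drop=True)
print(unique_df)

# Side question: expand a column of dicts into separate columns by building
# a DataFrame from the list of dicts instead of calling apply(pd.Series).
nested = pd.DataFrame({"key": ["a", "b"],
                       "info": [{"x": 1, "y": 2}, {"x": 3, "y": 4}]})
expanded = nested.drop(columns="info").join(
    pd.DataFrame(nested["info"].tolist(), index=nested.index)
)
print(expanded)
```

Whether this is fast enough at hundreds of thousands of rows depends on the data, so it is worth timing against the real input.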
