Create a dataframe from a dict where values are variable-length lists

I have a dict whose values are lists, for example:

my_dict = {1: [964725688, 6928857],
           ...

           22: [1667906, 35207807, 685530997, 35207807],
           ...
           }

In this example, the longest list has 4 items, but a list could be longer than that.

I would like to convert it to a dataframe like:

1  964725688
1  6928857
...
22 1667906
22 35207807
22 685530997
22 35207807

import pandas as pd

my_dict = {1: [964725688, 6928857], 22: [1667906, 35207807, 685530997, 35207807]}

# Build one [key, element] row per list element (Python 3: items() and print())
df = pd.DataFrame([[k, ele] for k, v in my_dict.items() for ele in v])

print(df)

    0          1
0   1  964725688
1   1    6928857
2  22    1667906
3  22   35207807
4  22  685530997
5  22   35207807
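
The default column labels 0 and 1 are not very descriptive. Below is a minimal sketch of the same comprehension with named columns; the names `key` and `value` are placeholders chosen here for illustration and are not part of the original answer.

```python
import pandas as pd

my_dict = {1: [964725688, 6928857],
           22: [1667906, 35207807, 685530997, 35207807]}

# One (key, element) row per list element; the column names are illustrative.
rows = [(k, ele) for k, v in my_dict.items() for ele in v]
df = pd.DataFrame(rows, columns=["key", "value"])

print(df)
```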

First Idea
pandas

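import numpy as np
import pandas as pd

s = pd.Series(my_dict)
# Flatten the lists and repeat each key once per element of its list
pd.Series(
    np.concatenate(s.values),
    s.index.repeat(s.str.len())
)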

1     964725688
1       6928857
22      1667906
22     35207807
22    685530997
22     35207807
dtype: int64

Faster!
numpy

values = list(my_dict.values())
lens = [len(value) for value in values]
keys = list(my_dict.keys())
pd.Series(np.concatenate(values), np.repeat(keys, lens))

1     964725688
1       6928857
22      1667906
22     35207807
22    685530997
22     35207807
dtype: int64
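
Whether the numpy variant is actually faster depends on the machine and on the shape of the data, so it may be worth measuring. Below is a rough sketch that times the two variants above with `timeit`; the input `big_dict` and the function names `with_pandas` and `with_numpy` are made up for this comparison.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger input: 10,000 keys, each mapped to a list of 1 to 4 integers.
big_dict = {k: list(range(k % 4 + 1)) for k in range(10_000)}

def with_pandas(d):
    # The "pandas" variant: let a Series compute the list lengths.
    s = pd.Series(d)
    return pd.Series(np.concatenate(s.values), s.index.repeat(s.str.len()))

def with_numpy(d):
    # The "numpy" variant: compute the lengths in plain Python.
    values = list(d.values())
    lens = [len(value) for value in values]
    return pd.Series(np.concatenate(values), np.repeat(list(d.keys()), lens))

# Average over a few calls; absolute numbers will differ between machines.
for func in (with_pandas, with_numpy):
    seconds = timeit.timeit(lambda: func(big_dict), number=10)
    print(f"{func.__name__}: {seconds / 10:.4f} s per call")
```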

Interesting
pd.concat

pd.concat({k: pd.Series(v) for k, v in my_dict.items()}).reset_index(level=1, drop=True)

1     964725688
1       6928857
22      1667906
22     35207807
22    685530997
22     35207807
dtype: int64

Slightly on the functional side, using zip and reduce:

from functools import reduce  # if working with Python3
import pandas as pd


d = {1: [964725688, 6928857], 22: [1667906, 35207807, 685530997, 35207807]}

df = pd.DataFrame(reduce(lambda x,y: x+y, [list(zip([k]*len(v), v)) for k,v in d.items()]))

print(df)

#     0          1
# 0   1  964725688
# 1   1    6928857
# 2  22    1667906
# 3  22   35207807
# 4  22  685530997
# 5  22   35207807

We zip the keys and the values to create records (extended through a reduce operation). The records are then passed to the pd.DataFrame function.
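
Note that `x + y` inside `reduce` copies the accumulated list at every step, which can become slow when there are many keys. Below is a minimal sketch of the same record-building idea using `itertools.chain.from_iterable`; it is offered as an alternative, not as the method used in the answer above.

```python
from itertools import chain

import pandas as pd

d = {1: [964725688, 6928857], 22: [1667906, 35207807, 685530997, 35207807]}

# Pair each key with every element of its list, then flatten the per-key
# groups into a single stream of (key, element) records.
records = chain.from_iterable(zip([k] * len(v), v) for k, v in d.items())
df = pd.DataFrame(list(records))

print(df)
```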

I hope this helps.

# Load the dict directly into a DataFrame without loops
df = pd.DataFrame.from_dict(my_dict, orient='index')

# Unstack, drop NaN, and sort if you need to
df.unstack().dropna().sort_index(level=1)
Out[382]: 
0  1     964725688.0
1  1       6928857.0
0  22      1667906.0
1  22     35207807.0
2  22    685530997.0
3  22     35207807.0
dtype: float64
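
The values come out as float64 because the shorter lists are padded with NaN, which forces a float dtype. Below is a sketch of one way to get back to integers with only the dict key as the index. It assumes the values are integers that floats can represent exactly, which holds for the sample data; the names `wide` and `s` are illustrative.

```python
import pandas as pd

my_dict = {1: [964725688, 6928857],
           22: [1667906, 35207807, 685530997, 35207807]}

# Rows are keys, columns are list positions; shorter lists are padded with NaN.
wide = pd.DataFrame.from_dict(my_dict, orient='index')

# unstack() yields a Series indexed by (position, key). dropna() removes the
# padding, and sort_index(level=1) groups the entries by key again.
s = wide.unstack().dropna().sort_index(level=1)

# Undo the float upcast and drop the position level from the index.
s = s.astype('int64').droplevel(0)

print(s)
```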
